* [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7
@ 2013-09-10  9:31 ` Mel Gorman
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

It has been a long time since V6 of this series and time for an update. Much
of this is now stabilised with the most important addition being the inclusion
of Peter and Rik's work on grouping tasks that share pages together.

This series has a number of goals. It reduces overhead of automatic balancing
through scan rate reduction and the avoidance of TLB flushes. It selects a
preferred node and moves tasks towards their memory as well as moving memory
toward their task. It handles shared pages and groups related tasks together.

Changelog since V6
o Group tasks that share pages together
o More scan avoidance of VMAs mapping pages that are not likely to migrate
o cpupid conversion, system-wide searching of tasks to balance with
o Various TLB flush optimisations
o Comment updates
o Sanitise task_numa_fault callsites for consistent semantics
o Revert some of the scanning adaptation stuff
o Revert patch that defers scanning until task schedules on another node
o Start delayed scanning properly
o Avoid the same task always performing the PTE scan
o Continue PTE scanning even if migration is rate limited

Changelog since V5
o Add __GFP_NOWARN for numa hinting fault count
o Use is_huge_zero_page
o Favour moving tasks towards nodes with higher faults
o Optionally resist moving tasks towards nodes with lower faults
o Scan shared THP pages

Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality

Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
  easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
  instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
  preferred node
o Laughably basic accounting of a compute overloaded node when selecting
  the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

There are still gaps between this series and manual binding but it's still
an important series of steps in the right direction and the size of the
series is getting unwieldy. As before, the intention is not to complete
the work but to incrementally improve mainline and preserve bisectability
for any bug reports that crop up.

Patch 1 is a monolithic dump of patches that are destined for upstream and
	that this series indirectly depends upon.

Patches 2-3 add sysctl documentation and comment fixlets

Patch 4 avoids accounting for a hinting fault if another thread handled the
	fault in parallel

Patches 5-6 avoid races with parallel THP migration and THP splits.

Patch 7 corrects a THP NUMA hint fault accounting bug

Patch 8 sanitizes task_numa_fault callsites to have consistent semantics and
	always record the fault based on the correct location of the page.

Patch 9 avoids trying to migrate the THP zero page

Patch 10 avoids the same task being selected to perform the PTE scan within
	a shared address space.

Patch 11 continues PTE scanning even if migration is rate limited

Patch 12 notes that delaying the PTE scan until a task is scheduled on an
	alternative node misses the case where the task is only accessing
	shared memory on a partially loaded machine and reverts a patch.

Patches 13,15 initialise numa_next_scan properly so that PTE scanning is delayed
	when a process starts.

Patch 14 sets the scan rate proportional to the size of the task being
	scanned.
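
(For illustration only, and not code from the series: a minimal userspace
sketch of the scaling idea, where the scan period grows with the task's
mapped address space so the per-page sampling rate is roughly independent
of task size. The constants and names such as BASE_PERIOD_MS and
REFERENCE_PAGES are invented for the example.)

  /* Illustrative sketch, not kernel code. */
  #include <stdio.h>

  #define BASE_PERIOD_MS     1000UL   /* period for a "reference sized" task */
  #define REFERENCE_PAGES   65536UL   /* ~256MB of 4K pages */
  #define MAX_PERIOD_MS     60000UL

  static unsigned long scan_period_ms(unsigned long task_pages)
  {
      unsigned long windows = task_pages / REFERENCE_PAGES;
      unsigned long period;

      if (windows < 1)
          windows = 1;
      period = windows * BASE_PERIOD_MS;  /* bigger task -> longer period */
      if (period > MAX_PERIOD_MS)
          period = MAX_PERIOD_MS;
      return period;
  }

  int main(void)
  {
      unsigned long sizes[] = { 16384UL, 262144UL, 4194304UL };  /* pages */

      for (int i = 0; i < 3; i++)
          printf("%lu pages -> scan every %lu ms\n",
                 sizes[i], scan_period_ms(sizes[i]));
      return 0;
  }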

Patches 16-17 avoid TLB flushes during the PTE scan if no updates are made

Patch 18 slows the scan rate if no hinting faults were trapped by an idle task.
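
(Again purely illustrative and not from the series: a sketch of the back-off
idea where a scan window that traps no hinting faults lengthens the next
scan period. The doubling factor and the cap are arbitrary choices.)

  /* Illustrative sketch of scan-rate back-off, not kernel code. */
  #include <stdio.h>

  #define SCAN_PERIOD_MAX_MS 60000UL

  static unsigned long next_scan_period(unsigned long period_ms,
                                        unsigned long faults_this_window)
  {
      if (faults_this_window == 0) {
          period_ms *= 2;                     /* back off when nothing was trapped */
          if (period_ms > SCAN_PERIOD_MAX_MS)
              period_ms = SCAN_PERIOD_MAX_MS;
      }
      return period_ms;
  }

  int main(void)
  {
      unsigned long period = 1000;
      /* Simulate three quiet windows followed by one that trapped faults. */
      unsigned long faults[] = { 0, 0, 0, 37 };

      for (int i = 0; i < 4; i++) {
          period = next_scan_period(period, faults[i]);
          printf("window %d: next period %lu ms\n", i, period);
      }
      return 0;
  }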

Patch 19 tracks NUMA hinting faults per-task and per-node

Patches 20-24 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node. When the preferred node is first selected, the task
	is rescheduled onto it if it is not running there already. This
	avoids waiting for the scheduler to move the task slowly.
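
(Illustrative sketch, not the kernel implementation: conceptually the
preferred node is simply the node that accumulated the most hinting faults
during the window. The array and function names below are invented.)

  /* Toy illustration of preferred node selection. */
  #include <stdio.h>

  #define NR_NODES 4

  static int pick_preferred_node(const unsigned long faults[NR_NODES])
  {
      int best = 0;

      for (int nid = 1; nid < NR_NODES; nid++)
          if (faults[nid] > faults[best])
              best = nid;
      return best;
  }

  int main(void)
  {
      /* Hypothetical per-node fault counts gathered during one scan window. */
      unsigned long faults[NR_NODES] = { 120, 480, 75, 310 };
      int preferred = pick_preferred_node(faults);

      printf("preferred node: %d\n", preferred);
      /* The balancer would then bias CPU selection towards 'preferred'
       * and reschedule the task there if it is running elsewhere. */
      return 0;
  }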

Patch 25 adds infrastructure to allow separate tracking of shared/private
	pages but treats all faults as if they are private accesses. Laying
	it out this way reduces churn later in the series when private
	fault detection is introduced

Patch 26 avoids some unnecessary allocation

Patches 27-28 kick away some training wheels and scan shared pages and
	small VMAs.

Patch 29 introduces private fault detection based on the PID of the faulting
	process and accounts for shared/private accesses differently.
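
(Illustrative sketch of the classification idea, with invented names:
remember the pid that last faulted on a page and treat a repeat fault by
the same pid as a private access, anything else as shared.)

  /* Toy model of pid-based private/shared fault classification. */
  #include <stdio.h>
  #include <stdbool.h>

  struct fake_page {
      int last_pid;   /* pid of the last task that faulted on this page */
  };

  static bool fault_is_private(struct fake_page *page, int current_pid)
  {
      bool priv = (page->last_pid == current_pid);

      page->last_pid = current_pid;  /* record for the next fault */
      return priv;
  }

  int main(void)
  {
      struct fake_page page = { .last_pid = -1 };

      printf("pid 100: %s\n", fault_is_private(&page, 100) ? "private" : "shared");
      printf("pid 100: %s\n", fault_is_private(&page, 100) ? "private" : "shared");
      printf("pid 200: %s\n", fault_is_private(&page, 200) ? "private" : "shared");
      return 0;
  }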

Patch 30 avoids migrating memory immediately after the load balancer moves
	a task to another node in case it's a transient migration.

Patch 31 picks the least loaded CPU on the preferred node based on
	a scheduling domain common to both the source and destination
	NUMA nodes.

Patch 32 retries task migration if an earlier attempt failed

Patch 33 will begin task migration immediately if running on its preferred
	node

Patch 34 will avoid trapping hinting faults for shared read-only library
	pages as these never migrate anyway

Patch 35 avoids handling pmd hinting faults if none of the ptes below it were
	marked pte numa

Patches 36-37 introduce a mechanism for swapping tasks

Patch 38 uses a system-wide search to find tasks that can be swapped
	to improve the overall locality of the system.
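
(Toy model of the swap criterion, not the series' code: using per-node
fault counts as a locality proxy, a swap is worthwhile if it increases the
number of faults that would become node-local. The numbers are hypothetical.)

  /* Toy model: would swapping two tasks across two nodes improve locality? */
  #include <stdio.h>
  #include <stdbool.h>

  struct task_faults {
      unsigned long faults[2];   /* hinting faults attributed to node 0 and 1 */
  };

  static bool swap_improves_locality(const struct task_faults *a, /* on node 0 */
                                     const struct task_faults *b) /* on node 1 */
  {
      unsigned long before = a->faults[0] + b->faults[1];
      unsigned long after  = a->faults[1] + b->faults[0];

      return after > before;
  }

  int main(void)
  {
      struct task_faults a = { { 50, 400 } };  /* mostly faults on node 1 */
      struct task_faults b = { { 300, 20 } };  /* mostly faults on node 0 */

      printf("swap %s locality\n",
             swap_improves_locality(&a, &b) ? "improves" : "does not improve");
      return 0;
  }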

Patch 39 notes that the system-wide search may ignore the preferred node and
	will use the preferred node placement if it has spare compute
	capacity.

Patches 40-42 use cpupid to track pages so potential sharing tasks can
	be quickly found
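
(Simplified sketch of a cpu+pid encoding, with arbitrary bit widths chosen
for the example: the last faulting CPU and a truncated pid are packed into
one word so they can live in spare page flag bits.)

  /* Illustrative cpupid packing, not the kernel's actual layout. */
  #include <stdio.h>

  #define CPU_BITS 8
  #define PID_BITS 12
  #define PID_MASK ((1u << PID_BITS) - 1)
  #define CPU_MASK ((1u << CPU_BITS) - 1)

  static unsigned int make_cpupid(unsigned int cpu, unsigned int pid)
  {
      return ((cpu & CPU_MASK) << PID_BITS) | (pid & PID_MASK);
  }

  static unsigned int cpupid_to_cpu(unsigned int cpupid) { return cpupid >> PID_BITS; }
  static unsigned int cpupid_to_pid(unsigned int cpupid) { return cpupid & PID_MASK; }

  int main(void)
  {
      unsigned int cpupid = make_cpupid(3, 4242);

      printf("cpu=%u pid(truncated)=%u\n",
             cpupid_to_cpu(cpupid), cpupid_to_pid(cpupid));
      /* Two tasks repeatedly faulting on the same pages will observe each
       * other's cpupid values, which is how potential sharers are found. */
      return 0;
  }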

Patches 43-44 avoid grouping based on read-only pages

Patches 45-46 schedule tasks based on their numa group

Patch 47 adds some debugging aids

Patches 48-49 separately consider task and group weights when selecting the node to
	schedule a task on
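
(Toy illustration of blending the two weights, not the actual scheduler
code: each node is scored by the task's share of its own faults and by its
group's share, combined with an invented 50/50 blend.)

  /* Illustrative node scoring from task and group fault shares. */
  #include <stdio.h>

  #define NR_NODES 4

  static double node_score(const unsigned long task_faults[NR_NODES],
                           const unsigned long group_faults[NR_NODES],
                           int nid)
  {
      unsigned long task_total = 0, group_total = 0;

      for (int n = 0; n < NR_NODES; n++) {
          task_total += task_faults[n];
          group_total += group_faults[n];
      }
      double tw = task_total  ? (double)task_faults[nid]  / task_total  : 0.0;
      double gw = group_total ? (double)group_faults[nid] / group_total : 0.0;

      return 0.5 * tw + 0.5 * gw;   /* arbitrary blend for the example */
  }

  int main(void)
  {
      unsigned long task_faults[NR_NODES]  = { 10, 200, 5, 30 };
      unsigned long group_faults[NR_NODES] = { 900, 250, 40, 60 };

      for (int nid = 0; nid < NR_NODES; nid++)
          printf("node %d score %.3f\n",
                 nid, node_score(task_faults, group_faults, nid));
      return 0;
  }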

Patch 50 avoids migrating tasks away from their preferred node.

Kernel 3.11-rc7 is the testing baseline.

o account-v7		Patches 1-7
o lesspmd-v7		Patches 1-35
o selectweight-v7	Patches 1-49
o avoidmove-v7		Patches 1-50

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.

specjbb

                   3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
TPut 1      26483.00 (  0.00%)     26691.00 (  0.79%)     26618.00 (  0.51%)     25450.00 ( -3.90%)
TPut 2      55009.00 (  0.00%)     54744.00 ( -0.48%)     53200.00 ( -3.29%)     53998.00 ( -1.84%)
TPut 3      86711.00 (  0.00%)     85564.00 ( -1.32%)     86547.00 ( -0.19%)     85424.00 ( -1.48%)
TPut 4     108073.00 (  0.00%)    112757.00 (  4.33%)    111408.00 (  3.09%)    113522.00 (  5.04%)
TPut 5     138128.00 (  0.00%)    137733.00 ( -0.29%)    140797.00 (  1.93%)    140930.00 (  2.03%)
TPut 6     161949.00 (  0.00%)    164499.00 (  1.57%)    164759.00 (  1.74%)    161916.00 ( -0.02%)
TPut 7     185205.00 (  0.00%)    190214.00 (  2.70%)    189409.00 (  2.27%)    191425.00 (  3.36%)
TPut 8     214152.00 (  0.00%)    216550.00 (  1.12%)    219510.00 (  2.50%)    217374.00 (  1.50%)
TPut 9     245408.00 (  0.00%)    242975.00 ( -0.99%)    241001.00 ( -1.80%)    243116.00 ( -0.93%)
TPut 10    262786.00 (  0.00%)    267812.00 (  1.91%)    260897.00 ( -0.72%)    267728.00 (  1.88%)
TPut 11    293162.00 (  0.00%)    299621.00 (  2.20%)    291130.00 ( -0.69%)    300006.00 (  2.33%)
TPut 12    310423.00 (  0.00%)    317867.00 (  2.40%)    307821.00 ( -0.84%)    317531.00 (  2.29%)
TPut 13    328542.00 (  0.00%)    347286.00 (  5.71%)    327800.00 ( -0.23%)    344849.00 (  4.96%)
TPut 14    362081.00 (  0.00%)    374173.00 (  3.34%)    342014.00 ( -5.54%)    366256.00 (  1.15%)
TPut 15    374475.00 (  0.00%)    393658.00 (  5.12%)    348941.00 ( -6.82%)    376056.00 (  0.42%)
TPut 16    407367.00 (  0.00%)    409212.00 (  0.45%)    361272.00 (-11.32%)    409353.00 (  0.49%)
TPut 17    423282.00 (  0.00%)    424424.00 (  0.27%)    377808.00 (-10.74%)    410761.00 ( -2.96%)
TPut 18    447960.00 (  0.00%)    456736.00 (  1.96%)    392421.00 (-12.40%)    437756.00 ( -2.28%)
TPut 19    449296.00 (  0.00%)    475797.00 (  5.90%)    404142.00 (-10.05%)    446286.00 ( -0.67%)
TPut 20    480073.00 (  0.00%)    487883.00 (  1.63%)    414085.00 (-13.75%)    453840.00 ( -5.46%)
TPut 21    476891.00 (  0.00%)    505589.00 (  6.02%)    422953.00 (-11.31%)    458974.00 ( -3.76%)
TPut 22    492092.00 (  0.00%)    503878.00 (  2.40%)    433232.00 (-11.96%)    461927.00 ( -6.13%)
TPut 23    500602.00 (  0.00%)    523202.00 (  4.51%)    433320.00 (-13.44%)    454256.00 ( -9.26%)
TPut 24    500408.00 (  0.00%)    509350.00 (  1.79%)    441878.00 (-11.70%)    460559.00 ( -7.96%)
TPut 25    503390.00 (  0.00%)    521126.00 (  3.52%)    454313.00 ( -9.75%)    468970.00 ( -6.84%)
TPut 26    514905.00 (  0.00%)    523315.00 (  1.63%)    453013.00 (-12.02%)    455508.00 (-11.54%)
TPut 27    513125.00 (  0.00%)    529317.00 (  3.16%)    461561.00 (-10.05%)    463229.00 ( -9.72%)
TPut 28    508313.00 (  0.00%)    540357.00 (  6.30%)    460727.00 ( -9.36%)    452718.00 (-10.94%)
TPut 29    514726.00 (  0.00%)    534836.00 (  3.91%)    451867.00 (-12.21%)    449201.00 (-12.73%)
TPut 30    509362.00 (  0.00%)    526295.00 (  3.32%)    453946.00 (-10.88%)    444615.00 (-12.71%)
TPut 31    506812.00 (  0.00%)    532603.00 (  5.09%)    448303.00 (-11.54%)    450953.00 (-11.02%)
TPut 32    500600.00 (  0.00%)    524926.00 (  4.86%)    452692.00 ( -9.57%)    432748.00 (-13.55%)
TPut 33    491116.00 (  0.00%)    525059.00 (  6.91%)    436046.00 (-11.21%)    433109.00 (-11.81%)
TPut 34    483206.00 (  0.00%)    508843.00 (  5.31%)    440762.00 ( -8.78%)    408980.00 (-15.36%)
TPut 35    489281.00 (  0.00%)    504354.00 (  3.08%)    423368.00 (-13.47%)    408371.00 (-16.54%)
TPut 36    480259.00 (  0.00%)    489147.00 (  1.85%)    415108.00 (-13.57%)    397698.00 (-17.19%)
TPut 37    474611.00 (  0.00%)    497076.00 (  4.73%)    411894.00 (-13.21%)    396970.00 (-16.36%)
TPut 38    470478.00 (  0.00%)    487195.00 (  3.55%)    407295.00 (-13.43%)    389028.00 (-17.31%)
TPut 39    437255.00 (  0.00%)    477739.00 (  9.26%)    413837.00 ( -5.36%)    391655.00 (-10.43%)
TPut 40    463513.00 (  0.00%)    473658.00 (  2.19%)    407789.00 (-12.02%)    383771.00 (-17.20%)
TPut 41    426922.00 (  0.00%)    446614.00 (  4.61%)    384862.00 ( -9.85%)    376937.00 (-11.71%)
TPut 42    423707.00 (  0.00%)    442783.00 (  4.50%)    393131.00 ( -7.22%)    389373.00 ( -8.10%)
TPut 43    443489.00 (  0.00%)    444903.00 (  0.32%)    375795.00 (-15.26%)    377239.00 (-14.94%)
TPut 44    415987.00 (  0.00%)    432628.00 (  4.00%)    367343.00 (-11.69%)    383026.00 ( -7.92%)
TPut 45    409382.00 (  0.00%)    424978.00 (  3.81%)    364387.00 (-10.99%)    385429.00 ( -5.85%)
TPut 46    402538.00 (  0.00%)    393039.00 ( -2.36%)    359730.00 (-10.63%)    370411.00 ( -7.98%)
TPut 47    373125.00 (  0.00%)    406744.00 (  9.01%)    342382.00 ( -8.24%)    375368.00 (  0.60%)
TPut 48    405485.00 (  0.00%)    421600.00 (  3.97%)    347063.00 (-14.41%)    400586.00 ( -1.21%)

So this is somewhat of a bad start. The initial bulk of the patches help
but the grouping code did not work out as well. This tends to be a bit
variable as a re-run sometimes behaves very differently. Modelling the task
groupings shows that threads in the same task group are still scheduled to
run on CPUs from different nodes so more work is needed there.

specjbb Peaks
                                  3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7
                               account-v7                 lesspmd-v7            selectweight-v7               avoidmove-v7   
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        373125.00 (  0.00%)        406744.00 (  9.01%)        342382.00 ( -8.24%)        375368.00 (  0.60%)
 Actual Warehouse            27.00 (  0.00%)            29.00 (  7.41%)            28.00 (  3.70%)            26.00 ( -3.70%)
 Actual Peak Bops        514905.00 (  0.00%)        540357.00 (  4.94%)        461561.00 (-10.36%)        468970.00 ( -8.92%)
 SpecJBB Bops              8275.00 (  0.00%)          8604.00 (  3.98%)          7083.00 (-14.40%)          8175.00 ( -1.21%)
 SpecJBB Bops/JVM          8275.00 (  0.00%)          8604.00 (  3.98%)          7083.00 (-14.40%)          8175.00 ( -1.21%)

The actual specjbb score for the overall series does not look as bad
as the raw figures illustrate.

          3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
        account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
User        43513.28    44403.17    44513.42    44406.55
System        871.01      122.46      107.05      116.15
Elapsed      1665.24     1664.94     1665.03     1665.06

A big positive at least is that system CPU overhead is slashed.

                            3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
                          account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success         133385393    14958732     9859116    12092458
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                 138454       15527       10233       12551
NUMA PTE updates              19952605      712634      674115      730464
NUMA hint faults               4113211      710022      668011      729294
NUMA hint local faults         1197939      274740      251230      273679
NUMA hint local percent             29          38          37          37
NUMA pages migrated          133385393    14958732     9859116    12092458
AutoNUMA cost                    23240        3839        3532        3881

And the source of the reduction is obvious here from the much smaller
number of PTE updates and hinting faults.


This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running per node on the system.

specjbb
                     3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                  account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
Mean   1      29995.75 (  0.00%)     30321.50 (  1.09%)     29457.25 ( -1.80%)     30791.75 (  2.65%)
Mean   2      62699.25 (  0.00%)     60564.75 ( -3.40%)     59721.00 ( -4.75%)     61050.00 ( -2.63%)
Mean   3      88312.75 (  0.00%)     89286.50 (  1.10%)     88451.50 (  0.16%)     90461.75 (  2.43%)
Mean   4     117827.00 (  0.00%)    115583.00 ( -1.90%)    116043.00 ( -1.51%)    114945.75 ( -2.45%)
Mean   5     139419.00 (  0.00%)    137869.25 ( -1.11%)    137761.00 ( -1.19%)    136841.25 ( -1.85%)
Mean   6     156185.25 (  0.00%)    155811.50 ( -0.24%)    151628.50 ( -2.92%)    149850.25 ( -4.06%)
Mean   7     162258.25 (  0.00%)    160665.25 ( -0.98%)    154775.25 ( -4.61%)    154356.25 ( -4.87%)
Mean   8     160665.00 (  0.00%)    160376.75 ( -0.18%)    150849.00 ( -6.11%)    154266.50 ( -3.98%)
Mean   9     156048.00 (  0.00%)    159689.75 (  2.33%)    150347.00 ( -3.65%)    150804.75 ( -3.36%)
Mean   10    144640.75 (  0.00%)    153683.50 (  6.25%)    146165.50 (  1.05%)    143256.00 ( -0.96%)
Mean   11    136418.75 (  0.00%)    146141.75 (  7.13%)    139216.75 (  2.05%)    137435.00 (  0.74%)
Mean   12    132808.00 (  0.00%)    141567.75 (  6.60%)    131523.25 ( -0.97%)    139129.50 (  4.76%)
Mean   13    126834.75 (  0.00%)    140738.50 ( 10.96%)    124446.25 ( -1.88%)    138181.50 (  8.95%)
Mean   14    127837.25 (  0.00%)    140882.00 ( 10.20%)    121495.50 ( -4.96%)    128275.75 (  0.34%)
Mean   15    122268.50 (  0.00%)    139983.50 ( 14.49%)    115737.25 ( -5.34%)    119838.75 ( -1.99%)
Mean   16    118739.25 (  0.00%)    142654.25 ( 20.14%)    110902.25 ( -6.60%)    123369.50 (  3.90%)
Mean   17    117972.75 (  0.00%)    136969.50 ( 16.10%)    108398.00 ( -8.12%)    115575.00 ( -2.03%)
Mean   18    116308.50 (  0.00%)    134009.50 ( 15.22%)    109094.25 ( -6.20%)    118385.75 (  1.79%)
Mean   19    114594.75 (  0.00%)    125941.75 (  9.90%)    108366.75 ( -5.43%)    117998.00 (  2.97%)
Mean   20    116338.50 (  0.00%)    121586.50 (  4.51%)    110267.25 ( -5.22%)    121703.00 (  4.61%)
Mean   21    114274.00 (  0.00%)    118586.00 (  3.77%)    105316.25 ( -7.84%)    112591.75 ( -1.47%)
Mean   22    113135.00 (  0.00%)    121886.75 (  7.74%)    108124.25 ( -4.43%)    107672.00 ( -4.83%)
Mean   23    109514.25 (  0.00%)    117894.25 (  7.65%)    111499.50 (  1.81%)    108045.75 ( -1.34%)
Mean   24    112897.00 (  0.00%)    119902.75 (  6.21%)    110615.50 ( -2.02%)    117146.00 (  3.76%)
Mean   25    107127.75 (  0.00%)    125763.50 ( 17.40%)    107750.75 (  0.58%)    116425.75 (  8.68%)
Mean   26    109338.75 (  0.00%)    119034.00 (  8.87%)    105875.25 ( -3.17%)    116591.00 (  6.63%)
Mean   27    110967.75 (  0.00%)    122720.50 ( 10.59%)     99660.75 (-10.19%)    118399.25 (  6.70%)
Mean   28    116559.50 (  0.00%)    121524.50 (  4.26%)     98095.50 (-15.84%)    116433.50 ( -0.11%)
Mean   29    113278.00 (  0.00%)    115992.75 (  2.40%)    101014.00 (-10.83%)    122954.25 (  8.54%)
Mean   30    110273.75 (  0.00%)    112436.50 (  1.96%)    103679.75 ( -5.98%)    127165.50 ( 15.32%)
Mean   31    107409.50 (  0.00%)    120160.00 ( 11.87%)    101122.75 ( -5.85%)    128566.25 ( 19.70%)
Mean   32    105624.00 (  0.00%)    122808.50 ( 16.27%)    100410.75 ( -4.94%)    126009.50 ( 19.30%)
Mean   33    107521.75 (  0.00%)    118049.50 (  9.79%)     97788.25 ( -9.05%)    124172.00 ( 15.49%)
Mean   34    108135.75 (  0.00%)    118198.75 (  9.31%)     99215.25 ( -8.25%)    129010.75 ( 19.30%)
Mean   35    104407.75 (  0.00%)    115090.50 ( 10.23%)     97804.00 ( -6.32%)    126019.75 ( 20.70%)
Mean   36    101119.00 (  0.00%)    118554.75 ( 17.24%)    101608.00 (  0.48%)    126106.00 ( 24.71%)
Mean   37    104228.25 (  0.00%)    123893.25 ( 18.87%)     99277.75 ( -4.75%)    122410.25 ( 17.44%)
Mean   38    104402.50 (  0.00%)    118543.50 ( 13.54%)     97255.00 ( -6.85%)    118682.75 ( 13.68%)
Mean   39    100158.50 (  0.00%)    116866.00 ( 16.68%)     99918.00 ( -0.24%)    122019.75 ( 21.83%)
Mean   40    101911.75 (  0.00%)    117276.25 ( 15.08%)     98766.25 ( -3.09%)    121322.00 ( 19.05%)
Mean   41    104757.50 (  0.00%)    116656.75 ( 11.36%)     97970.25 ( -6.48%)    121403.00 ( 15.89%)
Mean   42    104782.50 (  0.00%)    116385.25 ( 11.07%)     96897.25 ( -7.53%)    118765.25 ( 13.34%)
Mean   43     97073.00 (  0.00%)    113745.50 ( 17.18%)     93433.00 ( -3.75%)    118571.25 ( 22.15%)
Mean   44     99739.00 (  0.00%)    116286.00 ( 16.59%)     96193.50 ( -3.55%)    116149.75 ( 16.45%)
Mean   45    104422.25 (  0.00%)    109978.25 (  5.32%)     95737.50 ( -8.32%)    113604.75 (  8.79%)
Mean   46    103389.25 (  0.00%)    110703.00 (  7.07%)     93711.50 ( -9.36%)    110550.75 (  6.93%)
Mean   47     96092.25 (  0.00%)    108942.50 ( 13.37%)     94220.50 ( -1.95%)    104079.00 (  8.31%)
Mean   48     97596.25 (  0.00%)    109194.00 ( 11.88%)    101071.25 (  3.56%)    101543.00 (  4.04%)
Stddev 1       1326.20 (  0.00%)      1351.58 ( -1.91%)      1525.30 (-15.01%)      1048.89 ( 20.91%)
Stddev 2       1837.05 (  0.00%)      1538.27 ( 16.26%)       919.58 ( 49.94%)      1974.67 ( -7.49%)
Stddev 3       1267.24 (  0.00%)      2599.37 (-105.12%)      2323.12 (-83.32%)      2091.33 (-65.03%)
Stddev 4       6125.28 (  0.00%)      2980.50 ( 51.34%)      1706.84 ( 72.13%)      2497.81 ( 59.22%)
Stddev 5       6161.12 (  0.00%)      2495.59 ( 59.49%)      2466.47 ( 59.97%)      3077.78 ( 50.05%)
Stddev 6       5784.16 (  0.00%)      4799.20 ( 17.03%)      4580.83 ( 20.80%)      2889.81 ( 50.04%)
Stddev 7       6607.07 (  0.00%)      1167.21 ( 82.33%)      6196.26 (  6.22%)      4385.20 ( 33.63%)
Stddev 8       1671.12 (  0.00%)      6631.06 (-296.80%)      6812.80 (-307.68%)      8598.19 (-414.52%)
Stddev 9       6052.25 (  0.00%)      6954.93 (-14.91%)      6382.84 ( -5.46%)      8987.78 (-48.50%)
Stddev 10     11473.39 (  0.00%)      4442.38 ( 61.28%)      6772.50 ( 40.97%)     16758.82 (-46.07%)
Stddev 11      7093.02 (  0.00%)      4526.31 ( 36.19%)      9026.86 (-27.26%)     13353.17 (-88.26%)
Stddev 12      3865.06 (  0.00%)      2743.41 ( 29.02%)     15584.41 (-303.21%)     14112.46 (-265.13%)
Stddev 13      2777.36 (  0.00%)      1050.96 ( 62.16%)     16286.28 (-486.39%)      8243.38 (-196.81%)
Stddev 14      1795.89 (  0.00%)       536.93 ( 70.10%)     13502.75 (-651.87%)      6328.98 (-252.42%)
Stddev 15      2250.85 (  0.00%)      1135.62 ( 49.55%)      9908.63 (-340.22%)     11274.74 (-400.91%)
Stddev 16      1963.42 (  0.00%)       379.50 ( 80.67%)      9645.69 (-391.27%)      2679.87 (-36.49%)
Stddev 17      1592.42 (  0.00%)      1388.57 ( 12.80%)      6322.29 (-297.02%)      3768.27 (-136.64%)
Stddev 18      3317.92 (  0.00%)       721.81 ( 78.25%)      3065.44 (  7.61%)      6375.92 (-92.17%)
Stddev 19      4525.33 (  0.00%)      3273.36 ( 27.67%)      5565.31 (-22.98%)      3248.71 ( 28.21%)
Stddev 20      4140.94 (  0.00%)      2332.35 ( 43.68%)      8000.27 (-93.20%)      6237.91 (-50.64%)
Stddev 21      1515.71 (  0.00%)      3309.22 (-118.33%)      6587.02 (-334.58%)     10217.84 (-574.13%)
Stddev 22      5498.36 (  0.00%)      2437.41 ( 55.67%)      7920.50 (-44.05%)      8414.84 (-53.04%)
Stddev 23      5637.68 (  0.00%)      1832.68 ( 67.49%)      6543.07 (-16.06%)      5976.59 ( -6.01%)
Stddev 24      4862.89 (  0.00%)      6295.82 (-29.47%)      9229.15 (-89.79%)      9046.57 (-86.03%)
Stddev 25      1725.07 (  0.00%)      2986.87 (-73.15%)     13679.77 (-693.00%)      9521.44 (-451.95%)
Stddev 26      4590.06 (  0.00%)      1862.17 ( 59.43%)     10773.97 (-134.72%)      5417.65 (-18.03%)
Stddev 27      6060.43 (  0.00%)      1567.32 ( 74.14%)     10217.36 (-68.59%)      2934.56 ( 51.58%)
Stddev 28      2742.94 (  0.00%)      2533.06 (  7.65%)     11375.97 (-314.74%)      3713.72 (-35.39%)
Stddev 29      3878.01 (  0.00%)       783.58 ( 79.79%)      8718.86 (-124.83%)      2870.90 ( 25.97%)
Stddev 30      4446.49 (  0.00%)       852.75 ( 80.82%)      5318.24 (-19.61%)      2174.56 ( 51.09%)
Stddev 31      3825.27 (  0.00%)       876.75 ( 77.08%)      7412.96 (-93.79%)      1517.78 ( 60.32%)
Stddev 32      8118.60 (  0.00%)      1367.48 ( 83.16%)      5757.34 ( 29.08%)      1025.48 ( 87.37%)
Stddev 33      3237.05 (  0.00%)      3807.47 (-17.62%)      7493.40 (-131.49%)      4600.54 (-42.12%)
Stddev 34      7413.56 (  0.00%)      3599.54 ( 51.45%)      8514.89 (-14.86%)      2999.21 ( 59.54%)
Stddev 35      6061.77 (  0.00%)      3756.88 ( 38.02%)      5594.20 (  7.71%)      4241.61 ( 30.03%)
Stddev 36      5836.80 (  0.00%)      2944.03 ( 49.56%)     10641.97 (-82.33%)      1267.44 ( 78.29%)
Stddev 37      2719.65 (  0.00%)      3819.92 (-40.46%)      4075.76 (-49.86%)      2604.21 (  4.24%)
Stddev 38      3267.94 (  0.00%)      2148.38 ( 34.26%)      5219.19 (-59.71%)      4865.10 (-48.87%)
Stddev 39      3596.06 (  0.00%)      1042.13 ( 71.02%)      5891.17 (-63.82%)      3067.42 ( 14.70%)
Stddev 40      4303.03 (  0.00%)      2518.02 ( 41.48%)      5279.70 (-22.70%)      1750.86 ( 59.31%)
Stddev 41     10269.08 (  0.00%)      3602.25 ( 64.92%)      5907.68 ( 42.47%)      3163.17 ( 69.20%)
Stddev 42      3221.41 (  0.00%)      3707.32 (-15.08%)      6926.80 (-115.02%)      2555.18 ( 20.68%)
Stddev 43      7203.43 (  0.00%)      3082.74 ( 57.20%)      6537.72 (  9.24%)      3912.25 ( 45.69%)
Stddev 44      6164.48 (  0.00%)      2946.14 ( 52.21%)      4702.32 ( 23.72%)      3228.17 ( 47.63%)
Stddev 45      7696.65 (  0.00%)      2461.14 ( 68.02%)      4697.11 ( 38.97%)      4675.68 ( 39.25%)
Stddev 46      6989.59 (  0.00%)      3713.96 ( 46.86%)      5105.63 ( 26.95%)      5008.38 ( 28.35%)
Stddev 47      5580.13 (  0.00%)      4025.00 ( 27.87%)      4034.38 ( 27.70%)      5538.34 (  0.75%)
Stddev 48      5647.24 (  0.00%)      1694.00 ( 70.00%)      2980.82 ( 47.22%)      8123.60 (-43.85%)
TPut   1     119983.00 (  0.00%)    121286.00 (  1.09%)    117829.00 ( -1.80%)    123167.00 (  2.65%)
TPut   2     250797.00 (  0.00%)    242259.00 ( -3.40%)    238884.00 ( -4.75%)    244200.00 ( -2.63%)
TPut   3     353251.00 (  0.00%)    357146.00 (  1.10%)    353806.00 (  0.16%)    361847.00 (  2.43%)
TPut   4     471308.00 (  0.00%)    462332.00 ( -1.90%)    464172.00 ( -1.51%)    459783.00 ( -2.45%)
TPut   5     557676.00 (  0.00%)    551477.00 ( -1.11%)    551044.00 ( -1.19%)    547365.00 ( -1.85%)
TPut   6     624741.00 (  0.00%)    623246.00 ( -0.24%)    606514.00 ( -2.92%)    599401.00 ( -4.06%)
TPut   7     649033.00 (  0.00%)    642661.00 ( -0.98%)    619101.00 ( -4.61%)    617425.00 ( -4.87%)
TPut   8     642660.00 (  0.00%)    641507.00 ( -0.18%)    603396.00 ( -6.11%)    617066.00 ( -3.98%)
TPut   9     624192.00 (  0.00%)    638759.00 (  2.33%)    601388.00 ( -3.65%)    603219.00 ( -3.36%)
TPut   10    578563.00 (  0.00%)    614734.00 (  6.25%)    584662.00 (  1.05%)    573024.00 ( -0.96%)
TPut   11    545675.00 (  0.00%)    584567.00 (  7.13%)    556867.00 (  2.05%)    549740.00 (  0.74%)
TPut   12    531232.00 (  0.00%)    566271.00 (  6.60%)    526093.00 ( -0.97%)    556518.00 (  4.76%)
TPut   13    507339.00 (  0.00%)    562954.00 ( 10.96%)    497785.00 ( -1.88%)    552726.00 (  8.95%)
TPut   14    511349.00 (  0.00%)    563528.00 ( 10.20%)    485982.00 ( -4.96%)    513103.00 (  0.34%)
TPut   15    489074.00 (  0.00%)    559934.00 ( 14.49%)    462949.00 ( -5.34%)    479355.00 ( -1.99%)
TPut   16    474957.00 (  0.00%)    570617.00 ( 20.14%)    443609.00 ( -6.60%)    493478.00 (  3.90%)
TPut   17    471891.00 (  0.00%)    547878.00 ( 16.10%)    433592.00 ( -8.12%)    462300.00 ( -2.03%)
TPut   18    465234.00 (  0.00%)    536038.00 ( 15.22%)    436377.00 ( -6.20%)    473543.00 (  1.79%)
TPut   19    458379.00 (  0.00%)    503767.00 (  9.90%)    433467.00 ( -5.43%)    471992.00 (  2.97%)
TPut   20    465354.00 (  0.00%)    486346.00 (  4.51%)    441069.00 ( -5.22%)    486812.00 (  4.61%)
TPut   21    457096.00 (  0.00%)    474344.00 (  3.77%)    421265.00 ( -7.84%)    450367.00 ( -1.47%)
TPut   22    452540.00 (  0.00%)    487547.00 (  7.74%)    432497.00 ( -4.43%)    430688.00 ( -4.83%)
TPut   23    438057.00 (  0.00%)    471577.00 (  7.65%)    445998.00 (  1.81%)    432183.00 ( -1.34%)
TPut   24    451588.00 (  0.00%)    479611.00 (  6.21%)    442462.00 ( -2.02%)    468584.00 (  3.76%)
TPut   25    428511.00 (  0.00%)    503054.00 ( 17.40%)    431003.00 (  0.58%)    465703.00 (  8.68%)
TPut   26    437355.00 (  0.00%)    476136.00 (  8.87%)    423501.00 ( -3.17%)    466364.00 (  6.63%)
TPut   27    443871.00 (  0.00%)    490882.00 ( 10.59%)    398643.00 (-10.19%)    473597.00 (  6.70%)
TPut   28    466238.00 (  0.00%)    486098.00 (  4.26%)    392382.00 (-15.84%)    465734.00 ( -0.11%)
TPut   29    453112.00 (  0.00%)    463971.00 (  2.40%)    404056.00 (-10.83%)    491817.00 (  8.54%)
TPut   30    441095.00 (  0.00%)    449746.00 (  1.96%)    414719.00 ( -5.98%)    508662.00 ( 15.32%)
TPut   31    429638.00 (  0.00%)    480640.00 ( 11.87%)    404491.00 ( -5.85%)    514265.00 ( 19.70%)
TPut   32    422496.00 (  0.00%)    491234.00 ( 16.27%)    401643.00 ( -4.94%)    504038.00 ( 19.30%)
TPut   33    430087.00 (  0.00%)    472198.00 (  9.79%)    391153.00 ( -9.05%)    496688.00 ( 15.49%)
TPut   34    432543.00 (  0.00%)    472795.00 (  9.31%)    396861.00 ( -8.25%)    516043.00 ( 19.30%)
TPut   35    417631.00 (  0.00%)    460362.00 ( 10.23%)    391216.00 ( -6.32%)    504079.00 ( 20.70%)
TPut   36    404476.00 (  0.00%)    474219.00 ( 17.24%)    406432.00 (  0.48%)    504424.00 ( 24.71%)
TPut   37    416913.00 (  0.00%)    495573.00 ( 18.87%)    397111.00 ( -4.75%)    489641.00 ( 17.44%)
TPut   38    417610.00 (  0.00%)    474174.00 ( 13.54%)    389020.00 ( -6.85%)    474731.00 ( 13.68%)
TPut   39    400634.00 (  0.00%)    467464.00 ( 16.68%)    399672.00 ( -0.24%)    488079.00 ( 21.83%)
TPut   40    407647.00 (  0.00%)    469105.00 ( 15.08%)    395065.00 ( -3.09%)    485288.00 ( 19.05%)
TPut   41    419030.00 (  0.00%)    466627.00 ( 11.36%)    391881.00 ( -6.48%)    485612.00 ( 15.89%)
TPut   42    419130.00 (  0.00%)    465541.00 ( 11.07%)    387589.00 ( -7.53%)    475061.00 ( 13.34%)
TPut   43    388292.00 (  0.00%)    454982.00 ( 17.18%)    373732.00 ( -3.75%)    474285.00 ( 22.15%)
TPut   44    398956.00 (  0.00%)    465144.00 ( 16.59%)    384774.00 ( -3.55%)    464599.00 ( 16.45%)
TPut   45    417689.00 (  0.00%)    439913.00 (  5.32%)    382950.00 ( -8.32%)    454419.00 (  8.79%)
TPut   46    413557.00 (  0.00%)    442812.00 (  7.07%)    374846.00 ( -9.36%)    442203.00 (  6.93%)
TPut   47    384369.00 (  0.00%)    435770.00 ( 13.37%)    376882.00 ( -1.95%)    416316.00 (  8.31%)
TPut   48    390385.00 (  0.00%)    436776.00 ( 11.88%)    404285.00 (  3.56%)    406172.00 (  4.04%)

This is looking a bit better overall. One would generally expect this
JVM configuration to be handled better because there are far fewer problems
dealing with shared pages.

specjbb Peaks
                                  3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7
                               account-v7                 lesspmd-v7            selectweight-v7               avoidmove-v7   
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        545675.00 (  0.00%)        584567.00 (  7.13%)        556867.00 (  2.05%)        549740.00 (  0.74%)
 Actual Warehouse             8.00 (  0.00%)             8.00 (  0.00%)             8.00 (  0.00%)             8.00 (  0.00%)
 Actual Peak Bops        649033.00 (  0.00%)        642661.00 ( -0.98%)        619101.00 ( -4.61%)        617425.00 ( -4.87%)
 SpecJBB Bops            474931.00 (  0.00%)        523877.00 ( 10.31%)        454089.00 ( -4.39%)        482435.00 (  1.58%)
 SpecJBB Bops/JVM        118733.00 (  0.00%)        130969.00 ( 10.31%)        113522.00 ( -4.39%)        120609.00 (  1.58%)

Because the specjbb score is based on a lower number of clients this does
not look as impressive but at least the overall series does not have a
worse specjbb score.


          3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
        account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
User       464762.73   474999.54   474756.33   475883.65
System      10593.13      725.15      752.36      689.00
Elapsed     10409.45    10414.85    10416.46    10441.17

On the other hand, look at the system CPU overhead. We are getting comparable
or better performance at a small fraction of the cost.


                            3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
                          account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success        1339274585    55904453    50672239    48174428
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                1390167       58028       52597       50005
NUMA PTE updates             501107230     9187590     8925627     9120756
NUMA hint faults              69895484     9184458     8917340     9096029
NUMA hint local faults        21848214     3778721     3832832     4025324
NUMA hint local percent             31          41          42          44
NUMA pages migrated         1339274585    55904453    50672239    48174428
AutoNUMA cost                   378431       47048       45611       46459

And again the reduced cost is from massively reduced numbers of PTE updates
and faults. This may mean some workloads converge more slowly but the system
will not get hammered constantly trying to converge either.

                                     3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                                  account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
User    NUMA01               53586.49 (  0.00%)    57956.00 ( -8.15%)    38838.20 ( 27.52%)    45977.10 ( 14.20%)
User    NUMA01_THEADLOCAL    16956.29 (  0.00%)    17070.87 ( -0.68%)    16972.80 ( -0.10%)    17262.89 ( -1.81%)
User    NUMA02                2024.02 (  0.00%)     2022.45 (  0.08%)     2035.17 ( -0.55%)     2013.42 (  0.52%)
User    NUMA02_SMT             968.96 (  0.00%)      992.63 ( -2.44%)      979.86 ( -1.12%)     1379.96 (-42.42%)
System  NUMA01                1442.97 (  0.00%)      542.69 ( 62.39%)      309.92 ( 78.52%)      405.48 ( 71.90%)
System  NUMA01_THEADLOCAL      117.16 (  0.00%)       72.08 ( 38.48%)       75.60 ( 35.47%)       91.56 ( 21.85%)
System  NUMA02                   7.12 (  0.00%)        7.86 (-10.39%)        7.84 (-10.11%)        6.38 ( 10.39%)
System  NUMA02_SMT               8.49 (  0.00%)        3.74 ( 55.95%)        3.53 ( 58.42%)        6.26 ( 26.27%)
Elapsed NUMA01                1216.88 (  0.00%)     1372.29 (-12.77%)      918.05 ( 24.56%)     1065.63 ( 12.43%)
Elapsed NUMA01_THEADLOCAL      375.15 (  0.00%)      388.68 ( -3.61%)      386.02 ( -2.90%)      382.63 ( -1.99%)
Elapsed NUMA02                  48.61 (  0.00%)       52.19 ( -7.36%)       49.65 ( -2.14%)       51.85 ( -6.67%)
Elapsed NUMA02_SMT              49.68 (  0.00%)       51.23 ( -3.12%)       50.36 ( -1.37%)       80.91 (-62.86%)
CPU     NUMA01                4522.00 (  0.00%)     4262.00 (  5.75%)     4264.00 (  5.71%)     4352.00 (  3.76%)
CPU     NUMA01_THEADLOCAL     4551.00 (  0.00%)     4410.00 (  3.10%)     4416.00 (  2.97%)     4535.00 (  0.35%)
CPU     NUMA02                4178.00 (  0.00%)     3890.00 (  6.89%)     4114.00 (  1.53%)     3895.00 (  6.77%)
CPU     NUMA02_SMT            1967.00 (  0.00%)     1944.00 (  1.17%)     1952.00 (  0.76%)     1713.00 ( 12.91%)

Elapsed figures here are poor. The numa01 test case saw an improvement but
it's an adverse workload and not that interesting per se. Its main benefit
is from the reduction of system overhead. numa02_smt suffered badly due
to the last patch in the series, which needs addressing.

nas-omp
                     3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                  account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
Time bt.C      187.22 (  0.00%)      188.02 ( -0.43%)      188.19 ( -0.52%)      197.49 ( -5.49%)
Time cg.C       61.58 (  0.00%)       49.64 ( 19.39%)       61.44 (  0.23%)       56.84 (  7.70%)
Time ep.C       13.28 (  0.00%)       13.28 (  0.00%)       14.05 ( -5.80%)       13.34 ( -0.45%)
Time ft.C       38.35 (  0.00%)       37.39 (  2.50%)       35.08 (  8.53%)       37.05 (  3.39%)
Time is.C        2.12 (  0.00%)        1.75 ( 17.45%)        2.20 ( -3.77%)        2.14 ( -0.94%)
Time lu.C      180.71 (  0.00%)      183.01 ( -1.27%)      186.64 ( -3.28%)      169.77 (  6.05%)
Time mg.C       32.02 (  0.00%)       31.57 (  1.41%)       29.45 (  8.03%)       31.98 (  0.12%)
Time sp.C      413.92 (  0.00%)      396.36 (  4.24%)      400.92 (  3.14%)      388.54 (  6.13%)
Time ua.C      200.27 (  0.00%)      204.68 ( -2.20%)      211.46 ( -5.59%)      194.92 (  2.67%)

This is the NAS Parallel Benchmarks (NPB) running with OpenMP. Some small improvements.

          3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
        account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
User        47694.80    47262.68    47998.27    46282.27
System        421.02      136.12      129.91      131.36
Elapsed      1265.34     1242.74     1267.06     1229.70

With large reductions of system CPU usage.

So overall it is still a bit of a mixed bag. There is not a universal
performance win but there are massive reductions in system CPU overhead
which may be of big benefit on larger machines meaning the series is still
worth considering.  The ratio of local/remote NUMA hinting faults is still
very low and the fact that there are tasks sharing a numa group running
on different nodes should be examined more closely.

 Documentation/sysctl/kernel.txt   |   73 ++
 arch/x86/mm/numa.c                |    6 +-
 fs/proc/array.c                   |    2 +
 include/linux/migrate.h           |    7 +-
 include/linux/mm.h                |  107 ++-
 include/linux/mm_types.h          |   14 +-
 include/linux/page-flags-layout.h |   28 +-
 include/linux/sched.h             |   45 +-
 include/linux/stop_machine.h      |    1 +
 kernel/bounds.c                   |    4 +
 kernel/fork.c                     |    5 +-
 kernel/sched/core.c               |  196 ++++-
 kernel/sched/debug.c              |   60 +-
 kernel/sched/fair.c               | 1523 ++++++++++++++++++++++++++++++-------
 kernel/sched/features.h           |   19 +-
 kernel/sched/idle_task.c          |    2 +-
 kernel/sched/rt.c                 |    5 +-
 kernel/sched/sched.h              |   19 +-
 kernel/sched/stop_task.c          |    2 +-
 kernel/stop_machine.c             |  272 ++++---
 kernel/sysctl.c                   |    7 +
 lib/vsprintf.c                    |    5 +
 mm/huge_memory.c                  |  103 ++-
 mm/memory.c                       |   95 ++-
 mm/mempolicy.c                    |   24 +-
 mm/migrate.c                      |   21 +-
 mm/mm_init.c                      |   18 +-
 mm/mmzone.c                       |   14 +-
 mm/mprotect.c                     |   70 +-
 mm/page_alloc.c                   |    4 +-
 30 files changed, 2147 insertions(+), 604 deletions(-)

-- 
1.8.1.4


Mean   44     99739.00 (  0.00%)    116286.00 ( 16.59%)     96193.50 ( -3.55%)    116149.75 ( 16.45%)
Mean   45    104422.25 (  0.00%)    109978.25 (  5.32%)     95737.50 ( -8.32%)    113604.75 (  8.79%)
Mean   46    103389.25 (  0.00%)    110703.00 (  7.07%)     93711.50 ( -9.36%)    110550.75 (  6.93%)
Mean   47     96092.25 (  0.00%)    108942.50 ( 13.37%)     94220.50 ( -1.95%)    104079.00 (  8.31%)
Mean   48     97596.25 (  0.00%)    109194.00 ( 11.88%)    101071.25 (  3.56%)    101543.00 (  4.04%)
Stddev 1       1326.20 (  0.00%)      1351.58 ( -1.91%)      1525.30 (-15.01%)      1048.89 ( 20.91%)
Stddev 2       1837.05 (  0.00%)      1538.27 ( 16.26%)       919.58 ( 49.94%)      1974.67 ( -7.49%)
Stddev 3       1267.24 (  0.00%)      2599.37 (-105.12%)      2323.12 (-83.32%)      2091.33 (-65.03%)
Stddev 4       6125.28 (  0.00%)      2980.50 ( 51.34%)      1706.84 ( 72.13%)      2497.81 ( 59.22%)
Stddev 5       6161.12 (  0.00%)      2495.59 ( 59.49%)      2466.47 ( 59.97%)      3077.78 ( 50.05%)
Stddev 6       5784.16 (  0.00%)      4799.20 ( 17.03%)      4580.83 ( 20.80%)      2889.81 ( 50.04%)
Stddev 7       6607.07 (  0.00%)      1167.21 ( 82.33%)      6196.26 (  6.22%)      4385.20 ( 33.63%)
Stddev 8       1671.12 (  0.00%)      6631.06 (-296.80%)      6812.80 (-307.68%)      8598.19 (-414.52%)
Stddev 9       6052.25 (  0.00%)      6954.93 (-14.91%)      6382.84 ( -5.46%)      8987.78 (-48.50%)
Stddev 10     11473.39 (  0.00%)      4442.38 ( 61.28%)      6772.50 ( 40.97%)     16758.82 (-46.07%)
Stddev 11      7093.02 (  0.00%)      4526.31 ( 36.19%)      9026.86 (-27.26%)     13353.17 (-88.26%)
Stddev 12      3865.06 (  0.00%)      2743.41 ( 29.02%)     15584.41 (-303.21%)     14112.46 (-265.13%)
Stddev 13      2777.36 (  0.00%)      1050.96 ( 62.16%)     16286.28 (-486.39%)      8243.38 (-196.81%)
Stddev 14      1795.89 (  0.00%)       536.93 ( 70.10%)     13502.75 (-651.87%)      6328.98 (-252.42%)
Stddev 15      2250.85 (  0.00%)      1135.62 ( 49.55%)      9908.63 (-340.22%)     11274.74 (-400.91%)
Stddev 16      1963.42 (  0.00%)       379.50 ( 80.67%)      9645.69 (-391.27%)      2679.87 (-36.49%)
Stddev 17      1592.42 (  0.00%)      1388.57 ( 12.80%)      6322.29 (-297.02%)      3768.27 (-136.64%)
Stddev 18      3317.92 (  0.00%)       721.81 ( 78.25%)      3065.44 (  7.61%)      6375.92 (-92.17%)
Stddev 19      4525.33 (  0.00%)      3273.36 ( 27.67%)      5565.31 (-22.98%)      3248.71 ( 28.21%)
Stddev 20      4140.94 (  0.00%)      2332.35 ( 43.68%)      8000.27 (-93.20%)      6237.91 (-50.64%)
Stddev 21      1515.71 (  0.00%)      3309.22 (-118.33%)      6587.02 (-334.58%)     10217.84 (-574.13%)
Stddev 22      5498.36 (  0.00%)      2437.41 ( 55.67%)      7920.50 (-44.05%)      8414.84 (-53.04%)
Stddev 23      5637.68 (  0.00%)      1832.68 ( 67.49%)      6543.07 (-16.06%)      5976.59 ( -6.01%)
Stddev 24      4862.89 (  0.00%)      6295.82 (-29.47%)      9229.15 (-89.79%)      9046.57 (-86.03%)
Stddev 25      1725.07 (  0.00%)      2986.87 (-73.15%)     13679.77 (-693.00%)      9521.44 (-451.95%)
Stddev 26      4590.06 (  0.00%)      1862.17 ( 59.43%)     10773.97 (-134.72%)      5417.65 (-18.03%)
Stddev 27      6060.43 (  0.00%)      1567.32 ( 74.14%)     10217.36 (-68.59%)      2934.56 ( 51.58%)
Stddev 28      2742.94 (  0.00%)      2533.06 (  7.65%)     11375.97 (-314.74%)      3713.72 (-35.39%)
Stddev 29      3878.01 (  0.00%)       783.58 ( 79.79%)      8718.86 (-124.83%)      2870.90 ( 25.97%)
Stddev 30      4446.49 (  0.00%)       852.75 ( 80.82%)      5318.24 (-19.61%)      2174.56 ( 51.09%)
Stddev 31      3825.27 (  0.00%)       876.75 ( 77.08%)      7412.96 (-93.79%)      1517.78 ( 60.32%)
Stddev 32      8118.60 (  0.00%)      1367.48 ( 83.16%)      5757.34 ( 29.08%)      1025.48 ( 87.37%)
Stddev 33      3237.05 (  0.00%)      3807.47 (-17.62%)      7493.40 (-131.49%)      4600.54 (-42.12%)
Stddev 34      7413.56 (  0.00%)      3599.54 ( 51.45%)      8514.89 (-14.86%)      2999.21 ( 59.54%)
Stddev 35      6061.77 (  0.00%)      3756.88 ( 38.02%)      5594.20 (  7.71%)      4241.61 ( 30.03%)
Stddev 36      5836.80 (  0.00%)      2944.03 ( 49.56%)     10641.97 (-82.33%)      1267.44 ( 78.29%)
Stddev 37      2719.65 (  0.00%)      3819.92 (-40.46%)      4075.76 (-49.86%)      2604.21 (  4.24%)
Stddev 38      3267.94 (  0.00%)      2148.38 ( 34.26%)      5219.19 (-59.71%)      4865.10 (-48.87%)
Stddev 39      3596.06 (  0.00%)      1042.13 ( 71.02%)      5891.17 (-63.82%)      3067.42 ( 14.70%)
Stddev 40      4303.03 (  0.00%)      2518.02 ( 41.48%)      5279.70 (-22.70%)      1750.86 ( 59.31%)
Stddev 41     10269.08 (  0.00%)      3602.25 ( 64.92%)      5907.68 ( 42.47%)      3163.17 ( 69.20%)
Stddev 42      3221.41 (  0.00%)      3707.32 (-15.08%)      6926.80 (-115.02%)      2555.18 ( 20.68%)
Stddev 43      7203.43 (  0.00%)      3082.74 ( 57.20%)      6537.72 (  9.24%)      3912.25 ( 45.69%)
Stddev 44      6164.48 (  0.00%)      2946.14 ( 52.21%)      4702.32 ( 23.72%)      3228.17 ( 47.63%)
Stddev 45      7696.65 (  0.00%)      2461.14 ( 68.02%)      4697.11 ( 38.97%)      4675.68 ( 39.25%)
Stddev 46      6989.59 (  0.00%)      3713.96 ( 46.86%)      5105.63 ( 26.95%)      5008.38 ( 28.35%)
Stddev 47      5580.13 (  0.00%)      4025.00 ( 27.87%)      4034.38 ( 27.70%)      5538.34 (  0.75%)
Stddev 48      5647.24 (  0.00%)      1694.00 ( 70.00%)      2980.82 ( 47.22%)      8123.60 (-43.85%)
TPut   1     119983.00 (  0.00%)    121286.00 (  1.09%)    117829.00 ( -1.80%)    123167.00 (  2.65%)
TPut   2     250797.00 (  0.00%)    242259.00 ( -3.40%)    238884.00 ( -4.75%)    244200.00 ( -2.63%)
TPut   3     353251.00 (  0.00%)    357146.00 (  1.10%)    353806.00 (  0.16%)    361847.00 (  2.43%)
TPut   4     471308.00 (  0.00%)    462332.00 ( -1.90%)    464172.00 ( -1.51%)    459783.00 ( -2.45%)
TPut   5     557676.00 (  0.00%)    551477.00 ( -1.11%)    551044.00 ( -1.19%)    547365.00 ( -1.85%)
TPut   6     624741.00 (  0.00%)    623246.00 ( -0.24%)    606514.00 ( -2.92%)    599401.00 ( -4.06%)
TPut   7     649033.00 (  0.00%)    642661.00 ( -0.98%)    619101.00 ( -4.61%)    617425.00 ( -4.87%)
TPut   8     642660.00 (  0.00%)    641507.00 ( -0.18%)    603396.00 ( -6.11%)    617066.00 ( -3.98%)
TPut   9     624192.00 (  0.00%)    638759.00 (  2.33%)    601388.00 ( -3.65%)    603219.00 ( -3.36%)
TPut   10    578563.00 (  0.00%)    614734.00 (  6.25%)    584662.00 (  1.05%)    573024.00 ( -0.96%)
TPut   11    545675.00 (  0.00%)    584567.00 (  7.13%)    556867.00 (  2.05%)    549740.00 (  0.74%)
TPut   12    531232.00 (  0.00%)    566271.00 (  6.60%)    526093.00 ( -0.97%)    556518.00 (  4.76%)
TPut   13    507339.00 (  0.00%)    562954.00 ( 10.96%)    497785.00 ( -1.88%)    552726.00 (  8.95%)
TPut   14    511349.00 (  0.00%)    563528.00 ( 10.20%)    485982.00 ( -4.96%)    513103.00 (  0.34%)
TPut   15    489074.00 (  0.00%)    559934.00 ( 14.49%)    462949.00 ( -5.34%)    479355.00 ( -1.99%)
TPut   16    474957.00 (  0.00%)    570617.00 ( 20.14%)    443609.00 ( -6.60%)    493478.00 (  3.90%)
TPut   17    471891.00 (  0.00%)    547878.00 ( 16.10%)    433592.00 ( -8.12%)    462300.00 ( -2.03%)
TPut   18    465234.00 (  0.00%)    536038.00 ( 15.22%)    436377.00 ( -6.20%)    473543.00 (  1.79%)
TPut   19    458379.00 (  0.00%)    503767.00 (  9.90%)    433467.00 ( -5.43%)    471992.00 (  2.97%)
TPut   20    465354.00 (  0.00%)    486346.00 (  4.51%)    441069.00 ( -5.22%)    486812.00 (  4.61%)
TPut   21    457096.00 (  0.00%)    474344.00 (  3.77%)    421265.00 ( -7.84%)    450367.00 ( -1.47%)
TPut   22    452540.00 (  0.00%)    487547.00 (  7.74%)    432497.00 ( -4.43%)    430688.00 ( -4.83%)
TPut   23    438057.00 (  0.00%)    471577.00 (  7.65%)    445998.00 (  1.81%)    432183.00 ( -1.34%)
TPut   24    451588.00 (  0.00%)    479611.00 (  6.21%)    442462.00 ( -2.02%)    468584.00 (  3.76%)
TPut   25    428511.00 (  0.00%)    503054.00 ( 17.40%)    431003.00 (  0.58%)    465703.00 (  8.68%)
TPut   26    437355.00 (  0.00%)    476136.00 (  8.87%)    423501.00 ( -3.17%)    466364.00 (  6.63%)
TPut   27    443871.00 (  0.00%)    490882.00 ( 10.59%)    398643.00 (-10.19%)    473597.00 (  6.70%)
TPut   28    466238.00 (  0.00%)    486098.00 (  4.26%)    392382.00 (-15.84%)    465734.00 ( -0.11%)
TPut   29    453112.00 (  0.00%)    463971.00 (  2.40%)    404056.00 (-10.83%)    491817.00 (  8.54%)
TPut   30    441095.00 (  0.00%)    449746.00 (  1.96%)    414719.00 ( -5.98%)    508662.00 ( 15.32%)
TPut   31    429638.00 (  0.00%)    480640.00 ( 11.87%)    404491.00 ( -5.85%)    514265.00 ( 19.70%)
TPut   32    422496.00 (  0.00%)    491234.00 ( 16.27%)    401643.00 ( -4.94%)    504038.00 ( 19.30%)
TPut   33    430087.00 (  0.00%)    472198.00 (  9.79%)    391153.00 ( -9.05%)    496688.00 ( 15.49%)
TPut   34    432543.00 (  0.00%)    472795.00 (  9.31%)    396861.00 ( -8.25%)    516043.00 ( 19.30%)
TPut   35    417631.00 (  0.00%)    460362.00 ( 10.23%)    391216.00 ( -6.32%)    504079.00 ( 20.70%)
TPut   36    404476.00 (  0.00%)    474219.00 ( 17.24%)    406432.00 (  0.48%)    504424.00 ( 24.71%)
TPut   37    416913.00 (  0.00%)    495573.00 ( 18.87%)    397111.00 ( -4.75%)    489641.00 ( 17.44%)
TPut   38    417610.00 (  0.00%)    474174.00 ( 13.54%)    389020.00 ( -6.85%)    474731.00 ( 13.68%)
TPut   39    400634.00 (  0.00%)    467464.00 ( 16.68%)    399672.00 ( -0.24%)    488079.00 ( 21.83%)
TPut   40    407647.00 (  0.00%)    469105.00 ( 15.08%)    395065.00 ( -3.09%)    485288.00 ( 19.05%)
TPut   41    419030.00 (  0.00%)    466627.00 ( 11.36%)    391881.00 ( -6.48%)    485612.00 ( 15.89%)
TPut   42    419130.00 (  0.00%)    465541.00 ( 11.07%)    387589.00 ( -7.53%)    475061.00 ( 13.34%)
TPut   43    388292.00 (  0.00%)    454982.00 ( 17.18%)    373732.00 ( -3.75%)    474285.00 ( 22.15%)
TPut   44    398956.00 (  0.00%)    465144.00 ( 16.59%)    384774.00 ( -3.55%)    464599.00 ( 16.45%)
TPut   45    417689.00 (  0.00%)    439913.00 (  5.32%)    382950.00 ( -8.32%)    454419.00 (  8.79%)
TPut   46    413557.00 (  0.00%)    442812.00 (  7.07%)    374846.00 ( -9.36%)    442203.00 (  6.93%)
TPut   47    384369.00 (  0.00%)    435770.00 ( 13.37%)    376882.00 ( -1.95%)    416316.00 (  8.31%)
TPut   48    390385.00 (  0.00%)    436776.00 ( 11.88%)    404285.00 (  3.56%)    406172.00 (  4.04%)

This is looking a bit better overall. One would generally expect this
JVM configuration to be handled better because there are far fewer problems
dealing with shared pages.

specjbb Peaks
                                  3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7                 3.11.0-rc7
                               account-v7                 lesspmd-v7            selectweight-v7               avoidmove-v7   
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        545675.00 (  0.00%)        584567.00 (  7.13%)        556867.00 (  2.05%)        549740.00 (  0.74%)
 Actual Warehouse             8.00 (  0.00%)             8.00 (  0.00%)             8.00 (  0.00%)             8.00 (  0.00%)
 Actual Peak Bops        649033.00 (  0.00%)        642661.00 ( -0.98%)        619101.00 ( -4.61%)        617425.00 ( -4.87%)
 SpecJBB Bops            474931.00 (  0.00%)        523877.00 ( 10.31%)        454089.00 ( -4.39%)        482435.00 (  1.58%)
 SpecJBB Bops/JVM        118733.00 (  0.00%)        130969.00 ( 10.31%)        113522.00 ( -4.39%)        120609.00 (  1.58%)

Because the specjbb score is based on a lower number of clients this does
not look as impressive, but at least the overall series does not have a
worse specjbb score.
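
For reference, the composite "SpecJBB Bops" figure appears to be the mean
throughput over the warehouse counts from the expected peak (12) up to twice
that (24), in line with the usual SPECjbb2005 reporting convention. Taking
the account-v7 TPut column as a worked example:

  (531232 + 507339 + ... + 451588) / 13 = 6174090 / 13 ~= 474931

and dividing by the four JVMs gives the reported 118733 Bops/JVM.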


          3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
        account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
User       464762.73   474999.54   474756.33   475883.65
System      10593.13      725.15      752.36      689.00
Elapsed     10409.45    10414.85    10416.46    10441.17

On the other hand, look at the system CPU overhead. We are getting comparable
or better performance at a small fraction of the cost.


                            3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
                          account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
Compaction stalls                    0           0           0           0
Compaction success                   0           0           0           0
Compaction failures                  0           0           0           0
Page migrate success        1339274585    55904453    50672239    48174428
Page migrate failure                 0           0           0           0
Compaction pages isolated            0           0           0           0
Compaction migrate scanned           0           0           0           0
Compaction free scanned              0           0           0           0
Compaction cost                1390167       58028       52597       50005
NUMA PTE updates             501107230     9187590     8925627     9120756
NUMA hint faults              69895484     9184458     8917340     9096029
NUMA hint local faults        21848214     3778721     3832832     4025324
NUMA hint local percent             31          41          42          44
NUMA pages migrated         1339274585    55904453    50672239    48174428
AutoNUMA cost                   378431       47048       45611       46459

And again the reduced cost is from massively reduced numbers of PTE updates
and faults. This may mean that some workloads converge more slowly, but the
system will not get hammered constantly trying to converge either.

                                     3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                                  account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
User    NUMA01               53586.49 (  0.00%)    57956.00 ( -8.15%)    38838.20 ( 27.52%)    45977.10 ( 14.20%)
User    NUMA01_THEADLOCAL    16956.29 (  0.00%)    17070.87 ( -0.68%)    16972.80 ( -0.10%)    17262.89 ( -1.81%)
User    NUMA02                2024.02 (  0.00%)     2022.45 (  0.08%)     2035.17 ( -0.55%)     2013.42 (  0.52%)
User    NUMA02_SMT             968.96 (  0.00%)      992.63 ( -2.44%)      979.86 ( -1.12%)     1379.96 (-42.42%)
System  NUMA01                1442.97 (  0.00%)      542.69 ( 62.39%)      309.92 ( 78.52%)      405.48 ( 71.90%)
System  NUMA01_THEADLOCAL      117.16 (  0.00%)       72.08 ( 38.48%)       75.60 ( 35.47%)       91.56 ( 21.85%)
System  NUMA02                   7.12 (  0.00%)        7.86 (-10.39%)        7.84 (-10.11%)        6.38 ( 10.39%)
System  NUMA02_SMT               8.49 (  0.00%)        3.74 ( 55.95%)        3.53 ( 58.42%)        6.26 ( 26.27%)
Elapsed NUMA01                1216.88 (  0.00%)     1372.29 (-12.77%)      918.05 ( 24.56%)     1065.63 ( 12.43%)
Elapsed NUMA01_THEADLOCAL      375.15 (  0.00%)      388.68 ( -3.61%)      386.02 ( -2.90%)      382.63 ( -1.99%)
Elapsed NUMA02                  48.61 (  0.00%)       52.19 ( -7.36%)       49.65 ( -2.14%)       51.85 ( -6.67%)
Elapsed NUMA02_SMT              49.68 (  0.00%)       51.23 ( -3.12%)       50.36 ( -1.37%)       80.91 (-62.86%)
CPU     NUMA01                4522.00 (  0.00%)     4262.00 (  5.75%)     4264.00 (  5.71%)     4352.00 (  3.76%)
CPU     NUMA01_THEADLOCAL     4551.00 (  0.00%)     4410.00 (  3.10%)     4416.00 (  2.97%)     4535.00 (  0.35%)
CPU     NUMA02                4178.00 (  0.00%)     3890.00 (  6.89%)     4114.00 (  1.53%)     3895.00 (  6.77%)
CPU     NUMA02_SMT            1967.00 (  0.00%)     1944.00 (  1.17%)     1952.00 (  0.76%)     1713.00 ( 12.91%)

Elapsed figures here are poor. The numa01 test case saw an improvement but
it's an adverse workload and not that interesting per se. Its main benefit
is from the reduction of system overhead. numa02_smt suffered badly due
to the last patch in the series, which needs addressing.

nas-omp
                     3.11.0-rc7            3.11.0-rc7            3.11.0-rc7            3.11.0-rc7
                  account-v7            lesspmd-v7       selectweight-v7          avoidmove-v7   
Time bt.C      187.22 (  0.00%)      188.02 ( -0.43%)      188.19 ( -0.52%)      197.49 ( -5.49%)
Time cg.C       61.58 (  0.00%)       49.64 ( 19.39%)       61.44 (  0.23%)       56.84 (  7.70%)
Time ep.C       13.28 (  0.00%)       13.28 (  0.00%)       14.05 ( -5.80%)       13.34 ( -0.45%)
Time ft.C       38.35 (  0.00%)       37.39 (  2.50%)       35.08 (  8.53%)       37.05 (  3.39%)
Time is.C        2.12 (  0.00%)        1.75 ( 17.45%)        2.20 ( -3.77%)        2.14 ( -0.94%)
Time lu.C      180.71 (  0.00%)      183.01 ( -1.27%)      186.64 ( -3.28%)      169.77 (  6.05%)
Time mg.C       32.02 (  0.00%)       31.57 (  1.41%)       29.45 (  8.03%)       31.98 (  0.12%)
Time sp.C      413.92 (  0.00%)      396.36 (  4.24%)      400.92 (  3.14%)      388.54 (  6.13%)
Time ua.C      200.27 (  0.00%)      204.68 ( -2.20%)      211.46 ( -5.59%)      194.92 (  2.67%)

This is the NAS Parallel Benchmarks (NPB) running with OpenMP. Some small improvements.
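
The exact harness invocation is not shown here; as a rough sketch (thread
count and build class are assumptions based on the class C timings above),
an NPB-OMP run of the bt kernel looks something like:

  export OMP_NUM_THREADS=$(nproc)   # one OpenMP thread per logical CPU
  ./bin/bt.C.x                      # class C binary, e.g. built with "make bt CLASS=C"

with cg, ep, ft, is, lu, mg, sp and ua invoked the same way.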

          3.11.0-rc7  3.11.0-rc7  3.11.0-rc7  3.11.0-rc7
        account-v7   lesspmd-v7   selectweight-v7   avoidmove-v7   
User        47694.80    47262.68    47998.27    46282.27
System        421.02      136.12      129.91      131.36
Elapsed      1265.34     1242.74     1267.06     1229.70

With large reductions of system CPU usage.

So overall it is still a bit of a mixed bag. There is not a universal
performance win but there are massive reductions in system CPU overhead
which may be of big benefit on larger machines, meaning the series is still
worth considering. The ratio of local/remote NUMA hinting faults is still
very low and the fact that there are tasks sharing a numa group running
on different nodes should be examined more closely.

 Documentation/sysctl/kernel.txt   |   73 ++
 arch/x86/mm/numa.c                |    6 +-
 fs/proc/array.c                   |    2 +
 include/linux/migrate.h           |    7 +-
 include/linux/mm.h                |  107 ++-
 include/linux/mm_types.h          |   14 +-
 include/linux/page-flags-layout.h |   28 +-
 include/linux/sched.h             |   45 +-
 include/linux/stop_machine.h      |    1 +
 kernel/bounds.c                   |    4 +
 kernel/fork.c                     |    5 +-
 kernel/sched/core.c               |  196 ++++-
 kernel/sched/debug.c              |   60 +-
 kernel/sched/fair.c               | 1523 ++++++++++++++++++++++++++++++-------
 kernel/sched/features.h           |   19 +-
 kernel/sched/idle_task.c          |    2 +-
 kernel/sched/rt.c                 |    5 +-
 kernel/sched/sched.h              |   19 +-
 kernel/sched/stop_task.c          |    2 +-
 kernel/stop_machine.c             |  272 ++++---
 kernel/sysctl.c                   |    7 +
 lib/vsprintf.c                    |    5 +
 mm/huge_memory.c                  |  103 ++-
 mm/memory.c                       |   95 ++-
 mm/mempolicy.c                    |   24 +-
 mm/migrate.c                      |   21 +-
 mm/mm_init.c                      |   18 +-
 mm/mmzone.c                       |   14 +-
 mm/mprotect.c                     |   70 +-
 mm/page_alloc.c                   |    4 +-
 30 files changed, 2147 insertions(+), 604 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 361+ messages in thread

* [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

So I have the below patch in front of all your patches. It contains the
10 or so sched,fair patches I posted to lkml the other day.

I used these to poke at the group_imb crud, am now digging through
traces of perf bench numa to see if there's anything else I need.

like said on IRC: I boot with ftrace=nop to ensure we allocate properly
sized trace buffers. This can also be done at runtime by switching
active tracer -- this allocates the default buffer size, or by
explicitly setting a per-cpu buffer size in
/debug/tracing/buffer_size_kb. By default the thing allocates a single
page per cpu or something uselessly small like that.
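
Concretely, the runtime alternatives look something like this (with debugfs
mounted at /debug as above):

  # either set an explicit per-cpu buffer size (in KiB) ...
  echo 8192 > /debug/tracing/buffer_size_kb
  # ... or switch the active tracer (e.g. the function tracer), which
  # allocates the default buffer size
  echo function > /debug/tracing/current_tracer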

I then run a benchmark and at an appropriate time (eg. when I see
something 'weird' happen) I do something like:

  echo 0 > /debug/tracing/tracing_on  # disable writing into the buffers
  cat /debug/tracing/trace > ~/trace  # dump to file
  echo 0 > /debug/tracing/trace       # reset buffers
  echo 1 > /debug/tracing/tracing_on  # enable writing to the buffers

[ Note I mount debugfs at /debug, this is not the default location but I
  think the rest of the world is wrong ;-) ]

Also, the brain seems to adapt once you're staring at them for longer
than a day -- yay for human pattern recognition skillz.

Ingo tends to favour more verbose dumps, I tend to favour minimal
dumps.. whatever works for you is something you'll learn with
experience.
---
 arch/x86/mm/numa.c   |   6 +-
 kernel/sched/core.c  |  18 +-
 kernel/sched/fair.c  | 498 ++++++++++++++++++++++++++++-----------------------
 kernel/sched/sched.h |   1 +
 lib/vsprintf.c       |   5 +
 5 files changed, 288 insertions(+), 240 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..4ed4612 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -737,7 +737,6 @@ int early_cpu_to_node(int cpu)
 void debug_cpumask_set_cpu(int cpu, int node, bool enable)
 {
 	struct cpumask *mask;
-	char buf[64];
 
 	if (node == NUMA_NO_NODE) {
 		/* early_cpu_to_node() already emits a warning and trace */
@@ -755,10 +754,9 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)
 	else
 		cpumask_clear_cpu(cpu, mask);
 
-	cpulist_scnprintf(buf, sizeof(buf), mask);
-	printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
+	printk(KERN_DEBUG "%s cpu %d node %d: mask now %pc\n",
 		enable ? "numa_add_cpu" : "numa_remove_cpu",
-		cpu, node, buf);
+		cpu, node, mask);
 	return;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 05c39f0..f307c2c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4809,9 +4809,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 				  struct cpumask *groupmask)
 {
 	struct sched_group *group = sd->groups;
-	char str[256];
 
-	cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd));
 	cpumask_clear(groupmask);
 
 	printk(KERN_DEBUG "%*s domain %d: ", level, "", level);
@@ -4824,7 +4822,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		return -1;
 	}
 
-	printk(KERN_CONT "span %s level %s\n", str, sd->name);
+	printk(KERN_CONT "span %pc level %s\n", sched_domain_span(sd), sd->name);
 
 	if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
 		printk(KERN_ERR "ERROR: domain->span does not contain "
@@ -4870,9 +4868,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 
 		cpumask_or(groupmask, groupmask, sched_group_cpus(group));
 
-		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
-
-		printk(KERN_CONT " %s", str);
+		printk(KERN_CONT " %pc", sched_group_cpus(group));
 		if (group->sgp->power != SCHED_POWER_SCALE) {
 			printk(KERN_CONT " (cpu_power = %d)",
 				group->sgp->power);
@@ -4964,7 +4960,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
-				SD_SHARE_PKG_RESOURCES);
+				SD_SHARE_PKG_RESOURCES |
+				SD_PREFER_SIBLING);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5168,6 +5165,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			tmp->parent = parent->parent;
 			if (parent->parent)
 				parent->parent->child = tmp;
+			/*
+			 * Transfer SD_PREFER_SIBLING down in case of a
+			 * degenerate parent; the spans match for this
+			 * so the property transfers.
+			 */
+			if (parent->flags & SD_PREFER_SIBLING)
+				tmp->flags |= SD_PREFER_SIBLING;
 			destroy_sched_domain(parent, cpu);
 		} else
 			tmp = tmp->parent;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68f1609..0c085ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3859,7 +3859,8 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_DST_PINNED  0x04
+#define LBF_SOME_PINNED	0x08
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3950,6 +3951,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
 
+		env->flags |= LBF_SOME_PINNED;
+
 		/*
 		 * Remember if this task can be migrated to any other cpu in
 		 * our sched_group. We may want to revisit it if we couldn't
@@ -3958,13 +3961,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		 * Also avoid computing new_dst_cpu if we have already computed
 		 * one in current iteration.
 		 */
-		if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
 			return 0;
 
 		/* Prevent to re-select dst_cpu via env's cpus */
 		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
 			if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
-				env->flags |= LBF_SOME_PINNED;
+				env->flags |= LBF_DST_PINNED;
 				env->new_dst_cpu = cpu;
 				break;
 			}
@@ -4019,6 +4022,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -4233,50 +4237,65 @@ static unsigned long task_h_load(struct task_struct *p)
 
 /********** Helpers for find_busiest_group ************************/
 /*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
+	unsigned long load_per_task;
+	unsigned long group_power;
+	unsigned int sum_nr_running; /* Nr tasks running in the group */
+	unsigned int group_capacity;
+	unsigned int idle_cpus;
+	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
 };
 
+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		 during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *this;	/* Local group in this sd */
+	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_pwr;	/* Total power of all groups in sd */
+	unsigned long avg_load;	/* Average load across all groups in sd */
+
+	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+	struct sg_lb_stats this_stat;	/* Statistics of this group */
+};
+
+static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
+{
+	/*
+	 * struct sd_lb_stats {
+	 *	   struct sched_group *       busiest;             //     0  8
+	 *	   struct sched_group *       this;                //     8  8
+	 *	   long unsigned int          total_load;          //    16  8
+	 *	   long unsigned int          total_pwr;           //    24  8
+	 *	   long unsigned int          avg_load;            //    32  8
+	 *	   struct sg_lb_stats {
+	 *		   long unsigned int  avg_load;            //    40  8
+	 *		   long unsigned int  group_load;          //    48  8
+	 *	           ...
+	 *	   } busiest_stat;                                 //    40 64
+	 *	   struct sg_lb_stats	      this_stat;	   //   104 64
+	 *
+	 *	   // size: 168, cachelines: 3, members: 7
+	 *	   // last cacheline: 40 bytes
+	 * };
+	 *
+	 * Skimp on the clearing to avoid duplicate work. We can avoid clearing
+	 * this_stat because update_sg_lb_stats() does a full clear/assignment.
+	 * We must however clear busiest_stat::avg_load because
+	 * update_sd_pick_busiest() reads this before assignment.
+	 */
+	memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4460,60 +4479,66 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 	return 0;
 }
 
+/*
+ * Group imbalance indicates (and tries to solve) the problem where balancing
+ * groups is inadequate due to tsk_cpus_allowed() constraints.
+ *
+ * Imagine a situation of two groups of 4 cpus each and 4 tasks each with a
+ * cpumask covering 1 cpu of the first group and 3 cpus of the second group.
+ * Something like:
+ *
+ * 	{ 0 1 2 3 } { 4 5 6 7 }
+ * 	        *     * * *
+ *
+ * If we were to balance group-wise we'd place two tasks in the first group and
+ * two tasks in the second group. Clearly this is undesired as it will overload
+ * cpu 3 and leave one of the cpus in the second group unused.
+ *
+ * The current solution to this issue is detecting the skew in the first group
+ * by noticing the lower domain failed to reach balance and had difficulty
+ * moving tasks due to affinity constraints.
+ *
+ * When this is so detected; this group becomes a candidate for busiest; see
+ * update_sd_pick_busiest(). And calculcate_imbalance() and
+ * find_busiest_group() avoid some of the usual balance conditions to allow it
+ * to create an effective group imbalance.
+ *
+ * This is a somewhat tricky proposition since the next run might not find the
+ * group imbalance and decide the groups need to be balanced again. A most
+ * subtle and fragile situation.
+ */
+
+static inline int sg_imbalanced(struct sched_group *group)
+{
+	return group->sgp->imbalance;
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
  * @group: sched_group whose statistics are to be updated.
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, int *balance, struct sg_lb_stats *sgs)
+			int local_group, struct sg_lb_stats *sgs)
 {
-	unsigned long nr_running, max_nr_running, min_nr_running;
-	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
-	unsigned long avg_load_per_task = 0;
+	unsigned long nr_running;
+	unsigned long load;
 	int i;
 
-	if (local_group)
-		balance_cpu = group_balance_cpu(group);
-
-	/* Tally up the load of all CPUs in the group */
-	max_cpu_load = 0;
-	min_cpu_load = ~0UL;
-	max_nr_running = 0;
-	min_nr_running = ~0UL;
-
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
 		nr_running = rq->nr_running;
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu &&
-					cpumask_test_cpu(i, sched_group_mask(group))) {
-				first_idle_cpu = 1;
-				balance_cpu = i;
-			}
-
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
-				max_cpu_load = load;
-			if (min_cpu_load > load)
-				min_cpu_load = load;
-
-			if (nr_running > max_nr_running)
-				max_nr_running = nr_running;
-			if (min_nr_running > nr_running)
-				min_nr_running = nr_running;
-		}
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
@@ -4522,46 +4547,25 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}
 
-	/*
-	 * First idle cpu or the first cpu(busiest) in this sched group
-	 * is eligible for doing load balancing at this and above
-	 * domains. In the newly idle case, we will allow all the cpu's
-	 * to do the newly idle load balance.
-	 */
-	if (local_group) {
-		if (env->idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != env->dst_cpu) {
-				*balance = 0;
-				return;
-			}
-			update_group_power(env->sd, env->dst_cpu);
-		} else if (time_after_eq(jiffies, group->sgp->next_update))
-			update_group_power(env->sd, env->dst_cpu);
-	}
+	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+			time_after_eq(jiffies, group->sgp->next_update)))
+		update_group_power(env->sd, env->dst_cpu);
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+	sgs->group_power = group->sgp->power;
+	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;
 
-	/*
-	 * Consider the group unbalanced when the imbalance is larger
-	 * than the average weight of a task.
-	 *
-	 * APZ: with cgroup the avg task weight can vary wildly and
-	 *      might not be a suitable number - should we keep a
-	 *      normalized nr_running number somewhere that negates
-	 *      the hierarchy?
-	 */
 	if (sgs->sum_nr_running)
-		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
-	    (max_nr_running - min_nr_running) > 1)
-		sgs->group_imb = 1;
+	sgs->group_imb = sg_imbalanced(group);
+
+	sgs->group_capacity =
+		DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);
 
-	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
-						SCHED_POWER_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(env->sd, group);
+
 	sgs->group_weight = group->group_weight;
 
 	if (sgs->group_capacity > sgs->sum_nr_running)
@@ -4586,7 +4590,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
+	if (sgs->avg_load <= sds->busiest_stat.avg_load)
 		return false;
 
 	if (sgs->sum_nr_running > sgs->group_capacity)
@@ -4619,11 +4623,11 @@ static bool update_sd_pick_busiest(struct lb_env *env,
  * @sds: variable to hold the statistics for this sched_domain.
  */
 static inline void update_sd_lb_stats(struct lb_env *env,
-					int *balance, struct sd_lb_stats *sds)
+					struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
-	struct sg_lb_stats sgs;
+	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
@@ -4632,17 +4636,17 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
+		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
 
 		local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
-		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
-
-		if (local_group && !(*balance))
-			return;
+		if (local_group) {
+			sds->this = sg;
+			sgs = &sds->this_stat;
+		}
 
-		sds->total_load += sgs.group_load;
-		sds->total_pwr += sg->sgp->power;
+		memset(sgs, 0, sizeof(*sgs));
+		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -4654,26 +4658,17 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
-			sgs.group_capacity = min(sgs.group_capacity, 1UL);
+		if (prefer_sibling && !local_group &&
+				sds->this && sds->this_stat.group_has_capacity)
+			sgs->group_capacity = min(sgs->group_capacity, 1U);
 
-		if (local_group) {
-			sds->this_load = sgs.avg_load;
-			sds->this = sg;
-			sds->this_nr_running = sgs.sum_nr_running;
-			sds->this_load_per_task = sgs.sum_weighted_load;
-			sds->this_has_capacity = sgs.group_has_capacity;
-			sds->this_idle_cpus = sgs.idle_cpus;
-		} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
-			sds->max_load = sgs.avg_load;
+		/* Now, start updating sd_lb_stats */
+		sds->total_load += sgs->group_load;
+		sds->total_pwr += sgs->group_power;
+
+		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
-			sds->busiest_nr_running = sgs.sum_nr_running;
-			sds->busiest_idle_cpus = sgs.idle_cpus;
-			sds->busiest_group_capacity = sgs.group_capacity;
-			sds->busiest_load_per_task = sgs.sum_weighted_load;
-			sds->busiest_has_capacity = sgs.group_has_capacity;
-			sds->busiest_group_weight = sgs.group_weight;
-			sds->group_imb = sgs.group_imb;
+			sds->busiest_stat = *sgs;
 		}
 
 		sg = sg->next;
@@ -4718,7 +4713,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
 		return 0;
 
 	env->imbalance = DIV_ROUND_CLOSEST(
-		sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+		sds->busiest_stat.avg_load * sds->busiest_stat.group_power,
+		SCHED_POWER_SCALE);
 
 	return 1;
 }
@@ -4736,24 +4732,23 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	unsigned long tmp, pwr_now = 0, pwr_move = 0;
 	unsigned int imbn = 2;
 	unsigned long scaled_busy_load_per_task;
+	struct sg_lb_stats *this, *busiest;
 
-	if (sds->this_nr_running) {
-		sds->this_load_per_task /= sds->this_nr_running;
-		if (sds->busiest_load_per_task >
-				sds->this_load_per_task)
-			imbn = 1;
-	} else {
-		sds->this_load_per_task =
-			cpu_avg_load_per_task(env->dst_cpu);
-	}
+	this = &sds->this_stat;
+	busiest = &sds->busiest_stat;
 
-	scaled_busy_load_per_task = sds->busiest_load_per_task
-					 * SCHED_POWER_SCALE;
-	scaled_busy_load_per_task /= sds->busiest->sgp->power;
+	if (!this->sum_nr_running)
+		this->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
+	else if (busiest->load_per_task > this->load_per_task)
+		imbn = 1;
 
-	if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
-			(scaled_busy_load_per_task * imbn)) {
-		env->imbalance = sds->busiest_load_per_task;
+	scaled_busy_load_per_task =
+		(busiest->load_per_task * SCHED_POWER_SCALE) /
+		busiest->group_power;
+
+	if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
+	    (scaled_busy_load_per_task * imbn)) {
+		env->imbalance = busiest->load_per_task;
 		return;
 	}
 
@@ -4763,34 +4758,37 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load);
+	pwr_now += busiest->group_power *
+			min(busiest->load_per_task, busiest->avg_load);
+	pwr_now += this->group_power *
+			min(this->load_per_task, this->avg_load);
 	pwr_now /= SCHED_POWER_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-		sds->busiest->sgp->power;
-	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->sgp->power *
-			min(sds->busiest_load_per_task, sds->max_load - tmp);
+	tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
+		busiest->group_power;
+	if (busiest->avg_load > tmp) {
+		pwr_move += busiest->group_power *
+			    min(busiest->load_per_task,
+				busiest->avg_load - tmp);
+	}
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->sgp->power <
-		sds->busiest_load_per_task * SCHED_POWER_SCALE)
-		tmp = (sds->max_load * sds->busiest->sgp->power) /
-			sds->this->sgp->power;
-	else
-		tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
-			sds->this->sgp->power;
-	pwr_move += sds->this->sgp->power *
-			min(sds->this_load_per_task, sds->this_load + tmp);
+	if (busiest->avg_load * busiest->group_power <
+	    busiest->load_per_task * SCHED_POWER_SCALE) {
+		tmp = (busiest->avg_load * busiest->group_power) /
+		      this->group_power;
+	} else {
+		tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
+		      this->group_power;
+	}
+	pwr_move += this->group_power *
+		    min(this->load_per_task, this->avg_load + tmp);
 	pwr_move /= SCHED_POWER_SCALE;
 
 	/* Move if we gain throughput */
 	if (pwr_move > pwr_now)
-		env->imbalance = sds->busiest_load_per_task;
+		env->imbalance = busiest->load_per_task;
 }
 
 /**
@@ -4802,11 +4800,18 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long max_pull, load_above_capacity = ~0UL;
+	struct sg_lb_stats *this, *busiest;
 
-	sds->busiest_load_per_task /= sds->busiest_nr_running;
-	if (sds->group_imb) {
-		sds->busiest_load_per_task =
-			min(sds->busiest_load_per_task, sds->avg_load);
+	this = &sds->this_stat;
+	busiest = &sds->busiest_stat;
+
+	if (busiest->group_imb) {
+		/*
+		 * In the group_imb case we cannot rely on group-wide averages
+		 * to ensure cpu-load equilibrium, look at wider averages. XXX
+		 */
+		busiest->load_per_task =
+			min(busiest->load_per_task, sds->avg_load);
 	}
 
 	/*
@@ -4814,21 +4819,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * max load less than avg load(as we skip the groups at or below
 	 * its cpu_power, while calculating max_load..)
 	 */
-	if (sds->max_load < sds->avg_load) {
+	if (busiest->avg_load < sds->avg_load) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
 
-	if (!sds->group_imb) {
+	if (!busiest->group_imb) {
 		/*
 		 * Don't want to pull so many tasks that a group would go idle.
+		 * Except of course for the group_imb case, since then we might
+		 * have to drop below capacity to reach cpu-load equilibrium.
 		 */
-		load_above_capacity = (sds->busiest_nr_running -
-						sds->busiest_group_capacity);
+		load_above_capacity =
+			(busiest->sum_nr_running - busiest->group_capacity);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
-		load_above_capacity /= sds->busiest->sgp->power;
+		load_above_capacity /= busiest->group_power;
 	}
 
 	/*
@@ -4838,15 +4844,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * we also don't want to reduce the group load below the group capacity
 	 * (so that we can implement power-savings policies etc). Thus we look
 	 * for the minimum possible imbalance.
-	 * Be careful of negative numbers as they'll appear as very large values
-	 * with unsigned longs.
 	 */
-	max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
 
 	/* How much load to actually move to equalise the imbalance */
-	env->imbalance = min(max_pull * sds->busiest->sgp->power,
-		(sds->avg_load - sds->this_load) * sds->this->sgp->power)
-			/ SCHED_POWER_SCALE;
+	env->imbalance = min(
+		max_pull * busiest->group_power,
+		(sds->avg_load - this->avg_load) * this->group_power
+	) / SCHED_POWER_SCALE;
 
 	/*
 	 * if *imbalance is less than the average load per runnable task
@@ -4854,9 +4859,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * a think about bumping its value to force at least one task to be
 	 * moved
 	 */
-	if (env->imbalance < sds->busiest_load_per_task)
+	if (env->imbalance < busiest->load_per_task)
 		return fix_small_imbalance(env, sds);
-
 }
 
 /******* find_busiest_group() helpers end here *********************/
@@ -4872,69 +4876,62 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
  * to restore balance.
  *
  * @env: The load balancing environment.
- * @balance: Pointer to a variable indicating if this_cpu
- *	is the appropriate cpu to perform load balancing at this_level.
  *
  * Return:	- The busiest group if imbalance exists.
  *		- If no imbalance and user has opted for power-savings balance,
  *		   return the least loaded group whose CPUs can be
  *		   put to idle by rebalancing its tasks onto our group.
  */
-static struct sched_group *
-find_busiest_group(struct lb_env *env, int *balance)
+static struct sched_group *find_busiest_group(struct lb_env *env)
 {
+	struct sg_lb_stats *this, *busiest;
 	struct sd_lb_stats sds;
 
-	memset(&sds, 0, sizeof(sds));
+	init_sd_lb_stats(&sds);
 
 	/*
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
 	 */
-	update_sd_lb_stats(env, balance, &sds);
-
-	/*
-	 * this_cpu is not the appropriate cpu to perform load balancing at
-	 * this level.
-	 */
-	if (!(*balance))
-		goto ret;
+	update_sd_lb_stats(env, &sds);
+	this = &sds.this_stat;
+	busiest = &sds.busiest_stat;
 
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
 
 	/* There is no busy sibling group to pull tasks from */
-	if (!sds.busiest || sds.busiest_nr_running == 0)
+	if (!sds.busiest || busiest->sum_nr_running == 0)
 		goto out_balanced;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
 	/*
 	 * If the busiest group is imbalanced the below checks don't
-	 * work because they assumes all things are equal, which typically
+	 * work because they assume all things are equal, which typically
 	 * isn't true due to cpus_allowed constraints and the like.
 	 */
-	if (sds.group_imb)
+	if (busiest->group_imb)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
-			!sds.busiest_has_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && this->group_has_capacity &&
+			!busiest->group_has_capacity)
 		goto force_balance;
 
 	/*
 	 * If the local group is more busy than the selected busiest group
 	 * don't try and pull any tasks.
 	 */
-	if (sds.this_load >= sds.max_load)
+	if (this->avg_load >= busiest->avg_load)
 		goto out_balanced;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
-	if (sds.this_load >= sds.avg_load)
+	if (this->avg_load >= sds.avg_load)
 		goto out_balanced;
 
 	if (env->idle == CPU_IDLE) {
@@ -4944,15 +4941,16 @@ find_busiest_group(struct lb_env *env, int *balance)
 		 * there is no imbalance between this and busiest group
 		 * wrt to idle cpu's, it is balanced.
 		 */
-		if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
-		    sds.busiest_nr_running <= sds.busiest_group_weight)
+		if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
+		    busiest->sum_nr_running <= busiest->group_weight)
 			goto out_balanced;
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
 		 * imbalance_pct to be conservative.
 		 */
-		if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+		if (100 * busiest->avg_load <=
+				env->sd->imbalance_pct * this->avg_load)
 			goto out_balanced;
 	}
 
@@ -4962,7 +4960,6 @@ force_balance:
 	return sds.busiest;
 
 out_balanced:
-ret:
 	env->imbalance = 0;
 	return NULL;
 }
@@ -4974,10 +4971,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 				     struct sched_group *group)
 {
 	struct rq *busiest = NULL, *rq;
-	unsigned long max_load = 0;
+	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
 	int i;
 
-	for_each_cpu(i, sched_group_cpus(group)) {
+	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		unsigned long power = power_of(i);
 		unsigned long capacity = DIV_ROUND_CLOSEST(power,
 							   SCHED_POWER_SCALE);
@@ -4986,9 +4983,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
 
-		if (!cpumask_test_cpu(i, env->cpus))
-			continue;
-
 		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
 
@@ -5005,10 +4999,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
-		wl = (wl * SCHED_POWER_SCALE) / power;
-
-		if (wl > max_load) {
-			max_load = wl;
+		if (wl * busiest_power > busiest_load * power) {
+			busiest_load = wl;
+			busiest_power = power;
 			busiest = rq;
 		}
 	}
@@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
 
 static int active_load_balance_cpu_stop(void *data);
 
+static int should_we_balance(struct lb_env *env)
+{
+	struct sched_group *sg = env->sd->groups;
+	struct cpumask *sg_cpus, *sg_mask;
+	int cpu, balance_cpu = -1;
+
+	/*
+	 * In the newly idle case, we will allow all the cpu's
+	 * to do the newly idle load balance.
+	 */
+	if (env->idle == CPU_NEWLY_IDLE)
+		return 1;
+
+	sg_cpus = sched_group_cpus(sg);
+	sg_mask = sched_group_mask(sg);
+	/* Try to find first idle cpu */
+	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
+		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+			continue;
+
+		balance_cpu = cpu;
+		break;
+	}
+
+	if (balance_cpu == -1)
+		balance_cpu = group_balance_cpu(sg);
+
+	/*
+	 * First idle cpu or the first cpu(busiest) in this sched group
+	 * is eligible for doing load balancing at this and above domains.
+	 */
+	return balance_cpu != env->dst_cpu;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  */
 static int load_balance(int this_cpu, struct rq *this_rq,
 			struct sched_domain *sd, enum cpu_idle_type idle,
-			int *balance)
+			int *should_balance)
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
+	struct sched_domain *sd_parent = sd->parent;
 	struct sched_group *group;
 	struct rq *busiest;
 	unsigned long flags;
@@ -5080,12 +5108,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
-redo:
-	group = find_busiest_group(&env, balance);
-
-	if (*balance == 0)
+	if (!(*should_balance = should_we_balance(&env)))
 		goto out_balanced;
 
+redo:
+	group = find_busiest_group(&env);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
@@ -5158,11 +5185,11 @@ more_balance:
 		 * moreover subsequent load balance cycles should correct the
 		 * excess load moved.
 		 */
-		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+		if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
 
 			env.dst_rq	 = cpu_rq(env.new_dst_cpu);
 			env.dst_cpu	 = env.new_dst_cpu;
-			env.flags	&= ~LBF_SOME_PINNED;
+			env.flags	&= ~LBF_DST_PINNED;
 			env.loop	 = 0;
 			env.loop_break	 = sched_nr_migrate_break;
 
@@ -5176,6 +5203,18 @@ more_balance:
 			goto more_balance;
 		}
 
+		/*
+		 * We failed to reach balance because of affinity.
+		 */
+		if (sd_parent) {
+			int *group_imbalance = &sd_parent->groups->sgp->imbalance;
+
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+				*group_imbalance = 1;
+			} else if (*group_imbalance)
+				*group_imbalance = 0;
+		}
+
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
 			cpumask_clear_cpu(cpu_of(busiest), cpus);
@@ -5298,7 +5337,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
-		int balance = 1;
+		int should_balance;
 
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5306,7 +5345,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			/* If we've pulled tasks over stop searching: */
 			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE, &balance);
+						   sd, CPU_NEWLY_IDLE,
+						   &should_balance);
 		}
 
 		interval = msecs_to_jiffies(sd->balance_interval);
@@ -5544,7 +5584,7 @@ void update_max_interval(void)
  */
 static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 {
-	int balance = 1;
+	int should_balance = 1;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
@@ -5576,9 +5616,9 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(cpu, rq, sd, idle, &balance)) {
+			if (load_balance(cpu, rq, sd, idle, &should_balance)) {
 				/*
-				 * The LBF_SOME_PINNED logic could have changed
+				 * The LBF_DST_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
 				 * state even if we migrated tasks. Update it.
 				 */
@@ -5599,7 +5639,7 @@ out:
 		 * CPU in our sched group which is doing load balancing more
 		 * actively.
 		 */
-		if (!balance)
+		if (!should_balance)
 			break;
 	}
 	rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..7c17661 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -605,6 +605,7 @@ struct sched_group_power {
 	 */
 	unsigned int power, power_orig;
 	unsigned long next_update;
+	int imbalance; /* XXX unrelated to power but shared group state */
 	/*
 	 * Number of busy cpus in this group.
 	 */
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 739a363..5521015 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -26,6 +26,7 @@
 #include <linux/math64.h>
 #include <linux/uaccess.h>
 #include <linux/ioport.h>
+#include <linux/cpumask.h>
 #include <net/addrconf.h>
 
 #include <asm/page.h>		/* for PAGE_SIZE */
@@ -1142,6 +1143,7 @@ int kptr_restrict __read_mostly;
  *            The maximum supported length is 64 bytes of the input. Consider
  *            to use print_hex_dump() for the larger input.
  * - 'a' For a phys_addr_t type and its derivative types (passed by reference)
+ * - 'c' For a cpumask list
  *
  * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
  * function pointers are really function descriptors, which contain a
@@ -1253,6 +1255,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
 		spec.base = 16;
 		return number(buf, end,
 			      (unsigned long long) *((phys_addr_t *)ptr), spec);
+	case 'c':
+		return buf + cpulist_scnprintf(buf, end - buf, ptr);
 	}
 	spec.flags |= SMALL;
 	if (spec.field_width == -1) {
@@ -1494,6 +1498,7 @@ qualifier:
  *   case.
  * %*ph[CDN] a variable-length hex string with a separator (supports up to 64
  *           bytes of the input)
+ * %pc print a cpumask as a comma-separated list
  * %n is ignored
  *
  * ** Please update Documentation/printk-formats.txt when making changes **
-- 
1.8.1.4
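
As a usage note for the new '%pc' format: a caller can hand a struct cpumask
pointer straight to printk() instead of pre-formatting it into a stack buffer
with cpulist_scnprintf(), which is what the arch/x86/mm/numa.c hunk earlier in
this patch does. A minimal sketch follows; the helper below is hypothetical,
only the '%pc' usage itself comes from the patch:

#include <linux/cpumask.h>
#include <linux/printk.h>

/* Hypothetical helper; illustrates the '%pc' specifier added above. */
static void example_print_mask(int node, const struct cpumask *mask)
{
	/* vsprintf formats the mask via cpulist_scnprintf() internally. */
	printk(KERN_DEBUG "node %d: mask now %pc\n", node, mask);
}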


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 02/50] mm: numa: Document automatic NUMA balancing sysctls
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ab7d16e..ccadb52 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples which task or thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running.  Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases.  The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases.  The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node when the
+workload pattern changes, minimising the performance impact of remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a task's scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
-- 
1.8.1.4
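
For completeness, a sketch of how the sysctls documented above could be tuned
from userspace. It assumes the conventional /proc/sys/kernel/ paths used for
entries in Documentation/sysctl/kernel.txt; the helper name and the values are
examples only and nothing below is part of the patch:

#include <stdio.h>

/* Hypothetical helper: write a value to a sysctl under /proc/sys/kernel/. */
static int write_kernel_sysctl(const char *name, long val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/kernel/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", val);
	return fclose(f);
}

int main(void)
{
	/* Raise the minimum scan period: scan each task at most every 100ms. */
	write_kernel_sysctl("numa_balancing_scan_period_min_ms", 100);
	/* Scan 256MB worth of pages in each scan pass. */
	write_kernel_sysctl("numa_balancing_scan_size_mb", 256);
	return 0;
}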


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 03/50] sched, numa: Comment fixlets
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Fix an 80 column violation and a PTE vs PMD reference.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 ++++----
 mm/huge_memory.c    | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c085ac..227b070 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
-	 * It is possible to reach the end of the VMA list but the last few VMAs are
-	 * not guaranteed to the vma_migratable. If they are not, we would find the
-	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * It is possible to reach the end of the VMA list but the last few
+	 * VMAs are not guaranteed to be vma_migratable. If they are not, we
+	 * would find the !migratable VMA on the next scan but not reset the
+	 * scanner to the start so check it now.
 	 */
 	if (vma)
 		mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..860a368 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1317,7 +1317,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
-	/* Confirm the PTE did not while locked */
+	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		unlock_page(page);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 04/50] mm: numa: Do not account for a hinting fault if we raced
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

If another task handled a hinting fault in parallel then do not double
account for it.

Not-signed-off-by: Peter Zijlstra
---
 mm/huge_memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 860a368..5c37cd2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1337,8 +1337,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 check_same:
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	if (unlikely(!pmd_same(pmd, *pmdp))) {
+		/* Someone else took our fault */
+		current_nid = -1;
 		goto out_unlock;
+	}
 clear_pmdnuma:
 	pmd = pmd_mknonnuma(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 05/50] mm: Wait for THP migrations to complete during NUMA hinting faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety but there is a bug in the scheme.

If a THP page is currently being migrated and another thread traps a
fault on the same page it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.

This patch checks if the page is potentially being migrated and, if so,
stalls on lock_page before checking whether the page is misplaced.

Not-signed-off-by: Peter Zijlstra
---
 mm/huge_memory.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c37cd2..d0a3fce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1307,13 +1307,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (current_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		put_page(page);
-		goto clear_pmdnuma;
-	}
+	/*
+	 * Acquire the page lock to serialise THP migrations but avoid dropping
+	 * page_table_lock if at all possible
+	 */
+	if (trylock_page(page))
+		goto got_lock;
 
-	/* Acquire the page lock to serialise THP migrations */
+	/* Serialise against migration and check placement */
 	spin_unlock(&mm->page_table_lock);
 	lock_page(page);
 
@@ -1324,9 +1325,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		goto out_unlock;
 	}
-	spin_unlock(&mm->page_table_lock);
+
+got_lock:
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		unlock_page(page);
+		put_page(page);
+		goto clear_pmdnuma;
+	}
 
 	/* Migrate the THP to the requested node */
+	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
 	if (!migrated)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 06/50] mm: Prevent parallel splits during THP migration
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

THP migrations are serialised by the page lock but, on its own, that
does not prevent THP splits. If the page is split during THP migration
then the pmd_same checks will prevent page table corruption, but the
page unlock and other fix-ups can potentially cause corruption. This
patch takes the anon_vma lock to prevent parallel splits during
migration.
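
The lock ordering relied on can be sketched as follows (editorial
illustration only; the trylock fast path, reference counting and error
handling are in the diff below):

	/* Page is misplaced: serialise against migration and THP splits */
	get_page(page);
	spin_unlock(&mm->page_table_lock);
	lock_page(page);			  /* blocks a parallel migration */
	anon_vma = page_lock_anon_vma_read(page); /* blocks a parallel split */

	/* Retake the page table lock and confirm the PMD is unchanged */
	spin_lock(&mm->page_table_lock);
	if (unlikely(!pmd_same(pmd, *pmdp))) {
		unlock_page(page);
		put_page(page);
		goto out_unlock;
	}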

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d0a3fce..981d8a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1290,18 +1290,18 @@ out:
 int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
+	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int target_nid;
 	int current_nid = -1;
-	bool migrated;
+	bool migrated, page_locked;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	get_page(page);
 	current_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (current_nid == numa_node_id())
@@ -1311,12 +1311,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
-	if (trylock_page(page))
-		goto got_lock;
+	page_locked = trylock_page(page);
+	target_nid = mpol_misplaced(page, vma, haddr);
+	if (target_nid == -1) {
+		/* If the page was locked, there are no parallel migrations */
+		if (page_locked) {
+			unlock_page(page);
+			goto clear_pmdnuma;
+		}
 
-	/* Serialise against migration and check placement */
+		/* Otherwise wait for potential migrations to complete */
+		spin_unlock(&mm->page_table_lock);
+		wait_on_page_locked(page);
+		goto check_same;
+	}
+
+	/* Page is misplaced, serialise migrations and parallel THP splits */
+	get_page(page);
 	spin_unlock(&mm->page_table_lock);
-	lock_page(page);
+	if (!page_locked) {
+		lock_page(page);
+		page_locked = true;
+	}
+	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
 	spin_lock(&mm->page_table_lock);
@@ -1326,14 +1343,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 	}
 
-got_lock:
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == -1) {
-		unlock_page(page);
-		put_page(page);
-		goto clear_pmdnuma;
-	}
-
 	/* Migrate the THP to the requested node */
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1342,6 +1351,8 @@ got_lock:
 		goto check_same;
 
 	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
 	return 0;
 
 check_same:
@@ -1358,6 +1369,10 @@ clear_pmdnuma:
 	update_mmu_cache_pmd(vma, addr, pmdp);
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+
+	if (anon_vma)
+		page_unlock_anon_vma_read(anon_vma);
+
 	if (current_nid != -1)
 		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
 	return 0;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A THP PMD update is accounted for as 512 pages updated in vmstat.  This
is a large difference when estimating the cost of automatic NUMA
balancing and can be misleading when comparing results that had
collapsed versus split THP. This patch accounts a THP PMD update as a
single update.
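
In other words (editorial note, not part of the patch), a successful
huge PMD protection change should be counted as a single update rather
than as HPAGE_PMD_NR base pages (512 with 2MB THP on x86-64):

	else if (change_huge_pmd(vma, pmd, addr, newprot, prot_numa)) {
		pages++;	/* one PMD updated, not HPAGE_PMD_NR pages */
		continue;
	}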

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 				split_huge_page_pmd(vma, addr, pmd);
 			else if (change_huge_pmd(vma, pmd, addr, newprot,
 						 prot_numa)) {
-				pages += HPAGE_PMD_NR;
+				pages++;
 				continue;
 			}
 			/* fall through */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 08/50] mm: numa: Sanitize task_numa_fault() callsites
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

There are three callers of task_numa_fault():

 - do_huge_pmd_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_numa_page():
     Accounts against the current node, not the node where the
     page resides, unless we migrated, in which case it accounts
     against the node we migrated to.

 - do_pmd_numa_page():
     Does not account at all when the page isn't migrated; otherwise
     accounts against the node we migrated towards.

This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.

So modify all three sites to always account; we did, after all, receive
the fault. Always account to where the page is after migration,
regardless of success.

They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
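
The common pattern all three callsites converge on can be sketched as
below (editorial illustration, simplified from the diff; nr_pages is 1
for base pages and HPAGE_PMD_NR for THP):

	int page_nid = page_to_nid(page);
	int target_nid = numa_migrate_prep(page, vma, addr, page_nid);
	bool migrated = false;

	if (target_nid != -1) {
		migrated = migrate_misplaced_page(page, target_nid);
		if (migrated)
			page_nid = target_nid;	/* account where the page ended up */
	} else {
		put_page(page);
	}

	if (page_nid != -1)
		task_numa_fault(page_nid, nr_pages, migrated);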

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 26 ++++++++++++++------------
 mm/memory.c      | 53 +++++++++++++++++++++--------------------------------
 2 files changed, 35 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 981d8a2..94d0739 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,18 +1293,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid;
-	int current_nid = -1;
-	bool migrated, page_locked;
+	bool page_locked;
+	bool migrated = false;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
 
 	page = pmd_page(pmd);
-	current_nid = page_to_nid(page);
+	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
@@ -1347,19 +1348,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (!migrated)
+	if (migrated)
+		page_nid = target_nid;
+	else
 		goto check_same;
 
-	task_numa_fault(target_nid, HPAGE_PMD_NR, true);
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-	return 0;
+	goto out;
 
 check_same:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp))) {
 		/* Someone else took our fault */
-		current_nid = -1;
+		page_nid = -1;
 		goto out_unlock;
 	}
 clear_pmdnuma:
@@ -1370,11 +1370,13 @@ clear_pmdnuma:
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
 
+out:
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);
 
-	if (current_nid != -1)
-		task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index af84bc0..c20f872 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3530,12 +3530,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int current_nid)
+				unsigned long addr, int page_nid)
 {
 	get_page(page);
 
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	return mpol_misplaced(page, vma, addr);
@@ -3546,7 +3546,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
-	int current_nid = -1;
+	int page_nid = -1;
 	int target_nid;
 	bool migrated = false;
 
@@ -3576,15 +3576,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	current_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+	page_nid = page_to_nid(page);
+	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
-		/*
-		 * Account for the fault against the current node if it not
-		 * being replaced regardless of where the page is located.
-		 */
-		current_nid = numa_node_id();
 		put_page(page);
 		goto out;
 	}
@@ -3592,11 +3587,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 
 out:
-	if (current_nid != -1)
-		task_numa_fault(current_nid, 1, migrated);
+	if (page_nid != -1)
+		task_numa_fault(page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3611,7 +3606,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int local_nid = numa_node_id();
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3634,9 +3628,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page *page;
-		int curr_nid = local_nid;
+		int page_nid = -1;
 		int target_nid;
-		bool migrated;
+		bool migrated = false;
+
 		if (!pte_present(pteval))
 			continue;
 		if (!pte_numa(pteval))
@@ -3658,25 +3653,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
-		/*
-		 * Note that the NUMA fault is later accounted to either
-		 * the node that is currently running or where the page is
-		 * migrated to.
-		 */
-		curr_nid = local_nid;
-		target_nid = numa_migrate_prep(page, vma, addr,
-					       page_to_nid(page));
-		if (target_nid == -1) {
+		page_nid = page_to_nid(page);
+		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+		pte_unmap_unlock(pte, ptl);
+		if (target_nid != -1) {
+			migrated = migrate_misplaced_page(page, target_nid);
+			if (migrated)
+				page_nid = target_nid;
+		} else {
 			put_page(page);
-			continue;
 		}
 
-		/* Migrate to the requested node */
-		pte_unmap_unlock(pte, ptl);
-		migrated = migrate_misplaced_page(page, target_nid);
-		if (migrated)
-			curr_nid = target_nid;
-		task_numa_fault(curr_nid, 1, migrated);
+		if (page_nid != -1)
+			task_numa_fault(page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 09/50] mm: numa: Do not migrate or account for hinting faults on the zero page
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The zero page is not replicated between nodes and is often shared
between processes. The data is read-only and likely to be cached in
local CPU caches if heavily accessed, meaning that the remote memory
access cost is less of a concern. This patch prevents trapping faults on
the zero page. For tasks using the zero page this will reduce the number
of PTE updates, TLB flushes and hinting faults.
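
Conceptually the change is a filter in the protection-change paths
(editorial sketch; the full context is in the diff below):

	/* Base pages (change_pte_range): never mark the zero page pte_numa */
	page = vm_normal_page(vma, addr, oldpte);
	if (page && !is_zero_pfn(page_to_pfn(page))) {
		/* existing last_nid tracking and pte_numa marking */
	}

	/* THP (change_huge_pmd): likewise skip the huge zero page */
	if (page_mapcount(page) == 1 &&
	    !is_huge_zero_page(page) &&
	    !pmd_numa(*pmd))
		entry = pmd_mknuma(entry);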

[peterz@infradead.org: Correct use of is_huge_zero_page]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c |  7 ++++++-
 mm/memory.c      |  1 +
 mm/mprotect.c    | 10 +++++++++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94d0739..40f75a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1303,6 +1303,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_unlock;
 
 	page = pmd_page(pmd);
+	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
@@ -1488,8 +1489,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			struct page *page = pmd_page(*pmd);
 
-			/* only check non-shared pages */
+			/*
+			 * Only check non-shared pages. See change_pte_range
+			 * for comment on why the zero page is not modified
+			 */
 			if (page_mapcount(page) == 1 &&
+			    !is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmd_mknuma(entry);
 			}
diff --git a/mm/memory.c b/mm/memory.c
index c20f872..86c3caf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3575,6 +3575,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return 0;
 	}
+	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..faa499e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -62,7 +62,15 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				struct page *page;
 
 				page = vm_normal_page(vma, addr, oldpte);
-				if (page) {
+
+				/*
+				 * Do not trap faults against the zero page.
+				 * The read-only data is likely to be
+				 * read-cached on the local CPU cache and it
+				 * is less useful to know about local vs remote
+				 * hits on the zero page
+				 */
+				if (page && !is_zero_pfn(page_to_pfn(page))) {
 					int this_nid = page_to_nid(page);
 					if (last_nid == -1)
 						last_nid = this_nid;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 10/50] sched: numa: Mitigate chance that same task always updates PTEs
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that, for a 4-thread process, it's always
the same task winning the race and doing the protection change.

This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy while marking the PTEs. If it's
always the same task, the ->numa_faults[] statistics get severely skewed.

Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.

Before:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
      thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
      thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
      thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
      thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
      thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
      thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
      thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
      thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
      thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
      thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
      thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
      thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
      thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
      thread 0/0-3232  [022] ....   214.209342: task_numa_work: working

After:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
      thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
      thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
      thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
      thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
      thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
      thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
      thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
      thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
      thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
      thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
      thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
      thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
      thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
      thread 0/3-3256  [024] ....   138.267207: task_numa_work: working
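
The mechanism is small enough to sketch (editorial illustration of the
two hunks below):

	/*
	 * In task_numa_work(): penalise the thread that won the cmpxchg
	 * race so that a sibling thread of the same mm is likely to win
	 * the next round.
	 */
	p->node_stamp += 2 * TICK_NSEC;

	/*
	 * In task_tick_numa(): advance the stamp by a full period instead
	 * of resetting it to "now", so the extra delay added above is
	 * preserved rather than silently absorbed.
	 */
	if (now - curr->node_stamp > period) {
		if (!curr->node_stamp)
			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
		curr->node_stamp += period;
		/* queue task_numa_work as before */
	}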

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 227b070..c93a56e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
+	 * Delay this task enough that another task of this mm will likely win
+	 * the next time around.
+	 */
+	p->node_stamp += 2 * TICK_NSEC;
+
+	/*
 	 * Do not set pte_numa if the current running node is rate-limited.
 	 * This loses statistics on the fault but if we are unwilling to
 	 * migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
 			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
-		curr->node_stamp = now;
+		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
 			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 11/50] sched: numa: Continue PTE scanning even if migrate rate limited
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
rate limited seems like a bad idea. Even if this node can't migrate any
more, other nodes might, and we want up-to-date information to make
balancing decisions. We already rate limit the actual migrations; this
should leave enough bandwidth to allow the non-migrating scanning. I
think it's important we keep up-to-date information if we're going to do
placement based on it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c93a56e..b5aa546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	/*
-	 * Do not set pte_numa if the current running node is rate-limited.
-	 * This loses statistics on the fault but if we are unwilling to
-	 * migrate to this node, it is less likely we can do useful work
-	 */
-	if (migrate_ratelimited(numa_node_id()))
-		return;
-
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 12/50] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

PTE scanning and NUMA hinting fault handling is expensive so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case this
may never happen and no NUMA hinting fault information will be captured.
We are not ruling out the possibility that something better can be done
here but, for now, revert the commit and depend entirely on the
scan_delay to avoid punishing short-lived processes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm_types.h | 10 ----------
 kernel/fork.c            |  3 ---
 kernel/sched/fair.c      | 18 ------------------
 kernel/sched/features.h  |  4 +---
 4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faf4b7c..4f12073 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -427,20 +427,10 @@ struct mm_struct {
 
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
-
-	/*
-	 * The first node a task was scheduled on. If a task runs on
-	 * a different node than Make PTE Scan Go Now.
-	 */
-	int first_nid;
 #endif
 	struct uprobes_state uprobes_state;
 };
 
-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT	-1
-#define NUMA_PTE_SCAN_ACTIVE	-2
-
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
 #ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index e23bb19..f693bdf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -820,9 +820,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5aa546..2e4c8d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
 		return;
 
 	/*
-	 * We do not care about task placement until a task runs on a node
-	 * other than the first one used by the address space. This is
-	 * largely because migrations are driven by what CPU the task
-	 * is running on. If it's never scheduled on another node, it'll
-	 * not migrate so why bother trapping the fault.
-	 */
-	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
-		mm->first_nid = numa_node_id();
-	if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
-		/* Are we running on a new node yet? */
-		if (numa_node_id() == mm->first_nid &&
-		    !sched_feat_numa(NUMA_FORCE))
-			return;
-
-		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
-	}
-
-	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
 	 * can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically
  * at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
-SCHED_FEAT(NUMA_FORCE,	false)
 #endif
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 13/50] sched: numa: Initialise numa_next_scan properly
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.
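
The initialisation being fixed can be sketched as follows (editorial
illustration of the change below):

	/*
	 * At fork (__sched_fork): schedule the first scan and reset after
	 * the configured delays rather than immediately.
	 */
	p->mm->numa_next_scan = jiffies +
		msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
	p->mm->numa_next_reset = jiffies +
		msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);

	/*
	 * In task_numa_work(): catch an mm whose scan times were never
	 * initialised (the "new mm" case mentioned above) and set them up
	 * with the same delays.
	 */
	if (!mm->numa_next_reset || !mm->numa_next_scan) {
		mm->numa_next_scan = now +
			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
		mm->numa_next_reset = now +
			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
	}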

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c | 4 ++--
 kernel/sched/fair.c | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f307c2c..9d7a33a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1634,8 +1634,8 @@ static void __sched_fork(struct task_struct *p)
 
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
-		p->mm->numa_next_scan = jiffies;
-		p->mm->numa_next_reset = jiffies;
+		p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		p->mm->numa_scan_seq = 0;
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e4c8d0..2fb978b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (!mm->numa_next_reset || !mm->numa_next_scan) {
+		mm->numa_next_scan = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+		mm->numa_next_reset = now +
+			msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+	}
+
 	/*
 	 * Reset the scan period if enough time has gone by. Objective is that
 	 * scanning will be reduced if pages are properly placed. As tasks
-- 
1.8.1.4


* [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's
occupied virtual address space. Conceptually this is a lot easier to
understand. There is a "sanity" check to ensure the scan rate is never
extremely fast, based on the amount of virtual memory that should be
scanned in a second. The default of 2.5G seems arbitrary but it was chosen
so that the maximum scan rate after the patch roughly matches the maximum
scan rate before the patch was applied.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt | 11 +++---
 include/linux/sched.h           |  1 +
 kernel/sched/fair.c             | 84 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccadb52..ad8d4f5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -402,15 +402,16 @@ workload pattern changes and minimises performance impact due to remote
 memory accesses. These sysctls control the thresholds for scan delays and
 the number of pages scanned.
 
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
 
 numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
 when it initially forks.
 
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
 
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 078066d..49b426e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
 	int numa_scan_seq;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
+	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fb978b..23fd1f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,10 +818,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
 unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
 
 /* Portion of address space to scan in MB */
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+	unsigned long rss = 0;
+	unsigned long nr_scan_pages;
+
+	/*
+	 * Calculations based on RSS as non-present and empty pages are skipped
+	 * by the PTE scanner and NUMA hinting faults should be trapped based
+	 * on resident pages
+	 */
+	nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+	rss = get_mm_rss(p->mm);
+	if (!rss)
+		rss = nr_scan_pages;
+
+	rss = round_up(rss, nr_scan_pages);
+	return rss / nr_scan_pages;
+}
+
+/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+	unsigned int scan, floor;
+	unsigned int windows = 1;
+
+	if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+		windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+	floor = 1000 / windows;
+
+	scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+	return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+	unsigned int smin = task_scan_min(p);
+	unsigned int smax;
+
+	/* Watch for min being lower than max due to floor calculations */
+	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+	return max(smin, smax);
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_scan_period_max = task_scan_max(p);
 
 	/* FIXME: Scheduling placement policy hints go here */
 }
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
-        if (!migrated)
-		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+        if (!migrated) {
+		/* Initialise if necessary */
+		if (!p->numa_scan_period_max)
+			p->numa_scan_period_max = task_scan_max(p);
+
+		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
 
 	task_numa_placement(p);
 }
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
+	unsigned long nr_pte_updates = 0;
 	long pages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
 	 */
 	migrate = mm->numa_next_reset;
 	if (time_after(now, migrate)) {
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+		p->numa_scan_period = task_scan_min(p);
 		next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
 		xchg(&mm->numa_next_reset, next_scan);
 	}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
 	if (time_before(now, migrate))
 		return;
 
-	if (p->numa_scan_period == 0)
-		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+	if (p->numa_scan_period == 0) {
+		p->numa_scan_period_max = task_scan_max(p);
+		p->numa_scan_period = task_scan_min(p);
+	}
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			pages -= change_prot_numa(vma, start, end);
+			nr_pte_updates += change_prot_numa(vma, start, end);
+
+			/*
+			 * Scan sysctl_numa_balancing_scan_size but ensure that
+			 * at least one PTE is updated so that unused virtual
+			 * address space is quickly skipped.
+			 */
+			if (nr_pte_updates)
+				pages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
 			if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 
 	if (now - curr->node_stamp > period) {
 		if (!curr->node_stamp)
-			curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+			curr->numa_scan_period = task_scan_min(curr);
 		curr->node_stamp += period;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-- 
1.8.1.4


* [PATCH 15/50] sched: numa: Correct adjustment of numa_scan_period
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

numa_scan_period is in milliseconds, not jiffies. Properly placed pages
slow the scanning rate, but adding the equivalent of 10 jiffies to
numa_scan_period means that the rate at which scanning slows depends on HZ,
which is confusing. Get rid of the jiffies_to_msecs() conversion and treat
the increment as a plain 10ms.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23fd1f3..29ba117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,7 +914,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period_max = task_scan_max(p);
 
 		p->numa_scan_period = min(p->numa_scan_period_max,
-			p->numa_scan_period + jiffies_to_msecs(10));
+			p->numa_scan_period + 10);
 	}
 
 	task_numa_placement(p);
-- 
1.8.1.4


* [PATCH 16/50] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 19 ++++++++++++++++---
 mm/mprotect.c    | 14 ++++++++++----
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f75a6..065a31d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1474,6 +1474,12 @@ out:
 	return ret;
 }
 
+/*
+ * Returns
+ *  - 0 if PMD could not be locked
+ *  - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
@@ -1482,9 +1488,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
-		entry = pmdp_get_and_clear(mm, addr, pmd);
+		ret = 1;
 		if (!prot_numa) {
+			entry = pmdp_get_and_clear(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
+			ret = HPAGE_PMD_NR;
 			BUG_ON(pmd_write(entry));
 		} else {
 			struct page *page = pmd_page(*pmd);
@@ -1496,12 +1504,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			if (page_mapcount(page) == 1 &&
 			    !is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
+				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
+				ret = HPAGE_PMD_NR;
 			}
 		}
-		set_pmd_at(mm, addr, pmd, entry);
+
+		/* Set PMD if cleared earlier */
+		if (ret == HPAGE_PMD_NR)
+			set_pmd_at(mm, addr, pmd, entry);
+
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		ret = 1;
 	}
 
 	return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index faa499e..1f9b54b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -151,10 +151,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
-			else if (change_huge_pmd(vma, pmd, addr, newprot,
-						 prot_numa)) {
-				pages++;
-				continue;
+			else {
+				int nr_ptes = change_huge_pmd(vma, pmd, addr,
+						newprot, prot_numa);
+
+				if (nr_ptes) {
+					if (nr_ptes == HPAGE_PMD_NR)
+						pages++;
+
+					continue;
+				}
 			}
 			/* fall through */
 		}
-- 
1.8.1.4


* [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently every non-present PTE is
accounted for as an update, incurring a TLB flush that is only actually
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1f9b54b..1e9cef0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				make_migration_entry_read(&entry);
 				set_pte_at(mm, addr, pte,
 					swp_entry_to_pte(entry));
+
+				pages++;
 			}
-			pages++;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
1.8.1.4


* [PATCH 18/50] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA PTE scanning slows when NUMA hinting faults are trapped but no pages
are migrated. For long-lived but idle processes there may be no faults at
all, yet the scan rate will stay high and just waste CPU. This patch slows
the scan rate for processes that are not trapping faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29ba117..779ebd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)
 
 out:
 	/*
+	 * If the whole process was scanned without updates then no NUMA
+	 * hinting faults are being recorded and scan rate should be lower.
+	 */
+	if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+		p->numa_scan_period = min(p->numa_scan_period_max,
+			p->numa_scan_period << 1);
+
+		next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+		mm->numa_next_scan = next_scan;
+	}
+
+	/*
 	 * It is possible to reach the end of the VMA list but the last few
 	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
 	 * would find the !migratable VMA on the next scan but not reset the
-- 
1.8.1.4


* [PATCH 19/50] sched: Track NUMA hinting faults on per-node basis
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:31   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch tracks which nodes NUMA hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 11 ++++++++++-
 kernel/sched/sched.h  | 12 ++++++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 49b426e..dfba435 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,8 @@ struct task_struct {
 	unsigned int numa_scan_period_max;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+
+	unsigned long *numa_faults;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d7a33a..dbc2de6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1644,6 +1644,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
@@ -1905,6 +1906,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 779ebd7..ebd24c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
-	/* FIXME: Allocate task-specific structure for placement policy here */
+	/* Allocate buffer to track faults on a per-node basis */
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		if (!p->numa_faults)
+			return;
+	}
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
 	}
 
 	task_numa_placement(p);
+
+	p->numa_faults[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c17661..46c2068 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 #include <linux/tick.h>
+#include <linux/slab.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
@@ -553,6 +554,17 @@ static inline u64 rq_clock_task(struct rq *rq)
 	return rq->clock_task;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
1.8.1.4


* [PATCH 20/50] sched: Select a preferred node with the most numa hinting faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dfba435..d6ec68a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1336,6 +1336,7 @@ struct task_struct {
 	struct callback_head numa_work;
 
 	unsigned long *numa_faults;
+	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbc2de6..0235ab8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1643,6 +1643,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebd24c0..8c60822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 
 	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
 		return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_scan_seq = seq;
 	p->numa_scan_period_max = task_scan_max(p);
 
-	/* FIXME: Scheduling placement policy hints go here */
+	/* Find the node with the highest number of faults */
+	for_each_online_node(nid) {
+		unsigned long faults = p->numa_faults[nid];
+		p->numa_faults[nid] >>= 1;
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_nid = nid;
+		}
+	}
+
+	/* Update the tasks preferred node if necessary */
+	if (max_faults && max_nid != p->numa_preferred_nid)
+		p->numa_preferred_nid = max_nid;
 }
 
 /*
-- 
1.8.1.4


* [PATCH 21/50] sched: Update NUMA hinting faults once per scan
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 13 +++++++++++++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d6ec68a..84fb883 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1335,7 +1335,20 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	/*
+	 * Exponential decaying average of faults on a per-node basis.
+	 * Scheduling placement decisions are made based on these counts.
+	 * The values remain static for the duration of a PTE scan
+	 */
 	unsigned long *numa_faults;
+
+	/*
+	 * numa_faults_buffer records faults per node during the current
+	 * scan window. When the scan completes, the counts in numa_faults
+	 * decay and these values are copied.
+	 */
+	unsigned long *numa_faults_buffer;
+
 	int numa_preferred_nid;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0235ab8..2e8f3e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1646,6 +1646,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
+	p->numa_faults_buffer = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c60822..c2fefa5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = p->numa_faults[nid];
+		unsigned long faults;
+
+		/* Decay existing window and copy faults since last scan */
 		p->numa_faults[nid] >>= 1;
+		p->numa_faults[nid] += p->numa_faults_buffer[nid];
+		p->numa_faults_buffer[nid] = 0;
+
+		faults = p->numa_faults[nid];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * nr_node_ids;
 
-		p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+		/* numa_faults and numa_faults_buffer share the allocation */
+		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
 		if (!p->numa_faults)
 			return;
+
+		BUG_ON(p->numa_faults_buffer);
+		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
 	}
 
 	/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults[node] += pages;
+	p->numa_faults_buffer[node] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
-- 
1.8.1.4


* [PATCH 22/50] sched: Favour moving tasks towards the preferred node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch favours moving tasks towards the NUMA node that recorded a
higher number of NUMA hinting faults during active load balancing. Ideally
this is self-reinforcing as the longer the task runs on that node, the more
faults it should incur, causing task_numa_placement to keep the task running
on that node. In reality a big weakness is that the node's CPUs can be
overloaded and it would be more efficient to queue tasks on an idle node
and migrate the memory to the new node. That would require additional smarts
in the balancer, so for now the balancer simply prefers to place the task on
the preferred node for a number of PTE scans, controlled by the
numa_balancing_settle_count sysctl. Once settle_count scans have completed,
the scheduler is free to place the task on an alternative node if the load
is imbalanced.
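
The following is a minimal, self-contained sketch (not part of the patch) of
the decision the new migrate_improves_locality() check makes. The struct,
its field names and the SETTLE_COUNT constant are illustrative stand-ins for
the corresponding task_struct fields and the sysctl default.

#include <stdbool.h>

#define SETTLE_COUNT 3			/* stand-in for the sysctl default */

struct numa_task {
	int preferred_nid;		/* node picked by task_numa_placement() */
	int migrate_seq;		/* scans completed since it was picked */
	const unsigned long *faults;	/* decayed per-node fault counts */
};

/* Should the balancer favour moving the task from src_nid to dst_nid? */
static bool improves_locality(const struct numa_task *t, int src_nid,
			      int dst_nid)
{
	if (!t->faults || src_nid == dst_nid)
		return false;

	/* Once the task has settled, stop pushing it towards the node */
	if (t->migrate_seq >= SETTLE_COUNT)
		return false;

	return dst_nid == t->preferred_nid ||
	       t->faults[dst_nid] > t->faults[src_nid];
}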

[srikar@linux.vnet.ibm.com: Fixed statistics]
[peterz@infradead.org: Tunable and use higher faults instead of preferred]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/kernel.txt |  8 +++++-
 include/linux/sched.h           |  1 +
 kernel/sched/core.c             |  3 +-
 kernel/sched/fair.c             | 63 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h         |  7 +++++
 kernel/sysctl.c                 |  7 +++++
 6 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ad8d4f5..23ff00a 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
 feature is too high then the rate the kernel samples for NUMA hinting
 faults may be controlled by the numa_balancing_scan_period_min_ms,
 numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
 
 ==============================================================
 
@@ -419,6 +420,11 @@ scanned for a given scan.
 numa_balancing_scan_period_reset is a blunt instrument that controls how
 often a tasks scan delay is reset to detect sudden changes in task behaviour.
 
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
 ==============================================================
 
 osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84fb883..a2e661d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -776,6 +776,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e8f3e2..0dbd5cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1641,7 +1641,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_migrate_seq = 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
@@ -5667,6 +5667,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2fefa5..216908c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
 	return max(smin, smax);
 }
 
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_scan_seq == seq)
 		return;
 	p->numa_scan_seq = seq;
+	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
 	/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Update the tasks preferred node if necessary */
-	if (max_faults && max_nid != p->numa_preferred_nid)
+	if (max_faults && max_nid != p->numa_preferred_nid) {
 		p->numa_preferred_nid = max_nid;
+		p->numa_migrate_seq = 0;
+	}
 }
 
 /*
@@ -4024,6 +4036,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+	    !(env->sd->flags & SD_NUMA)) {
+		return false;
+	}
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (dst_nid == p->numa_preferred_nid ||
+	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+		return true;
+
+	return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -4081,11 +4125,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) task is cache cold, or
-	 * 2) too many balance attempts have failed.
+	 * 1) destination numa is preferred
+	 * 2) task is cache cold, or
+	 * 3) too many balance attempts have failed.
 	 */
-
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+	if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+		if (tsk_cache_hot) {
+			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+			schedstat_inc(p, se.statistics.nr_forced_migrations);
+		}
+#endif
+		return 1;
+	}
+
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
  */
 #ifdef CONFIG_NUMA_BALANCING
 SCHED_FEAT(NUMA,	false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 07f6fc4..0015fb9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "numa_balancing_settle_count",
+		.data           = &sysctl_numa_balancing_settle_count,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 23/50] sched: Resist moving tasks towards nodes with fewer hinting faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with fewer recorded
faults.
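
A short sketch (not part of the patch, parameter names are illustrative) of
how the two locality checks combine in the migration decision once this
patch is applied: a locality-degrading move is treated like moving a
cache-hot task, while a locality-improving move is always allowed.

#include <stdbool.h>

static bool allow_migration(bool improves_locality, bool degrades_locality,
			    bool cache_hot, int nr_balance_failed,
			    int cache_nice_tries)
{
	/* Moving towards a node with fewer faults counts as cache hot */
	bool hot = cache_hot || degrades_locality;

	/* Moves towards nodes with more recorded faults are aggressive */
	if (improves_locality)
		return true;

	/* Otherwise migrate only if cold or the balancer keeps failing */
	return !hot || nr_balance_failed > cache_nice_tries;
}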

[mgorman@suse.de: changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c     | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  8 ++++++++
 2 files changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 216908c..5649280 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4060,12 +4060,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 
 	return false;
 }
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	int src_nid, dst_nid;
+
+	if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+		return false;
+
+	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+		return false;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	if (src_nid == dst_nid ||
+	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+		return false;
+
+	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ 		return true;
+ 
+ 	return false;
+}
+
 #else
 static inline bool migrate_improves_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
 	return false;
 }
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+					     struct lb_env *env)
+{
+	return false;
+}
 #endif
 
 /*
@@ -4130,6 +4161,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) too many balance attempts have failed.
 	 */
 	tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+	if (!tsk_cache_hot)
+		tsk_cache_hot = migrate_degrades_locality(p, env);
 
 	if (migrate_improves_locality(p, env)) {
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA,	false)
  * balancing.
  */
 SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 24/50] sched: Reschedule task on preferred NUMA node once selected
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is
running on that node at the time, so this patch reschedules the task to
run on the idlest CPU of the preferred node at the time the node is
selected. This avoids waiting for the load balancer to make a decision.
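
The CPU selection performed by find_idlest_cpu_node() can be illustrated by
the self-contained function below; it is not the kernel code, and the plain
per-CPU arrays are an assumption of the example (the kernel iterates
cpumask_of_node() and reads weighted_cpuload()).

#include <limits.h>

/* Pick the least loaded CPU on node nid, falling back to fallback_cpu */
static int idlest_cpu_on_node(const unsigned long *cpu_load,
			      const int *cpu_node, int nr_cpus,
			      int nid, int fallback_cpu)
{
	unsigned long min_load = ULONG_MAX;
	int cpu, idlest = fallback_cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu_node[cpu] != nid)
			continue;
		if (cpu_load[cpu] < min_load) {
			min_load = cpu_load[cpu];
			idlest = cpu;
		}
	}
	return idlest;
}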

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c  | 19 +++++++++++++++++++
 kernel/sched/fair.c  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0dbd5cd..e94509d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4368,6 +4368,25 @@ fail:
 	return ret;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+	struct migration_arg arg = { p, target_cpu };
+	int curr_cpu = task_cpu(p);
+
+	if (curr_cpu == target_cpu)
+		return 0;
+
+	if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+		return -EINVAL;
+
+	/* TODO: This is not properly updating schedstats */
+
+	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
 /*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5649280..350c411 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, idlest_cpu = this_cpu;
+
+	BUG_ON(cpu_to_node(this_cpu) == nid);
+
+	rcu_read_lock();
+	for_each_cpu(i, cpumask_of_node(nid)) {
+		load = weighted_cpuload(i);
+
+		if (load < min_load) {
+			min_load = load;
+			idlest_cpu = i;
+		}
+	}
+	rcu_read_unlock();
+
+	return idlest_cpu;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/* Update the tasks preferred node if necessary */
+	/*
+	 * Record the preferred node as the node with the most faults,
+	 * requeue the task to be running on the idlest CPU on the
+	 * preferred node and reset the scanning rate to recheck
+	 * the working set placement.
+	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
+		int preferred_cpu;
+
+		/*
+		 * If the task is not on the preferred node then find the most
+		 * idle CPU to migrate to.
+		 */
+		preferred_cpu = task_cpu(p);
+		if (cpu_to_node(preferred_cpu) != max_nid) {
+			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+							     max_nid);
+		}
+
+		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 0;
+		migrate_task_to(p, preferred_cpu);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46c2068..778f875 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,6 +555,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 25/50] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
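
A sketch of the memory layout the patch introduces, written as standalone C
rather than the kernel code: each node gets two counters (shared and
private, selected by task_faults_idx()) and numa_faults_buffer is simply the
second half of one shared allocation. calloc() stands in for kzalloc().

#include <stdlib.h>

#define FAULT_SLOTS 2		/* index 0: shared, index 1: private */

static inline int faults_idx(int nid, int priv)
{
	return FAULT_SLOTS * nid + priv;
}

/* One zeroed allocation backs both the decayed counts and the scan window */
static unsigned long *alloc_fault_stats(int nr_node_ids,
					unsigned long **buffer)
{
	size_t per_array = FAULT_SLOTS * nr_node_ids;
	unsigned long *faults = calloc(2 * per_array, sizeof(unsigned long));

	if (faults)
		*buffer = faults + per_array;	/* numa_faults_buffer */
	return faults;
}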

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  5 +++--
 kernel/sched/fair.c   | 46 +++++++++++++++++++++++++++++++++++-----------
 mm/huge_memory.c      |  5 +++--
 mm/memory.c           |  8 ++++++--
 4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2e661d..6eb8fa6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1430,10 +1430,11 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
 extern void set_numabalancing_state(bool enabled);
 #else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+				   bool migrated)
 {
 }
 static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350c411..108f357 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
 
+static inline int task_faults_idx(int nid, int priv)
+{
+	return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_faults)
+		return 0;
+
+	return p->numa_faults[task_faults_idx(nid, 0)] +
+		p->numa_faults[task_faults_idx(nid, 1)];
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 
 
@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults;
+		int priv, i;
 
-		/* Decay existing window and copy faults since last scan */
-		p->numa_faults[nid] >>= 1;
-		p->numa_faults[nid] += p->numa_faults_buffer[nid];
-		p->numa_faults_buffer[nid] = 0;
+		for (priv = 0; priv < 2; priv++) {
+			i = task_faults_idx(nid, priv);
 
-		faults = p->numa_faults[nid];
+			/* Decay existing window, copy faults since last scan */
+			p->numa_faults[i] >>= 1;
+			p->numa_faults[i] += p->numa_faults_buffer[i];
+			p->numa_faults_buffer[i] = 0;
+		}
+
+		/* Find maximum private faults */
+		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
+	int priv;
 
 	if (!numabalancing_enabled)
 		return;
 
+	/* For now, do not attempt to detect private/shared accesses */
+	priv = 1;
+
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
-		int size = sizeof(*p->numa_faults) * nr_node_ids;
+		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
 
 		/* numa_faults and numa_faults_buffer share the allocation */
 		p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 			return;
 
 		BUG_ON(p->numa_faults_buffer);
-		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
 	}
 
 	/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
-	p->numa_faults_buffer[node] += pages;
+	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4099,7 +4123,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 		return false;
 
 	if (dst_nid == p->numa_preferred_nid ||
-	    p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+	    task_faults(p, dst_nid) > task_faults(p, src_nid))
 		return true;
 
 	return false;
@@ -4123,7 +4147,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
 		return false;
 
-	if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
  		return true;
  
  	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 065a31d..ca66a8a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid;
+	int target_nid, last_nid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1305,6 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
+	last_nid = page_nid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1376,7 +1377,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 86c3caf..bd016c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3547,6 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
+	int last_nid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3577,6 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	last_nid = page_nid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3592,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(page_nid, 1, migrated);
+		task_numa_fault(last_nid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3607,6 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	int last_nid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,6 +3657,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(page_mapcount(page) != 1))
 			continue;
 
+		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3666,7 +3670,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(page_nid, 1, migrated);
+			task_numa_fault(last_nid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 26/50] sched: Check current->mm before allocating NUMA faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

task_numa_placement checks current->mm, but only after the buffers for
faults have already been uselessly allocated. Move the check earlier.
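
An illustrative toy version of the fixed ordering (hypothetical names, not
the kernel code): cheap preconditions such as a missing mm are rejected
before any buffer is allocated.

#include <stdbool.h>
#include <stdlib.h>

struct toy_task {
	void *mm;			/* NULL for kernel threads such as ksmd */
	unsigned long *numa_faults;
	int nr_node_ids;
};

static bool record_numa_fault(struct toy_task *p, int nid)
{
	if (!p->mm)			/* bail out before allocating anything */
		return false;

	if (!p->numa_faults) {
		p->numa_faults = calloc(p->nr_node_ids, sizeof(unsigned long));
		if (!p->numa_faults)
			return false;
	}

	p->numa_faults[nid]++;
	return true;
}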

[peterz@infradead.org: Identified the problem]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 108f357..e259241 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
 	int seq, nid, max_nid = -1;
 	unsigned long max_faults = 0;
 
-	if (!p->mm)	/* for example, ksmd faulting in a user's mm */
-		return;
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
 		return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!numabalancing_enabled)
 		return;
 
+	/* for example, ksmd faulting in a user's mm */
+	if (!p->mm)
+		return;
+
 	/* For now, do not attempt to detect private/shared accesses */
 	priv = 1;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the nodes
whose tasks are using them, but it also ignores quite a lot of data.

This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages. The ordering is so that the impact of
the shared/private detection can be easily measured. Note that the patch
does not migrate shared, file-backed pages within VMAs marked VM_EXEC as
these are generally shared library pages. Migrating such pages is not
beneficial as there is an expectation that they are read-shared between
caches, and iTLB and iCache pressure is generally low.
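
The filter that remains in migrate_misplaced_page() can be sketched as
below; the toy_page struct and TOY_VM_EXEC flag are illustrative stand-ins,
while the kernel uses page_mapcount(), page_is_file_cache() and
vma->vm_flags & VM_EXEC. Shared pages are now eligible for migration except
for file pages mapped in multiple processes within executable mappings,
which are most likely shared libraries.

#include <stdbool.h>

#define TOY_VM_EXEC 0x00000004UL	/* stand-in for VM_EXEC */

struct toy_page {
	int mapcount;
	bool file_backed;
};

/* Skip NUMA migration only for probable shared-library pages */
static bool skip_numa_migration(const struct toy_page *page,
				unsigned long vm_flags)
{
	return page->mapcount != 1 && page->file_backed &&
	       (vm_flags & TOY_VM_EXEC);
}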

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/migrate.h |  7 ++++---
 mm/huge_memory.c        |  8 ++------
 mm/memory.c             |  7 ++-----
 mm/migrate.c            | 17 ++++++-----------
 mm/mprotect.c           |  4 +---
 5 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+				  struct vm_area_struct *vma, int node);
 extern bool migrate_ratelimited(int node);
 #else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+					 struct vm_area_struct *vma, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ca66a8a..a8e624e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1498,12 +1498,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			struct page *page = pmd_page(*pmd);
 
-			/*
-			 * Only check non-shared pages. See change_pte_range
-			 * for comment on why the zero page is not modified
-			 */
-			if (page_mapcount(page) == 1 &&
-			    !is_huge_zero_page(page) &&
+			/* See change_pte_range about the zero page */
+			if (!is_huge_zero_page(page) &&
 			    !pmd_numa(*pmd)) {
 				entry = pmdp_get_and_clear(mm, addr, pmd);
 				entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index bd016c2..e335ec0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3588,7 +3588,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated)
 		page_nid = target_nid;
 
@@ -3653,16 +3653,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
 			continue;
-		/* only check non-shared pages */
-		if (unlikely(page_mapcount(page) != 1))
-			continue;
 
 		last_nid = page_nid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
-			migrated = migrate_misplaced_page(page, target_nid);
+			migrated = migrate_misplaced_page(page, vma, target_nid);
 			if (migrated)
 				page_nid = target_nid;
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f0c244..08ac3ba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1596,7 +1596,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+			   int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
@@ -1604,10 +1605,11 @@ int migrate_misplaced_page(struct page *page, int node)
 	LIST_HEAD(migratepages);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
+	 * Don't migrate file pages that are mapped in multiple processes
+	 * with execute permissions as they are probably shared libraries.
 	 */
-	if (page_mapcount(page) != 1)
+	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+	    (vma->vm_flags & VM_EXEC))
 		goto out;
 
 	/*
@@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 
 	/*
-	 * Don't migrate pages that are mapped in multiple processes.
-	 * TODO: Handle false sharing detection instead of this hammer
-	 */
-	if (page_mapcount(page) != 1)
-		goto out_dropref;
-
-	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e9cef0..4a21819 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,9 +77,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					if (last_nid != this_nid)
 						all_same_node = false;
 
-					/* only check non-shared pages */
-					if (!pte_numa(oldpte) &&
-					    page_mapcount(page) == 1) {
+					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
 						updated = true;
 					}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 28/50] sched: Remove check that skips small VMAs
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

task_numa_work skips small VMAs. At the time, the logic was to reduce the
scanning overhead, which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.
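
Purely as a sketch of the suggested future direction (none of these
structures exist in this series; the names are invented for illustration):

	/* Hypothetical: remember ranges where hinting faults were seen so a
	 * later scan could prioritise them instead of filtering VMAs by size. */
	struct numa_fault_range {
		unsigned long start;
		unsigned long end;
		struct list_head list;
	};

	static bool range_recently_faulted(struct list_head *ranges,
					   unsigned long addr)
	{
		struct numa_fault_range *r;

		list_for_each_entry(r, ranges, list)
			if (addr >= r->start && addr < r->end)
				return true;
		return false;
	}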

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e259241..2d04112 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
-		/* Skip small VMAs. They are not likely to be of relevance */
-		if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
-			continue;
-
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 29/50] sched: Set preferred NUMA node based on number of private faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that, in general, the private
accesses are more important for cpu->memory locality. Also, no new
infrastructure is required to treat private pages properly, but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. The node information is then used
to determine if a page needs to migrate and the PID information is used to
detect private/shared accesses. The preferred NUMA node is selected based
on where the maximum number of approximately private faults were measured.
Shared faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will become compute overloaded and the
tasks will later be scheduled away, only to bounce back again.
Alternatively the shared tasks would just bounce around nodes because the
fault information is effectively noise. Either way, accounting for shared
faults the same as private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large
shared array. The shared array would dominate the number of faults and
its node would be selected as the preferred node even though that is the
wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.
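
For illustration, the encoding amounts to packing the node id above the
low PID bits. This is a sketch with hard-coded constants and "ex_" names
invented here; the real definitions depend on NODES_SHIFT and live in
page-flags-layout.h:

	#define EX_PID_BITS	8
	#define EX_PID_MASK	((1 << EX_PID_BITS) - 1)

	static inline int ex_nid_pid_to_nidpid(int nid, int pid)
	{
		return (nid << EX_PID_BITS) | (pid & EX_PID_MASK);
	}

	static inline int ex_nidpid_to_nid(int nidpid)
	{
		return nidpid >> EX_PID_BITS;
	}

	static inline int ex_nidpid_to_pid(int nidpid)
	{
		return nidpid & EX_PID_MASK;
	}

A fault is then treated as private when the faulting task's low PID bits
match ex_nidpid_to_pid() of the value previously stored on the page; tasks
whose PIDs share the low 8 bits are the collision case mentioned above.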

[riel@redhat.com: Fix compilation error when !NUMA_BALANCING]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 89 +++++++++++++++++++++++++++++----------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 28 +++++++-----
 kernel/sched/fair.c               | 12 ++++--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    |  8 ++--
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 26 ++++++++----
 mm/page_alloc.c                   |  4 +-
 12 files changed, 149 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..0a0db6c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,48 +668,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
 {
-	page->_last_nid = -1;
+	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f12073..f46378e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d04112..223e1f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (!nidpid_pid_unset(last_nidpid))
+		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	else
+		priv = 1;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a8e624e..622bc7e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nid = -1;
+	int target_nid, last_nidpid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1377,7 +1377,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1692,7 +1692,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index e335ec0..948ec32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3547,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nid;
+	int last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3578,7 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3594,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, 1, migrated);
+		task_numa_fault(last_nidpid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3609,7 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,7 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3667,7 +3667,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nid, page_nid, 1, migrated);
+			task_numa_fault(last_nidpid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4baf12e..8e2a364 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2292,9 +2292,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2317,8 +2319,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 08ac3ba..f56ca20 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4a21819..70ec934 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
+	bool all_same_nidpid = true;
 	int last_nid = -1;
+	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -71,11 +72,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * hits on the zero page
 				 */
 				if (page && !is_zero_pfn(page_to_pfn(page))) {
-					int this_nid = page_to_nid(page);
+					int nidpid = page_nidpid_last(page);
+					int this_nid = nidpid_to_nid(nidpid);
+					int this_pid = nidpid_to_pid(nidpid);
+
 					if (last_nid == -1)
 						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
+					if (last_pid == -1)
+						last_pid = this_pid;
+					if (last_nid != this_nid ||
+					    last_pid != this_pid) {
+						all_same_nidpid = false;
+					}
 
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
@@ -115,7 +123,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
+	*ret_all_same_nidpid = all_same_nidpid;
 	return pages;
 }
 
@@ -142,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
+	bool all_same_nidpid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
+				 dirty_accountable, prot_numa, &all_same_nidpid);
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -174,7 +182,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node)
+		if (prot_numa && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..7bf960e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 29/50] sched: Set preferred NUMA node based on number of private faults
@ 2013-09-10  9:32   ` Mel Gorman
  0 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

[riel@redhat.com: Fix complication error when !NUMA_BALANCING]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 89 +++++++++++++++++++++++++++++----------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 28 +++++++-----
 kernel/sched/fair.c               | 12 ++++--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    |  8 ++--
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 26 ++++++++----
 mm/page_alloc.c                   |  4 +-
 12 files changed, 149 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..0a0db6c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,48 +668,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
 {
-	return xchg(&page->_last_nid, nid);
+	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
 {
-	return page->_last_nid;
+	return nidpid & LAST__PID_MASK;
 }
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
 {
-	page->_last_nid = -1;
+	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+	return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+	return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+	page->_last_nidpid = -1;
 }
 #else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
-	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
 }
 
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
 
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
 {
-	int nid = (1 << LAST_NID_SHIFT) - 1;
+	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-	page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 }
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
 #else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+	return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+	return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
 {
 }
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f12073..f46378e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-	int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+	int _last_nidpid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
  * The last is when there is insufficient space in page->flags and a separate
  * lookup is necessary.
  *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: |       NODE     | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE |          ... | FLAGS |
- *         " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
+ *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
 #else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
 #else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
 #endif
 
 /*
@@ -81,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d04112..223e1f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/* For now, do not attempt to detect private/shared accesses */
-	priv = 1;
+	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (!nidpid_pid_unset(last_nidpid))
+		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	else
+		priv = 1;
 
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a8e624e..622bc7e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nid = -1;
+	int target_nid, last_nidpid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1377,7 +1377,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1692,7 +1692,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nid_xchg_last(page_tail, page_nid_last(page));
+		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index e335ec0..948ec32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3547,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nid;
+	int last_nidpid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3578,7 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nid = page_nid_last(page);
+	last_nidpid = page_nidpid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3594,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nid, page_nid, 1, migrated);
+		task_numa_fault(last_nidpid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3609,7 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nid;
+	int last_nidpid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,7 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3667,7 +3667,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nid, page_nid, 1, migrated);
+			task_numa_fault(last_nidpid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4baf12e..8e2a364 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2292,9 +2292,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nid;
+		int last_nidpid;
+		int this_nidpid;
 
 		polnid = numa_node_id();
+		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2317,8 +2319,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nid = page_nid_xchg_last(page, polnid);
-		if (last_nid != polnid)
+		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 08ac3ba..f56ca20 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nid_xchg_last(newpage, page_nid_last(page));
+		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
 
 	return newpage;
 }
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nid_xchg_last(new_page, page_nid_last(page));
+	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NID_WIDTH,
+		LAST_NIDPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnid %d\n",
+		"Section %d Node %d Zone %d Lastnidpid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NID_SHIFT);
+		LAST_NIDPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NID_PGSHIFT);
+		(unsigned long)LAST_NIDPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nid not in page flags");
+		"Last nidpid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
 {
 	unsigned long old_flags, flags;
-	int last_nid;
+	int last_nidpid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nid = page_nid_last(page);
+		last_nidpid = page_nidpid_last(page);
 
-		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
-		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nid;
+	return last_nidpid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4a21819..70ec934 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_node = true;
+	bool all_same_nidpid = true;
 	int last_nid = -1;
+	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -71,11 +72,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * hits on the zero page
 				 */
 				if (page && !is_zero_pfn(page_to_pfn(page))) {
-					int this_nid = page_to_nid(page);
+					int nidpid = page_nidpid_last(page);
+					int this_nid = nidpid_to_nid(nidpid);
+					int this_pid = nidpid_to_pid(nidpid);
+
 					if (last_nid == -1)
 						last_nid = this_nid;
-					if (last_nid != this_nid)
-						all_same_node = false;
+					if (last_pid == -1)
+						last_pid = this_pid;
+					if (last_nid != this_nid ||
+					    last_pid != this_pid) {
+						all_same_nidpid = false;
+					}
 
 					if (!pte_numa(oldpte)) {
 						ptent = pte_mknuma(ptent);
@@ -115,7 +123,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_node = all_same_node;
+	*ret_all_same_nidpid = all_same_nidpid;
 	return pages;
 }
 
@@ -142,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_node;
+	bool all_same_nidpid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_node);
+				 dirty_accountable, prot_numa, &all_same_nidpid);
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -174,7 +182,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node)
+		if (prot_numa && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..7bf960e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nid_reset_last(page);
+	page_nidpid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nid_reset_last(page);
+		page_nidpid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 30/50] sched: Do not migrate memory immediately after switching node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back, resulting in more
migrations.

The ideal would be for the scheduler not to migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration, but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
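
The effect on the hinting fault path can be sketched as follows (illustrative
only; the check is open-coded in mpol_misplaced() below rather than being a
separate helper):

/* Sketch: should a misplaced page follow the task right now? */
static bool numa_page_migration_allowed(struct task_struct *p, int target_nid)
{
	/*
	 * numa_migrate_seq == 0 means the load balancer has just moved the
	 * task; leave its memory alone until the task settles.
	 */
	if (target_nid != p->numa_preferred_nid && !p->numa_migrate_seq)
		return false;
	return true;
}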

Not-signed-off: Rik van Riel
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 18 ++++++++++++++++--
 mm/mempolicy.c      | 12 ++++++++++++
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e94509d..374da2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1641,7 +1641,7 @@ static void __sched_fork(struct task_struct *p)
 
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 0;
+	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 223e1f8..c2f1cf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
  * the preferred node but still allow the scheduler to move the task again if
  * the nodes CPUs are overloaded.
  */
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
 static inline int task_faults_idx(int nid, int priv)
 {
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)
 
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 0;
+		p->numa_migrate_seq = 1;
 		migrate_task_to(p, preferred_cpu);
 	}
 }
@@ -4074,6 +4074,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	set_task_cpu(p, env->dst_cpu);
 	activate_task(env->dst_rq, p, 0);
 	check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+	if (p->numa_preferred_nid != -1) {
+		int src_nid = cpu_to_node(env->src_cpu);
+		int dst_nid = cpu_to_node(env->dst_cpu);
+
+		/*
+		 * If the load balancer has moved the task then limit
+		 * migrations from taking place in the short term in
+		 * case this is a short-lived migration.
+		 */
+		if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+			p->numa_migrate_seq = 0;
+	}
+#endif
 }
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8e2a364..adc93b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2322,6 +2322,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
 		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
 			goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * If the scheduler has just moved us away from our
+		 * preferred node, do not bother migrating pages yet.
+		 * This way a short and temporary process migration will
+		 * not cause excessive memory migration.
+		 */
+		if (polnid != current->numa_preferred_nid &&
+				!current->numa_migrate_seq)
+			goto out;
+#endif
 	}
 
 	if (curnid != polnid)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 31/50] sched: Avoid overloading CPUs on a preferred NUMA node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for use when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than on the source CPU.
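
As a condensed sketch of the per-CPU balance test (the helper name below is
illustrative; the arithmetic mirrors the hunk in this patch):

/* Illustrative only: accept a destination CPU if it would not end up busier */
static bool numa_dst_cpu_acceptable(s64 src_eff_load, unsigned long dst_load,
				    unsigned long dst_power, long tg_load_delta)
{
	s64 dst_eff_load = 100;			/* no imbalance_pct bonus here */

	dst_eff_load *= dst_power;		/* scale by CPU capacity */
	dst_eff_load *= dst_load + tg_load_delta; /* cgroup-aware load */

	return dst_eff_load <= src_eff_load;
}

An idle destination CPU is taken immediately and, among acceptable CPUs, the
one with the lowest effective load wins.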

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2f1cf5..5f0388e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 }
 
 static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+struct numa_stats {
+	unsigned long load;
+	s64 eff_load;
+	unsigned long faults;
+};
 
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
-	unsigned long load, min_load = ULONG_MAX;
-	int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+	struct task_struct *p;
 
-	BUG_ON(cpu_to_node(this_cpu) == nid);
+	int src_cpu, src_nid;
+	int dst_cpu, dst_nid;
 
-	rcu_read_lock();
-	for_each_cpu(i, cpumask_of_node(nid)) {
-		load = weighted_cpuload(i);
+	struct numa_stats src_stats, dst_stats;
 
-		if (load < min_load) {
-			min_load = load;
-			idlest_cpu = i;
+	unsigned long best_load;
+	int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+	struct task_numa_env env = {
+		.p = p,
+		.src_cpu = task_cpu(p),
+		.src_nid = cpu_to_node(task_cpu(p)),
+		.dst_cpu = node_cpu,
+		.dst_nid = p->numa_preferred_nid,
+		.best_load = ULONG_MAX,
+		.best_cpu = task_cpu(p),
+	};
+	struct sched_domain *sd;
+	int cpu;
+	struct task_group *tg = task_group(p);
+	unsigned long weight;
+	bool balanced;
+	int imbalance_pct, idx = -1;
+
+	/*
+	 * Find the lowest common scheduling domain covering the nodes of both
+	 * the CPU the task is currently running on and the target NUMA node.
+	 */
+	rcu_read_lock();
+	for_each_domain(env.src_cpu, sd) {
+		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+			/*
+			 * busy_idx is used for the load decision as it is the
+			 * same index used by the regular load balancer for an
+			 * active cpu.
+			 */
+			idx = sd->busy_idx;
+			imbalance_pct = sd->imbalance_pct;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return idlest_cpu;
+	if (WARN_ON_ONCE(idx == -1))
+		return 0;
+
+	/*
+	 * XXX the below is mostly nicked from wake_affine(); we should
+	 * see about sharing a bit if at all possible; also it might want
+	 * some per entity weight love.
+	 */
+	weight = p->se.load.weight;
+	env.src_stats.load = source_load(env.src_cpu, idx);
+	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+	env.src_stats.eff_load *= power_of(env.src_cpu);
+	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+		env.dst_cpu = cpu;
+		env.dst_stats.load = target_load(cpu, idx);
+ 
+ 		/* If the CPU is idle, use it */
+		if (!env.dst_stats.load) {
+			env.best_cpu = cpu;
+			goto migrate;
+		}
+
+		/* Otherwise check the target CPU load */
+		env.dst_stats.eff_load = 100;
+		env.dst_stats.eff_load *= power_of(cpu);
+		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+		/*
+		 * Destination is considered balanced if the destination CPU is
+		 * less loaded than the source CPU. Unfortunately there is a
+		 * risk that a task running on a lightly loaded CPU will not
+		 * migrate to its preferred node due to load imbalances.
+		 */
+		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+		if (!balanced)
+			continue;
+
+		if (env.dst_stats.eff_load < env.best_load) {
+			env.best_load = env.dst_stats.eff_load;
+			env.best_cpu = cpu;
+		}
+	}
+
+migrate:
+	return migrate_task_to(p, env.best_cpu);
 }
 
 static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
 	 * the working set placement.
 	 */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int preferred_cpu;
-
-		/*
-		 * If the task is not on the preferred node then find the most
-		 * idle CPU to migrate to.
-		 */
-		preferred_cpu = task_cpu(p);
-		if (cpu_to_node(preferred_cpu) != max_nid) {
-			preferred_cpu = find_idlest_cpu_node(preferred_cpu,
-							     max_nid);
-		}
-
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		migrate_task_to(p, preferred_cpu);
+		task_numa_migrate(p);
 	}
 }
 
@@ -3274,7 +3348,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
-	if (!tg->parent)	/* the trivial, non-cgroup case */
+	if (!tg->parent || !wl)	/* the trivial, non-cgroup case */
 		return wl;
 
 	for_each_sched_entity(se) {
@@ -3327,8 +3401,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	return wl;
 }
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 32/50] sched: Retry migration of tasks to CPU on a preferred node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
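
The retry scheme, condensed (sketch only; task_numa_fault_retry() is an
illustrative name, the real check lives directly in task_numa_fault()):

static void numa_migrate_preferred(struct task_struct *p)
{
	p->numa_migrate_retry = 0;
	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
		return;				/* already where we want to be */

	if (task_numa_migrate(p) != 0)		/* failed: back off ~5 seconds */
		p->numa_migrate_retry = jiffies + 5 * HZ;
}

static void task_numa_fault_retry(struct task_struct *p)
{
	/* Re-attempt from the hinting fault path once the backoff expires */
	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
		numa_migrate_preferred(p);
}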

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 26 +++++++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6eb8fa6..3418b0b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1333,6 +1333,7 @@ struct task_struct {
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	unsigned int numa_scan_period_max;
+	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f0388e..5b4d94e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,19 @@ migrate:
 	return migrate_task_to(p, env.best_cpu);
 }
 
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+	/* Success if task is already running on preferred CPU */
+	p->numa_migrate_retry = 0;
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+		return;
+
+	/* Otherwise, try migrate to a CPU on the preferred node */
+	if (task_numa_migrate(p) != 0)
+		p->numa_migrate_retry = jiffies + HZ*5;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1;
@@ -1045,17 +1058,12 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * Record the preferred node as the node with the most faults,
-	 * requeue the task to be running on the idlest CPU on the
-	 * preferred node and reset the scanning rate to recheck
-	 * the working set placement.
-	 */
+	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
 		p->numa_preferred_nid = max_nid;
 		p->numa_migrate_seq = 1;
-		task_numa_migrate(p);
+		numa_migrate_preferred(p);
 	}
 }
 
@@ -1111,6 +1119,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 
 	task_numa_placement(p);
 
+	/* Retry task to preferred node migration if it previously failed */
+	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+		numa_migrate_preferred(p);
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 33/50] sched: numa: increment numa_migrate_seq when task runs in correct location
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task has settled if migration was
temporarily disabled, and that memory should migrate towards it.
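
Roughly, the counter now acts as a settle flag for memory migration (sketch
only; the helper below is not part of the series):

static bool numa_memory_migration_enabled(struct task_struct *p)
{
	/*
	 * 0: the load balancer recently moved the task, hold pages back.
	 * 1+: the task has settled (e.g. it is running on its preferred
	 *     node), so pages may follow it again.
	 */
	return p->numa_migrate_seq != 0;
}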

[mgorman@suse.de: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b4d94e..fd724bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
 {
 	/* Success if task is already running on preferred CPU */
 	p->numa_migrate_retry = 0;
-	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+	if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+		/*
+		 * If migration is temporarily disabled due to a task migration
+		 * then re-enable it now as the task is running on its
+		 * preferred node and memory should migrate locally
+		 */
+		if (!p->numa_migrate_seq)
+			p->numa_migrate_seq++;
 		return;
+	}
 
 	/* Otherwise, try migrate to a CPU on the preferred node */
 	if (task_numa_migrate(p) != 0)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information and so on. If we are never going to migrate
the page, it is overhead for no gain and, worse, a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
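
The skip test reads naturally as a predicate (illustrative refactoring only;
the patch open-codes the check in task_numa_work()):

/* Is it worth setting NUMA hinting faults in this VMA at all? */
static bool vma_worth_numa_hinting(struct vm_area_struct *vma)
{
	if (!vma->vm_mm)		/* special mappings such as the vdso */
		return false;

	/* Read-only file-backed mappings: shared library text, never migrated */
	if (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == VM_READ)
		return false;

	return true;
}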

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd724bc..5d244d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,16 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
+		/*
+		 * Shared library pages mapped by multiple processes are not
+		 * migrated as it is expected they are cache replicated. Avoid
+		 * hinting faults in read-only file-backed mappings or the vdso
+		 * as migrating the pages will be of marginal benefit.
+		 */
+		if (!vma->vm_mm ||
+		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+			continue;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 35/50] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Base page PMD faulting is meant to batch-handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids the
overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/mprotect.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 70ec934..191a89a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -154,6 +154,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 	pmd = pmd_offset(pud, addr);
 	do {
+		unsigned long this_pages;
+
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
@@ -173,8 +175,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma, pmd, addr, next, newprot,
+		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_nidpid);
+		pages += this_pages;
 
 		/*
 		 * If we are changing protections for NUMA hinting faults then
@@ -182,7 +185,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_nidpid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 36/50] stop_machine: Introduce stop_two_cpus()
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two CPUs, which we can do with on-stack structures and avoids
machine-wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If unordered, then
two CPUs calling stop_two_cpus() on each other simultaneously would attempt
to queue in the opposite order on each CPU, causing an AB-BA style deadlock.
By always having the lowest-numbered CPU do the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.
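
Illustrative usage (the real caller is added later in the series; the names
below are placeholders):

/* Runs on one of the two stopped CPUs with interrupts disabled */
static int swap_pair_fn(void *arg)
{
	/* ... swap the tasks described by arg ... */
	return 0;
}

static int swap_tasks_between(unsigned int cpu1, unsigned int cpu2, void *arg)
{
	/* CPU ordering for deadlock avoidance is handled internally */
	return stop_two_cpus(cpu1, cpu2, swap_pair_fn, arg);
}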

[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/stop_machine.h |   1 +
 kernel/stop_machine.c        | 272 +++++++++++++++++++++++++++----------------
 2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	return done.executed ? done.ret : -ENOENT;
 }
 
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+	/* Dummy starting state for thread. */
+	MULTI_STOP_NONE,
+	/* Awaiting everyone to be scheduled. */
+	MULTI_STOP_PREPARE,
+	/* Disable interrupts. */
+	MULTI_STOP_DISABLE_IRQ,
+	/* Run the function */
+	MULTI_STOP_RUN,
+	/* Exit */
+	MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+	int			(*fn)(void *);
+	void			*data;
+	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+	unsigned int		num_threads;
+	const struct cpumask	*active_cpus;
+
+	enum multi_stop_state	state;
+	atomic_t		thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+		      enum multi_stop_state newstate)
+{
+	/* Reset ack counter. */
+	atomic_set(&msdata->thread_ack, msdata->num_threads);
+	smp_wmb();
+	msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+	if (atomic_dec_and_test(&msdata->thread_ack))
+		set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+	struct multi_stop_data *msdata = data;
+	enum multi_stop_state curstate = MULTI_STOP_NONE;
+	int cpu = smp_processor_id(), err = 0;
+	unsigned long flags;
+	bool is_active;
+
+	/*
+	 * When called from stop_machine_from_inactive_cpu(), irq might
+	 * already be disabled.  Save the state and restore it on exit.
+	 */
+	local_save_flags(flags);
+
+	if (!msdata->active_cpus)
+		is_active = cpu == cpumask_first(cpu_online_mask);
+	else
+		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+	/* Simple state machine */
+	do {
+		/* Chill out and ensure we re-read multi_stop_state. */
+		cpu_relax();
+		if (msdata->state != curstate) {
+			curstate = msdata->state;
+			switch (curstate) {
+			case MULTI_STOP_DISABLE_IRQ:
+				local_irq_disable();
+				hard_irq_disable();
+				break;
+			case MULTI_STOP_RUN:
+				if (is_active)
+					err = msdata->fn(msdata->data);
+				break;
+			default:
+				break;
+			}
+			ack_state(msdata);
+		}
+	} while (curstate != MULTI_STOP_EXIT);
+
+	local_irq_restore(flags);
+	return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+	int cpu1;
+	int cpu2;
+	struct cpu_stop_work *work1;
+	struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+	struct irq_cpu_stop_queue_work_info *info = arg;
+	cpu_stop_queue_work(info->cpu1, info->work1);
+	cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+	int call_cpu;
+	struct cpu_stop_done done;
+	struct cpu_stop_work work1, work2;
+	struct irq_cpu_stop_queue_work_info call_args;
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = arg,
+		.num_threads = 2,
+		.active_cpus = cpumask_of(cpu1),
+	};
+
+	work1 = work2 = (struct cpu_stop_work){
+		.fn = multi_cpu_stop,
+		.arg = &msdata,
+		.done = &done
+	};
+
+	call_args = (struct irq_cpu_stop_queue_work_info){
+		.cpu1 = cpu1,
+		.cpu2 = cpu2,
+		.work1 = &work1,
+		.work2 = &work2,
+	};
+
+	cpu_stop_init_done(&done, 2);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+
+	/*
+	 * Queuing needs to be done by the lowest numbered CPU, to ensure
+	 * that works are always queued in the same order on every CPU.
+	 * This prevents deadlocks.
+	 */
+	call_cpu = min(cpu1, cpu2);
+
+	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+				 &call_args, 0);
+
+	wait_for_completion(&done.completion);
+	return done.executed ? done.ret : -ENOENT;
+}
+
 /**
  * stop_one_cpu_nowait - stop a cpu but don't wait for completion
  * @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
 
 #ifdef CONFIG_STOP_MACHINE
 
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
-	/* Dummy starting state for thread. */
-	STOPMACHINE_NONE,
-	/* Awaiting everyone to be scheduled. */
-	STOPMACHINE_PREPARE,
-	/* Disable interrupts. */
-	STOPMACHINE_DISABLE_IRQ,
-	/* Run the function */
-	STOPMACHINE_RUN,
-	/* Exit */
-	STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
-	int			(*fn)(void *);
-	void			*data;
-	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-	unsigned int		num_threads;
-	const struct cpumask	*active_cpus;
-
-	enum stopmachine_state	state;
-	atomic_t		thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
-		      enum stopmachine_state newstate)
-{
-	/* Reset ack counter. */
-	atomic_set(&smdata->thread_ack, smdata->num_threads);
-	smp_wmb();
-	smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
-	if (atomic_dec_and_test(&smdata->thread_ack))
-		set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
-	struct stop_machine_data *smdata = data;
-	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	int cpu = smp_processor_id(), err = 0;
-	unsigned long flags;
-	bool is_active;
-
-	/*
-	 * When called from stop_machine_from_inactive_cpu(), irq might
-	 * already be disabled.  Save the state and restore it on exit.
-	 */
-	local_save_flags(flags);
-
-	if (!smdata->active_cpus)
-		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
-		is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
-	/* Simple state machine */
-	do {
-		/* Chill out and ensure we re-read stopmachine_state. */
-		cpu_relax();
-		if (smdata->state != curstate) {
-			curstate = smdata->state;
-			switch (curstate) {
-			case STOPMACHINE_DISABLE_IRQ:
-				local_irq_disable();
-				hard_irq_disable();
-				break;
-			case STOPMACHINE_RUN:
-				if (is_active)
-					err = smdata->fn(smdata->data);
-				break;
-			default:
-				break;
-			}
-			ack_state(smdata);
-		}
-	} while (curstate != STOPMACHINE_EXIT);
-
-	local_irq_restore(flags);
-	return err;
-}
-
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
-					    .num_threads = num_online_cpus(),
-					    .active_cpus = cpus };
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = data,
+		.num_threads = num_online_cpus(),
+		.active_cpus = cpus,
+	};
 
 	if (!stop_machine_initialized) {
 		/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 		unsigned long flags;
 		int ret;
 
-		WARN_ON_ONCE(smdata.num_threads != 1);
+		WARN_ON_ONCE(msdata.num_threads != 1);
 
 		local_irq_save(flags);
 		hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	}
 
 	/* Set the initial state and stop all online cpus. */
-	set_state(&smdata, STOPMACHINE_PREPARE);
-	return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
 int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				  const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
+	struct multi_stop_data msdata = { .fn = fn, .data = data,
 					    .active_cpus = cpus };
 	struct cpu_stop_done done;
 	int ret;
 
 	/* Local CPU must be inactive and CPU hotplug in progress. */
 	BUG_ON(cpu_active(raw_smp_processor_id()));
-	smdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
+	msdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
 
 	/* No proper task established and can't sleep - busy wait for lock. */
 	while (!mutex_trylock(&stop_cpus_mutex))
 		cpu_relax();
 
 	/* Schedule work on other CPUs and execute directly for local CPU */
-	set_state(&smdata, STOPMACHINE_PREPARE);
+	set_state(&msdata, MULTI_STOP_PREPARE);
 	cpu_stop_init_done(&done, num_active_cpus());
-	queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+	queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
 			     &done);
-	ret = stop_machine_cpu_stop(&smdata);
+	ret = multi_cpu_stop(&msdata);
 
 	/* Busy wait for completion. */
 	while (!completion_done(&done.completion))
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 36/50] stop_machine: Introduce stop_two_cpus()
@ 2013-09-10  9:32   ` Mel Gorman
  0 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two CPUs, which we can do with on-stack structures and avoids
machine-wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If unordered, then
two CPUs calling stop_two_cpus() on each other simultaneously would attempt
to queue in the opposite order on each CPU, causing an AB-BA style deadlock.
By always having the lowest-numbered CPU do the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.

[riel@redhat.com: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/stop_machine.h |   1 +
 kernel/stop_machine.c        | 272 +++++++++++++++++++++++++++----------------
 2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
 	return done.executed ? done.ret : -ENOENT;
 }
 
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+	/* Dummy starting state for thread. */
+	MULTI_STOP_NONE,
+	/* Awaiting everyone to be scheduled. */
+	MULTI_STOP_PREPARE,
+	/* Disable interrupts. */
+	MULTI_STOP_DISABLE_IRQ,
+	/* Run the function */
+	MULTI_STOP_RUN,
+	/* Exit */
+	MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+	int			(*fn)(void *);
+	void			*data;
+	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+	unsigned int		num_threads;
+	const struct cpumask	*active_cpus;
+
+	enum multi_stop_state	state;
+	atomic_t		thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+		      enum multi_stop_state newstate)
+{
+	/* Reset ack counter. */
+	atomic_set(&msdata->thread_ack, msdata->num_threads);
+	smp_wmb();
+	msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+	if (atomic_dec_and_test(&msdata->thread_ack))
+		set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+	struct multi_stop_data *msdata = data;
+	enum multi_stop_state curstate = MULTI_STOP_NONE;
+	int cpu = smp_processor_id(), err = 0;
+	unsigned long flags;
+	bool is_active;
+
+	/*
+	 * When called from stop_machine_from_inactive_cpu(), irq might
+	 * already be disabled.  Save the state and restore it on exit.
+	 */
+	local_save_flags(flags);
+
+	if (!msdata->active_cpus)
+		is_active = cpu == cpumask_first(cpu_online_mask);
+	else
+		is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+	/* Simple state machine */
+	do {
+		/* Chill out and ensure we re-read multi_stop_state. */
+		cpu_relax();
+		if (msdata->state != curstate) {
+			curstate = msdata->state;
+			switch (curstate) {
+			case MULTI_STOP_DISABLE_IRQ:
+				local_irq_disable();
+				hard_irq_disable();
+				break;
+			case MULTI_STOP_RUN:
+				if (is_active)
+					err = msdata->fn(msdata->data);
+				break;
+			default:
+				break;
+			}
+			ack_state(msdata);
+		}
+	} while (curstate != MULTI_STOP_EXIT);
+
+	local_irq_restore(flags);
+	return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+	int cpu1;
+	int cpu2;
+	struct cpu_stop_work *work1;
+	struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+	struct irq_cpu_stop_queue_work_info *info = arg;
+	cpu_stop_queue_work(info->cpu1, info->work1);
+	cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both specified CPUs and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+	int call_cpu;
+	struct cpu_stop_done done;
+	struct cpu_stop_work work1, work2;
+	struct irq_cpu_stop_queue_work_info call_args;
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = arg,
+		.num_threads = 2,
+		.active_cpus = cpumask_of(cpu1),
+	};
+
+	work1 = work2 = (struct cpu_stop_work){
+		.fn = multi_cpu_stop,
+		.arg = &msdata,
+		.done = &done
+	};
+
+	call_args = (struct irq_cpu_stop_queue_work_info){
+		.cpu1 = cpu1,
+		.cpu2 = cpu2,
+		.work1 = &work1,
+		.work2 = &work2,
+	};
+
+	cpu_stop_init_done(&done, 2);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+
+	/*
+	 * Queuing needs to be done by the lowest numbered CPU, to ensure
+	 * that works are always queued in the same order on every CPU.
+	 * This prevents deadlocks.
+	 */
+	call_cpu = min(cpu1, cpu2);
+
+	smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+				 &call_args, 0);
+
+	wait_for_completion(&done.completion);
+	return done.executed ? done.ret : -ENOENT;
+}
+
 /**
  * stop_one_cpu_nowait - stop a cpu but don't wait for completion
  * @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);
 
 #ifdef CONFIG_STOP_MACHINE
 
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
-	/* Dummy starting state for thread. */
-	STOPMACHINE_NONE,
-	/* Awaiting everyone to be scheduled. */
-	STOPMACHINE_PREPARE,
-	/* Disable interrupts. */
-	STOPMACHINE_DISABLE_IRQ,
-	/* Run the function */
-	STOPMACHINE_RUN,
-	/* Exit */
-	STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
-	int			(*fn)(void *);
-	void			*data;
-	/* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
-	unsigned int		num_threads;
-	const struct cpumask	*active_cpus;
-
-	enum stopmachine_state	state;
-	atomic_t		thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
-		      enum stopmachine_state newstate)
-{
-	/* Reset ack counter. */
-	atomic_set(&smdata->thread_ack, smdata->num_threads);
-	smp_wmb();
-	smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
-	if (atomic_dec_and_test(&smdata->thread_ack))
-		set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
-	struct stop_machine_data *smdata = data;
-	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	int cpu = smp_processor_id(), err = 0;
-	unsigned long flags;
-	bool is_active;
-
-	/*
-	 * When called from stop_machine_from_inactive_cpu(), irq might
-	 * already be disabled.  Save the state and restore it on exit.
-	 */
-	local_save_flags(flags);
-
-	if (!smdata->active_cpus)
-		is_active = cpu == cpumask_first(cpu_online_mask);
-	else
-		is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
-	/* Simple state machine */
-	do {
-		/* Chill out and ensure we re-read stopmachine_state. */
-		cpu_relax();
-		if (smdata->state != curstate) {
-			curstate = smdata->state;
-			switch (curstate) {
-			case STOPMACHINE_DISABLE_IRQ:
-				local_irq_disable();
-				hard_irq_disable();
-				break;
-			case STOPMACHINE_RUN:
-				if (is_active)
-					err = smdata->fn(smdata->data);
-				break;
-			default:
-				break;
-			}
-			ack_state(smdata);
-		}
-	} while (curstate != STOPMACHINE_EXIT);
-
-	local_irq_restore(flags);
-	return err;
-}
-
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
-					    .num_threads = num_online_cpus(),
-					    .active_cpus = cpus };
+	struct multi_stop_data msdata = {
+		.fn = fn,
+		.data = data,
+		.num_threads = num_online_cpus(),
+		.active_cpus = cpus,
+	};
 
 	if (!stop_machine_initialized) {
 		/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 		unsigned long flags;
 		int ret;
 
-		WARN_ON_ONCE(smdata.num_threads != 1);
+		WARN_ON_ONCE(msdata.num_threads != 1);
 
 		local_irq_save(flags);
 		hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	}
 
 	/* Set the initial state and stop all online cpus. */
-	set_state(&smdata, STOPMACHINE_PREPARE);
-	return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
 int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				  const struct cpumask *cpus)
 {
-	struct stop_machine_data smdata = { .fn = fn, .data = data,
+	struct multi_stop_data msdata = { .fn = fn, .data = data,
 					    .active_cpus = cpus };
 	struct cpu_stop_done done;
 	int ret;
 
 	/* Local CPU must be inactive and CPU hotplug in progress. */
 	BUG_ON(cpu_active(raw_smp_processor_id()));
-	smdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
+	msdata.num_threads = num_active_cpus() + 1;	/* +1 for local */
 
 	/* No proper task established and can't sleep - busy wait for lock. */
 	while (!mutex_trylock(&stop_cpus_mutex))
 		cpu_relax();
 
 	/* Schedule work on other CPUs and execute directly for local CPU */
-	set_state(&smdata, STOPMACHINE_PREPARE);
+	set_state(&msdata, MULTI_STOP_PREPARE);
 	cpu_stop_init_done(&done, num_active_cpus());
-	queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+	queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
 			     &done);
-	ret = stop_machine_cpu_stop(&smdata);
+	ret = multi_cpu_stop(&msdata);
 
 	/* Busy wait for completion. */
 	while (!completion_done(&done.completion))
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 37/50] sched: Introduce migrate_swap()
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable any more.
We make it appear as if they were migrated before they went to sleep. The
only observable difference is the previous cpu used in the wakeup path, so
we override that.

TODO: I'm fairly sure we can get rid of the wake_cpu != -1 test by keeping
wake_cpu set to the actual task cpu; just couldn't be bothered to think
through all the cases.
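
For context, the call pattern this enables looks like the following
(condensed from the task_numa_migrate() changes later in this series; shown
here only for illustration): if a swap candidate was found on the destination
CPU, trade places with it, otherwise fall back to a one-way migration.

	if (env.best_task == NULL)
		return migrate_task_to(p, env.best_cpu);

	/* A swap candidate was found on the destination CPU: trade places */
	ret = migrate_swap(p, env.best_task);
	put_task_struct(env.best_task);
	return ret;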

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h    |   1 +
 kernel/sched/core.c      | 103 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c      |   3 +-
 kernel/sched/idle_task.c |   2 +-
 kernel/sched/rt.c        |   5 +--
 kernel/sched/sched.h     |   3 +-
 kernel/sched/stop_task.c |   2 +-
 7 files changed, 105 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3418b0b..3e8c547 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1035,6 +1035,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	int wake_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 374da2b..67f2b7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1032,6 +1032,90 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+static void __migrate_swap_task(struct task_struct *p, int cpu)
+{
+	if (p->on_rq) {
+		struct rq *src_rq, *dst_rq;
+
+		src_rq = task_rq(p);
+		dst_rq = cpu_rq(cpu);
+
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	} else {
+		/*
+		 * Task isn't running anymore; make it appear like we migrated
+		 * it before it went to sleep. This means on wakeup we make the
+		 * previous cpu our target instead of where it really is.
+		 */
+		p->wake_cpu = cpu;
+	}
+}
+
+struct migration_swap_arg {
+	struct task_struct *src_task, *dst_task;
+	int src_cpu, dst_cpu;
+};
+
+static int migrate_swap_stop(void *data)
+{
+	struct migration_swap_arg *arg = data;
+	struct rq *src_rq, *dst_rq;
+	int ret = -EAGAIN;
+
+	src_rq = cpu_rq(arg->src_cpu);
+	dst_rq = cpu_rq(arg->dst_cpu);
+
+	double_rq_lock(src_rq, dst_rq);
+	if (task_cpu(arg->dst_task) != arg->dst_cpu)
+		goto unlock;
+
+	if (task_cpu(arg->src_task) != arg->src_cpu)
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
+		goto unlock;
+
+	if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
+		goto unlock;
+
+	__migrate_swap_task(arg->src_task, arg->dst_cpu);
+	__migrate_swap_task(arg->dst_task, arg->src_cpu);
+
+	ret = 0;
+
+unlock:
+	double_rq_unlock(src_rq, dst_rq);
+
+	return ret;
+}
+
+/*
+ * XXX worry about hotplug
+ */
+int migrate_swap(struct task_struct *cur, struct task_struct *p)
+{
+	struct migration_swap_arg arg = {
+		.src_task = cur,
+		.src_cpu = task_cpu(cur),
+		.dst_task = p,
+		.dst_cpu = task_cpu(p),
+	};
+
+	if (arg.src_cpu == arg.dst_cpu)
+		return -EINVAL;
+
+	if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
+		return -EINVAL;
+
+	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
+		return -EINVAL;
+
+	return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+}
+
 struct migration_arg {
 	struct task_struct *task;
 	int dest_cpu;
@@ -1251,9 +1335,9 @@ out:
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 {
-	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+	cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -1528,7 +1612,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
 
-	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+	if (p->wake_cpu != -1) {	/* XXX make this condition go away */
+		cpu = p->wake_cpu;
+		p->wake_cpu = -1;
+	}
+
+	cpu = select_task_rq(p, cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
@@ -1614,6 +1703,10 @@ static void __sched_fork(struct task_struct *p)
 {
 	p->on_rq			= 0;
 
+#ifdef CONFIG_SMP
+	p->wake_cpu			= -1;
+#endif
+
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
@@ -1765,7 +1858,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
+	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 
 	/* Initialize new task's runnable average */
@@ -2093,7 +2186,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d244d0..cf16c1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3655,11 +3655,10 @@ done:
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..516c3d9 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..d81866d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1169,13 +1169,10 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
-	int cpu;
-
-	cpu = task_cpu(p);
 
 	if (p->nr_cpus_allowed == 1)
 		goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 778f875..99b1ecd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_swap(struct task_struct *, struct task_struct *);
 static inline void task_numa_free(struct task_struct *p)
 {
 	kfree(p->numa_faults);
@@ -988,7 +989,7 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..47197de 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 38/50] sched: numa: Use a system-wide search to find swap/migration candidates
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit but does not yet
properly consider total node balance.

In the old scheme a task selected its preferred node based on the highest
number of private faults recorded on a node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes, task_numa_migrate() searches the whole system for tasks to
swap with that would both improve the overall compute balance and minimise
the expected number of remote NUMA hinting faults.
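
To make the selection criterion concrete, the standalone userspace sketch
below computes the fault "improvement" of a candidate swap the same way
task_numa_compare() in this patch does: the source task's fault differential
between destination and source node, plus the candidate task's differential
in the opposite direction. The struct and the numbers are invented for the
example; only the arithmetic mirrors the patch.

#include <stdio.h>

/* Invented for illustration: per-task NUMA hinting fault counts on two nodes */
struct task_faults {
	const char *name;
	unsigned long faults[2];
};

/*
 * Combined improvement of swapping task p (on src_nid) with task cur
 * (on dst_nid): p's gain from moving src -> dst plus cur's gain from
 * moving dst -> src.
 */
static long swap_improvement(const struct task_faults *p, int src_nid,
			     const struct task_faults *cur, int dst_nid)
{
	long imp;

	imp  = (long)p->faults[dst_nid] - (long)p->faults[src_nid];
	imp += (long)cur->faults[src_nid] - (long)cur->faults[dst_nid];
	return imp;
}

int main(void)
{
	struct task_faults p   = { "p",   { 100, 400 } };	/* faults mostly on node 1 */
	struct task_faults cur = { "cur", { 300,  50 } };	/* faults mostly on node 0 */

	/* A positive value means fewer expected remote faults after the swap */
	printf("improvement from swapping p and cur: %ld\n",
	       swap_improvement(&p, 0, &cur, 1));
	return 0;
}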

Note from Mel: There appears to be no guarantee that the node the source
	task is placed on by task_numa_migrate() has any relationship
	to the newly selected task->numa_preferred_nid. It is not clear
	if this is deliberate but it looks accidental.

[riel@redhat.com: Do not swap with tasks that cannot run on source cpu]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 244 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 178 insertions(+), 66 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf16c1a..12b42a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -816,6 +816,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static unsigned long task_h_load(struct task_struct *p);
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -906,12 +908,40 @@ static unsigned long target_load(int cpu, int type);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
+/* Cached statistics for all CPUs within a node */
 struct numa_stats {
+	unsigned long nr_running;
 	unsigned long load;
-	s64 eff_load;
-	unsigned long faults;
+
+	/* Total compute capacity of CPUs on a node */
+	unsigned long power;
+
+	/* Approximate capacity in terms of runnable tasks on a node */
+	unsigned long capacity;
+	int has_capacity;
 };
 
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->nr_running += rq->nr_running;
+		ns->load += weighted_cpuload(cpu);
+		ns->power += power_of(cpu);
+	}
+
+	ns->load = (ns->load * SCHED_POWER_SCALE) / ns->power;
+	ns->capacity = DIV_ROUND_CLOSEST(ns->power, SCHED_POWER_SCALE);
+	ns->has_capacity = (ns->nr_running < ns->capacity);
+}
+
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -920,28 +950,126 @@ struct task_numa_env {
 
 	struct numa_stats src_stats, dst_stats;
 
-	unsigned long best_load;
+	int imbalance_pct, idx;
+
+	struct task_struct *best_task;
+	long best_imp;
 	int best_cpu;
 };
 
+static void task_numa_assign(struct task_numa_env *env,
+			     struct task_struct *p, long imp)
+{
+	if (env->best_task)
+		put_task_struct(env->best_task);
+	if (p)
+		get_task_struct(p);
+
+	env->best_task = p;
+	env->best_imp = imp;
+	env->best_cpu = env->dst_cpu;
+}
+
+/*
+ * This checks if the overall compute and NUMA accesses of the system would
+ * be improved if the source task was migrated to the target dst_cpu,
+ * taking into account that it might be best if the task running on
+ * dst_cpu is exchanged with the source task.
+ */
+static void task_numa_compare(struct task_numa_env *env, long imp)
+{
+	struct rq *src_rq = cpu_rq(env->src_cpu);
+	struct rq *dst_rq = cpu_rq(env->dst_cpu);
+	struct task_struct *cur;
+	long dst_load, src_load;
+	long load;
+
+	rcu_read_lock();
+	cur = ACCESS_ONCE(dst_rq->curr);
+	if (cur->pid == 0) /* idle */
+		cur = NULL;
+
+	/*
+	 * "imp" is the fault differential for the source task between the
+	 * source and destination node. Calculate the total differential for
+	 * the source task and potential destination task. The more negative
+	 * the value is, the more remote accesses would be expected to
+	 * be incurred if the tasks were swapped.
+	 */
+	if (cur) {
+		/* Skip this swap candidate if cannot move to the source cpu */
+		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
+			goto unlock;
+
+		imp += task_faults(cur, env->src_nid) -
+		       task_faults(cur, env->dst_nid);
+	}
+
+	if (imp < env->best_imp)
+		goto unlock;
+
+	if (!cur) {
+		/* Is there capacity at our destination? */
+		if (env->src_stats.has_capacity &&
+		    !env->dst_stats.has_capacity)
+			goto unlock;
+
+		goto balance;
+	}
+
+	/* Balance doesn't matter much if we're running a task per cpu */
+	if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
+		goto assign;
+
+	/*
+	 * In the overloaded case, try and keep the load balanced.
+	 */
+balance:
+	dst_load = env->dst_stats.load;
+	src_load = env->src_stats.load;
+
+	/* XXX missing power terms */
+	load = task_h_load(env->p);
+	dst_load += load;
+	src_load -= load;
+
+	if (cur) {
+		load = task_h_load(cur);
+		dst_load -= load;
+		src_load += load;
+	}
+
+	/* make src_load the smaller */
+	if (dst_load < src_load)
+		swap(dst_load, src_load);
+
+	if (src_load * env->imbalance_pct < dst_load * 100)
+		goto unlock;
+
+assign:
+	task_numa_assign(env, cur, imp);
+unlock:
+	rcu_read_unlock();
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
-	int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+	const struct cpumask *cpumask = cpumask_of_node(p->numa_preferred_nid);
 	struct task_numa_env env = {
 		.p = p,
+
 		.src_cpu = task_cpu(p),
 		.src_nid = cpu_to_node(task_cpu(p)),
-		.dst_cpu = node_cpu,
-		.dst_nid = p->numa_preferred_nid,
-		.best_load = ULONG_MAX,
-		.best_cpu = task_cpu(p),
+
+		.imbalance_pct = 112,
+
+		.best_task = NULL,
+		.best_imp = 0,
+		.best_cpu = -1
 	};
-	struct sched_domain *sd;
-	int cpu;
-	struct task_group *tg = task_group(p);
-	unsigned long weight;
-	bool balanced;
-	int imbalance_pct, idx = -1;
+ 	struct sched_domain *sd;
+	unsigned long faults;
+	int nid, cpu, ret;
 
 	/*
 	 * Find the lowest common scheduling domain covering the nodes of both
@@ -949,66 +1077,52 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	for_each_domain(env.src_cpu, sd) {
-		if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
-			/*
-			 * busy_idx is used for the load decision as it is the
-			 * same index used by the regular load balancer for an
-			 * active cpu.
-			 */
-			idx = sd->busy_idx;
-			imbalance_pct = sd->imbalance_pct;
+		if (cpumask_intersects(cpumask, sched_domain_span(sd))) {
+			env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
 			break;
 		}
 	}
 	rcu_read_unlock();
 
-	if (WARN_ON_ONCE(idx == -1))
-		return 0;
+	faults = task_faults(p, env.src_nid);
+	update_numa_stats(&env.src_stats, env.src_nid);
 
-	/*
-	 * XXX the below is mostly nicked from wake_affine(); we should
-	 * see about sharing a bit if at all possible; also it might want
-	 * some per entity weight love.
-	 */
-	weight = p->se.load.weight;
-	env.src_stats.load = source_load(env.src_cpu, idx);
-	env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
-	env.src_stats.eff_load *= power_of(env.src_cpu);
-	env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
-
-	for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
-		env.dst_cpu = cpu;
-		env.dst_stats.load = target_load(cpu, idx);
- 
- 		/* If the CPU is idle, use it */
-		if (!env.dst_stats.load) {
-			env.best_cpu = cpu;
-			goto migrate;
-		}
+	/* Find an alternative node with relatively better statistics */
+	for_each_online_node(nid) {
+		long imp;
 
-		/* Otherwise check the target CPU load */
-		env.dst_stats.eff_load = 100;
-		env.dst_stats.eff_load *= power_of(cpu);
-		env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+		if (nid == env.src_nid)
+			continue;
 
-		/*
-		 * Destination is considered balanced if the destination CPU is
-		 * less loaded than the source CPU. Unfortunately there is a
-		 * risk that a task running on a lightly loaded CPU will not
-		 * migrate to its preferred node due to load imbalances.
-		 */
-		balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
-		if (!balanced)
+		/* Only consider nodes that recorded more faults */
+		imp = task_faults(p, nid) - faults;
+		if (imp < 0)
 			continue;
 
-		if (env.dst_stats.eff_load < env.best_load) {
-			env.best_load = env.dst_stats.eff_load;
-			env.best_cpu = cpu;
+		env.dst_nid = nid;
+		update_numa_stats(&env.dst_stats, env.dst_nid);
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			/* Skip this CPU if the source task cannot migrate */
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				continue;
+
+			env.dst_cpu = cpu;
+			task_numa_compare(&env, imp);
 		}
 	}
 
-migrate:
-	return migrate_task_to(p, env.best_cpu);
+	/* No better CPU than the current one was found. */
+	if (env.best_cpu == -1)
+		return -EAGAIN;
+
+	if (env.best_task == NULL) {
+		int ret = migrate_task_to(p, env.best_cpu);
+		return ret;
+	}
+
+	ret = migrate_swap(p, env.best_task);
+	put_task_struct(env.best_task);
+	return ret;
 }
 
 /* Attempt to migrate a task to a CPU on the preferred node. */
@@ -1046,7 +1160,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults;
+		unsigned long faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1056,10 +1170,10 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults[i] >>= 1;
 			p->numa_faults[i] += p->numa_faults_buffer[i];
 			p->numa_faults_buffer[i] = 0;
+
+			faults += p->numa_faults[i];
 		}
 
-		/* Find maximum private faults */
-		faults = p->numa_faults[task_faults_idx(nid, 1)];
 		if (faults > max_faults) {
 			max_faults = faults;
 			max_nid = nid;
@@ -4405,8 +4519,6 @@ static int move_one_task(struct lb_env *env)
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
-
 static const unsigned int sched_nr_migrate_break = 32;
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 39/50] sched: numa: Favor placing a task on the preferred node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

A task's preferred node is selected based on the number of faults recorded
for a node, but task_numa_migrate() conducts a global search regardless of
the preferred nid. This patch checks whether the preferred nid has capacity
and, if so, searches for a CPU within that node. This avoids a global
search when the preferred node is not overloaded.
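
Condensed from the hunk below, the resulting control flow in
task_numa_migrate() is roughly the following (loop body elided; this is a
readability sketch, not a drop-in replacement):

	env.dst_nid = p->numa_preferred_nid;
	imp = task_faults(env.p, env.dst_nid) - faults;
	update_numa_stats(&env.dst_stats, env.dst_nid);

	if (env.dst_stats.has_capacity) {
		/* Preferred node has spare capacity: search only its CPUs */
		task_numa_find_cpu(&env, imp);
	} else {
		/* Preferred node is overloaded: fall back to the global search */
		for_each_online_node(nid) {
			if (nid == env.src_nid || nid == p->numa_preferred_nid)
				continue;
			/* ... only nodes that recorded more faults are considered ... */
		}
	}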

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12b42a6..f2bd291 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1052,6 +1052,20 @@ unlock:
 	rcu_read_unlock();
 }
 
+static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+{
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+		/* Skip this CPU if the source task cannot migrate */
+		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
+			continue;
+
+		env->dst_cpu = cpu;
+		task_numa_compare(env, imp);
+	}
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	const struct cpumask *cpumask = cpumask_of_node(p->numa_preferred_nid);
@@ -1069,7 +1083,8 @@ static int task_numa_migrate(struct task_struct *p)
 	};
  	struct sched_domain *sd;
 	unsigned long faults;
-	int nid, cpu, ret;
+	int nid, ret;
+	long imp;
 
 	/*
 	 * Find the lowest common scheduling domain covering the nodes of both
@@ -1086,28 +1101,29 @@ static int task_numa_migrate(struct task_struct *p)
 
 	faults = task_faults(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
+	env.dst_nid = p->numa_preferred_nid;
+	imp = task_faults(env.p, env.dst_nid) - faults;
+	update_numa_stats(&env.dst_stats, env.dst_nid);
 
-	/* Find an alternative node with relatively better statistics */
-	for_each_online_node(nid) {
-		long imp;
-
-		if (nid == env.src_nid)
-			continue;
-
-		/* Only consider nodes that recorded more faults */
-		imp = task_faults(p, nid) - faults;
-		if (imp < 0)
-			continue;
+	/*
+	 * If the preferred nid has capacity then use it. Otherwise find an
+	 * alternative node with relatively better statistics.
+	 */
+	if (env.dst_stats.has_capacity) {
+		task_numa_find_cpu(&env, imp);
+	} else {
+		for_each_online_node(nid) {
+			if (nid == env.src_nid || nid == p->numa_preferred_nid)
+				continue;
 
-		env.dst_nid = nid;
-		update_numa_stats(&env.dst_stats, env.dst_nid);
-		for_each_cpu(cpu, cpumask_of_node(nid)) {
-			/* Skip this CPU if the source task cannot migrate */
-			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			/* Only consider nodes that recorded more faults */
+			imp = task_faults(env.p, nid) - faults;
+			if (imp < 0)
 				continue;
 
-			env.dst_cpu = cpu;
-			task_numa_compare(&env, imp);
+			env.dst_nid = nid;
+			update_numa_stats(&env.dst_stats, env.dst_nid);
+			task_numa_find_cpu(&env, imp);
 		}
 	}
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 40/50] mm: numa: Change page last {nid,pid} into {cpu,pid}
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

Change the per-page last fault tracking to use cpu,pid instead of nid,pid.
This will allow us to look up the alternate task more easily. Note that
even though it is the cpu that is stored in the page flags, the
mpol_misplaced decision is still based on the node.
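
As a standalone userspace illustration of the new encoding (the EX_* bit
widths below are made up for the example; the kernel derives the real widths
in page-flags-layout.h, and the helpers here only mirror the names used in
the diff):

#include <stdio.h>

/* Example bit widths, chosen arbitrarily for the demonstration */
#define EX_PID_BITS	8
#define EX_PID_MASK	((1 << EX_PID_BITS) - 1)
#define EX_CPU_BITS	8
#define EX_CPU_MASK	((1 << EX_CPU_BITS) - 1)

/* Same shape as the helpers added by this patch, userspace-only */
static int cpu_pid_to_cpupid(int cpu, int pid)
{
	return ((cpu & EX_CPU_MASK) << EX_PID_BITS) | (pid & EX_PID_MASK);
}

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> EX_PID_BITS) & EX_CPU_MASK;
}

static int cpupid_to_pid(int cpupid)
{
	return cpupid & EX_PID_MASK;
}

int main(void)
{
	int cpupid = cpu_pid_to_cpupid(3, 1234);

	/* Only the low bits of the pid are kept, which is enough for the
	 * "did the same task touch this page last?" heuristic in the
	 * fault path */
	printf("cpupid=0x%x -> cpu=%d pid(low bits)=%d\n",
	       cpupid, cpupid_to_cpu(cpupid), cpupid_to_pid(cpupid));
	return 0;
}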

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mm.h                | 90 ++++++++++++++++++++++-----------------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 22 +++++-----
 kernel/bounds.c                   |  4 ++
 kernel/sched/fair.c               |  6 +--
 mm/huge_memory.c                  |  8 ++--
 mm/memory.c                       | 16 +++----
 mm/mempolicy.c                    | 16 ++++---
 mm/migrate.c                      |  4 +-
 mm/mm_init.c                      | 18 ++++----
 mm/mmzone.c                       | 14 +++---
 mm/mprotect.c                     | 28 ++++++------
 mm/page_alloc.c                   |  4 +-
 13 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a0db6c..61dc023 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
  * sets it, so none of the operations on it need to be atomic.
  */
 
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NIDPID_PGOFF	(ZONES_PGOFF - LAST_NIDPID_WIDTH)
+#define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NIDPID_PGSHIFT	(LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NIDPID_MASK	((1UL << LAST_NIDPID_WIDTH) - 1)
+#define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,96 +668,106 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
-	return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
+	return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
-	return nidpid & LAST__PID_MASK;
+	return cpupid & LAST__PID_MASK;
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_cpu(int cpupid)
 {
-	return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+	return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
-	return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+	return cpu_to_node(cpupid_to_cpu(cpupid));
 }
 
-static inline bool nidpid_nid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
 {
-	return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+	return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
 }
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-static inline int page_nidpid_xchg_last(struct page *page, int nid)
+static inline bool cpupid_cpu_unset(int cpupid)
 {
-	return xchg(&page->_last_nidpid, nid);
+	return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
 }
 
-static inline int page_nidpid_last(struct page *page)
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page->_last_nidpid;
+	return xchg(&page->_last_cpupid, cpupid);
 }
-static inline void page_nidpid_reset_last(struct page *page)
+
+static inline int page_cpupid_last(struct page *page)
+{
+	return page->_last_cpupid;
+}
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	page->_last_nidpid = -1;
+	page->_last_cpupid = -1;
 }
 #else
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
+	return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
 }
 
-extern int page_nidpid_xchg_last(struct page *page, int nidpid);
+extern int page_cpupid_xchg_last(struct page *page, int cpupid);
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
-	int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
+	int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
 
-	page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-	page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+	page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+	page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 }
-#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
-#else
-static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
+#else /* !CONFIG_NUMA_BALANCING */
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
 {
-	return page_to_nid(page);
+	return page_to_nid(page); /* XXX */
 }
 
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
 {
 	return -1;
 }
 
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpupid_to_cpu(int cpupid)
 {
 	return -1;
 }
 
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpu_pid_to_cpupid(int nid, int pid)
+{
+	return -1;
+}
+
+static inline bool cpupid_pid_unset(int cpupid)
 {
 	return 1;
 }
 
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
 {
 }
-#endif
+#endif /* CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f46378e..b0370cd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-	int _last_nidpid;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+	int _last_cpupid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 02bc918..da52366 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -39,9 +39,9 @@
  * lookup is necessary.
  *
  * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: |       NODE     | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
- *      " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ *      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
  * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
  */
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -65,18 +65,18 @@
 #define LAST__PID_SHIFT 8
 #define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)
 
-#define LAST__NID_SHIFT NODES_SHIFT
-#define LAST__NID_MASK  ((1 << LAST__NID_SHIFT)-1)
+#define LAST__CPU_SHIFT NR_CPUS_BITS
+#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)
 
-#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
+#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
 #else
-#define LAST_NIDPID_SHIFT 0
+#define LAST_CPUPID_SHIFT 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
-#define LAST_NIDPID_WIDTH 0
+#define LAST_CPUPID_WIDTH 0
 #endif
 
 /*
@@ -87,8 +87,8 @@
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
-#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
 #include <linux/mmzone.h>
 #include <linux/kbuild.h>
 #include <linux/page_cgroup.h>
+#include <linux/log2.h>
 
 void foo(void)
 {
@@ -17,5 +18,8 @@ void foo(void)
 	DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
 	DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
 	DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
 	/* End of constants */
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f2bd291..bafa8d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1208,7 +1208,7 @@ static void task_numa_placement(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 {
 	struct task_struct *p = current;
 	int priv;
@@ -1224,8 +1224,8 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
 	 * First accesses are treated as private, otherwise consider accesses
 	 * to be private if the accessing pid has not changed
 	 */
-	if (!nidpid_pid_unset(last_nidpid))
-		priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+	if (!cpupid_pid_unset(last_cpupid))
+		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
 	else
 		priv = 1;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622bc7e..cf903fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
-	int target_nid, last_nidpid = -1;
+	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
 
@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1377,7 +1377,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
 
 	return 0;
 }
@@ -1692,7 +1692,7 @@ static void __split_huge_page_refcount(struct page *page,
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
+		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 948ec32..6b558a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3547,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	int page_nid = -1;
-	int last_nidpid;
+	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
 
@@ -3578,7 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
-	last_nidpid = page_nidpid_last(page);
+	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
@@ -3594,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_nidpid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, migrated);
 	return 0;
 }
 
@@ -3609,7 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
-	int last_nidpid;
+	int last_cpupid;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -3654,7 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
@@ -3667,7 +3667,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_nidpid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index adc93b2..a458b82 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2244,6 +2244,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zone *zone;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
+	int thiscpu = raw_smp_processor_id();
+	int thisnid = cpu_to_node(thiscpu);
 	int polnid = -1;
 	int ret = -1;
 
@@ -2292,11 +2294,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_nidpid;
-		int this_nidpid;
+		int last_cpupid;
+		int this_cpupid;
 
-		polnid = numa_node_id();
-		this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
+		polnid = thisnid;
+		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
 		/*
 		 * Multi-stage node selection is used in conjunction
@@ -2319,8 +2321,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
-		if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
+		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
 			goto out;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -2330,7 +2332,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * This way a short and temporary process migration will
 		 * not cause excessive memory migration.
 		 */
-		if (polnid != current->numa_preferred_nid &&
+		if (thisnid != current->numa_preferred_nid &&
 				!current->numa_migrate_seq)
 			goto out;
 #endif
diff --git a/mm/migrate.c b/mm/migrate.c
index f56ca20..637aac7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_nidpid_xchg_last(newpage, page_nidpid_last(page));
+		page_cpupid_xchg_last(newpage, page_cpupid_last(page));
 
 	return newpage;
 }
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	if (!new_page)
 		goto out_fail;
 
-	page_nidpid_xchg_last(new_page, page_nidpid_last(page));
+	page_cpupid_xchg_last(new_page, page_cpupid_last(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467de57..68562e9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
 	unsigned long or_mask, add_mask;
 
 	shift = 8 * sizeof(unsigned long);
-	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
+	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
-		LAST_NIDPID_WIDTH,
+		LAST_CPUPID_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastnidpid %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
-		LAST_NIDPID_SHIFT);
+		LAST_CPUPID_SHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
+		"Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
-		(unsigned long)LAST_NIDPID_PGSHIFT);
+		(unsigned long)LAST_CPUPID_PGSHIFT);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
 		"Node not in page flags");
 #endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
-		"Last nidpid not in page flags");
+		"Last cpupid not in page flags");
 #endif
 
 	if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 25bb477..2c70c3a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
 }
 
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
-int page_nidpid_xchg_last(struct page *page, int nidpid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_IN_PAGE_FLAGS)
+int page_cpupid_xchg_last(struct page *page, int cpupid)
 {
 	unsigned long old_flags, flags;
-	int last_nidpid;
+	int last_cpupid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_nidpid = page_nidpid_last(page);
+		last_cpupid = page_cpupid_last(page);
 
-		flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
-		flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_nidpid;
+	return last_cpupid;
 }
 #endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 191a89a..8ae8909 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
+		int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
-	bool all_same_nidpid = true;
-	int last_nid = -1;
+	bool all_same_cpupid = true;
+	int last_cpu = -1;
 	int last_pid = -1;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -72,17 +72,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * hits on the zero page
 				 */
 				if (page && !is_zero_pfn(page_to_pfn(page))) {
-					int nidpid = page_nidpid_last(page);
-					int this_nid = nidpid_to_nid(nidpid);
-					int this_pid = nidpid_to_pid(nidpid);
+					int cpupid = page_cpupid_last(page);
+					int this_cpu = cpupid_to_cpu(cpupid);
+					int this_pid = cpupid_to_pid(cpupid);
 
-					if (last_nid == -1)
-						last_nid = this_nid;
+					if (last_cpu == -1)
+						last_cpu = this_cpu;
 					if (last_pid == -1)
 						last_pid = this_pid;
-					if (last_nid != this_nid ||
+					if (last_cpu != this_cpu ||
 					    last_pid != this_pid) {
-						all_same_nidpid = false;
+						all_same_cpupid = false;
 					}
 
 					if (!pte_numa(oldpte)) {
@@ -123,7 +123,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-	*ret_all_same_nidpid = all_same_nidpid;
+	*ret_all_same_cpupid = all_same_cpupid;
 	return pages;
 }
 
@@ -150,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd_t *pmd;
 	unsigned long next;
 	unsigned long pages = 0;
-	bool all_same_nidpid;
+	bool all_same_cpupid;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -176,7 +176,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
-				 dirty_accountable, prot_numa, &all_same_nidpid);
+				 dirty_accountable, prot_numa, &all_same_cpupid);
 		pages += this_pages;
 
 		/*
@@ -185,7 +185,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && this_pages && all_same_nidpid)
+		if (prot_numa && this_pages && all_same_cpupid)
 			change_pmd_protnuma(vma->vm_mm, addr, pmd);
 	} while (pmd++, addr = next, addr != end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bf960e..4b6c4e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	page_nidpid_reset_last(page);
+	page_cpupid_reset_last(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		page_mapcount_reset(page);
-		page_nidpid_reset_last(page);
+		page_cpupid_reset_last(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently, tasks that partition their data can still "false-share"
pages, presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the CPU a task was running on when
accessing a page, a truncated representation of the faulting PID is
stored. If subsequent faults come from different PIDs, it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.
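
As a rough sketch of the private/shared classification described above
(standalone C with hypothetical ex_* names; the real task_numa_fault()
below additionally allocates the per-node fault buffers and performs
the group join under RCU and the group locks):

/*
 * Illustrative only: a fault is treated as private when the pid
 * recorded in the page's last cpupid tag matches the faulting task;
 * otherwise the two tasks are candidates for sharing a numa_group.
 */
#define EX_PID_BITS	8			/* like LAST__PID_SHIFT */
#define EX_PID_MASK	((1 << EX_PID_BITS) - 1)

struct ex_task {
	int pid;
};

static int ex_fault_is_private(const struct ex_task *p, int last_cpupid)
{
	int last_pid = last_cpupid & EX_PID_MASK;

	/* nothing recorded yet for this page: treat the access as private */
	if (last_pid == EX_PID_MASK)
		return 1;

	/* same truncated pid as the previous fault: still private */
	return last_pid == (p->pid & EX_PID_MASK);
}

When such a check reports a non-private access, the patch calls
task_numa_group() with the cpu and pid taken from the cpupid tag to try
to join the two tasks into one group.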

Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |   3 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 169 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |   5 +-
 mm/memory.c           |   8 +++
 5 files changed, 175 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e8c547..ea057a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1338,6 +1338,9 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	struct list_head numa_entry;
+	struct numa_group *numa_group;
+
 	/*
 	 * Exponential decaying average of faults on a per-node basis.
 	 * Scheduling placement decisions are made based on the these counts.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67f2b7b..3808860 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1740,6 +1740,9 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+
+	INIT_LIST_HEAD(&p->numa_entry);
+	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bafa8d7..b80eaa2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,17 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+struct numa_group {
+	atomic_t refcount;
+
+	spinlock_t lock; /* nr_tasks, tasks */
+	int nr_tasks;
+	struct list_head task_list;
+
+	struct rcu_head rcu;
+	atomic_long_t faults[0];
+};
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1180,7 +1191,10 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
+			long diff;
+
 			i = task_faults_idx(nid, priv);
+			diff = -p->numa_faults[i];
 
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
@@ -1188,6 +1202,11 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
+			diff += p->numa_faults[i];
+			if (p->numa_group) {
+				/* safe because we can only change our own group */
+				atomic_long_add(diff, &p->numa_group->faults[i]);
+			}
 		}
 
 		if (faults > max_faults) {
@@ -1205,6 +1224,130 @@ static void task_numa_placement(struct task_struct *p)
 	}
 }
 
+static inline int get_numa_group(struct numa_group *grp)
+{
+	return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+	if (atomic_dec_and_test(&grp->refcount))
+		kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+	struct numa_group *grp, *my_grp;
+	struct task_struct *tsk;
+	bool join = false;
+	int i;
+
+	if (unlikely(!p->numa_group)) {
+		unsigned int size = sizeof(struct numa_group) +
+			            2*nr_node_ids*sizeof(atomic_long_t);
+
+		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!grp)
+			return;
+
+		atomic_set(&grp->refcount, 1);
+		spin_lock_init(&grp->lock);
+		INIT_LIST_HEAD(&grp->task_list);
+
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+		list_add(&p->numa_entry, &grp->task_list);
+		grp->nr_tasks++;
+		rcu_assign_pointer(p->numa_group, grp);
+	}
+
+	rcu_read_lock();
+	tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+	if ((tsk->pid & LAST__PID_MASK) != pid)
+		goto unlock;
+
+	grp = rcu_dereference(tsk->numa_group);
+	if (!grp)
+		goto unlock;
+
+	my_grp = p->numa_group;
+	if (grp == my_grp)
+		goto unlock;
+
+	/*
+	 * Only join the other group if its bigger; if we're the bigger group,
+	 * the other task will join us.
+	 */
+	if (my_grp->nr_tasks > grp->nr_tasks)
+	    	goto unlock;
+
+	/*
+	 * Tie-break on the grp address.
+	 */
+	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+		goto unlock;
+
+	if (!get_numa_group(grp))
+		goto unlock;
+
+	join = true;
+
+unlock:
+	rcu_read_unlock();
+
+	if (!join)
+		return;
+
+	for (i = 0; i < 2*nr_node_ids; i++) {
+		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+	}
+
+	double_lock(&my_grp->lock, &grp->lock);
+
+	list_move(&p->numa_entry, &grp->task_list);
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+	struct numa_group *grp = p->numa_group;
+	int i;
+
+	if (grp) {
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+		spin_lock(&grp->lock);
+		list_del(&p->numa_entry);
+		grp->nr_tasks--;
+		spin_unlock(&grp->lock);
+		rcu_assign_pointer(p->numa_group, NULL);
+		put_numa_group(grp);
+	}
+
+	kfree(p->numa_faults);
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -1220,15 +1363,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/*
-	 * First accesses are treated as private, otherwise consider accesses
-	 * to be private if the accessing pid has not changed
-	 */
-	if (!cpupid_pid_unset(last_cpupid))
-		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
-	else
-		priv = 1;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1243,6 +1377,23 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	}
 
 	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+		priv = 1;
+	} else {
+		int cpu, pid;
+
+		cpu = cpupid_to_cpu(last_cpupid);
+		pid = cpupid_to_pid(last_cpupid);
+
+		priv = (pid == (p->pid & LAST__PID_MASK));
+		if (!priv)
+			task_numa_group(p, cpu, pid);
+	}
+
+	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b1ecd..4c6ec25 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -557,10 +557,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
-	kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
 #else /* CONFIG_NUMA_BALANCING */
 static inline void task_numa_free(struct task_struct *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 6b558a5..f779403 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2730,6 +2730,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		get_page(dirty_page);
 
 reuse:
+		/*
+		 * Clear the pages cpupid information as the existing
+		 * information potentially belongs to a now completely
+		 * unrelated process.
+		 */
+		if (old_page)
+			page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults
@ 2013-09-10  9:32   ` Mel Gorman
  0 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently, tasks that partition their data can still "false-share"
pages, presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the CPU a task was running on when
accessing a page, a truncated representation of the faulting PID is
stored. If subsequent faults come from different PIDs, it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.

Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |   3 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 169 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |   5 +-
 mm/memory.c           |   8 +++
 5 files changed, 175 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e8c547..ea057a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1338,6 +1338,9 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 
+	struct list_head numa_entry;
+	struct numa_group *numa_group;
+
 	/*
 	 * Exponential decaying average of faults on a per-node basis.
 	 * Scheduling placement decisions are made based on the these counts.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67f2b7b..3808860 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1740,6 +1740,9 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+
+	INIT_LIST_HEAD(&p->numa_entry);
+	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bafa8d7..b80eaa2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,17 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+struct numa_group {
+	atomic_t refcount;
+
+	spinlock_t lock; /* nr_tasks, tasks */
+	int nr_tasks;
+	struct list_head task_list;
+
+	struct rcu_head rcu;
+	atomic_long_t faults[0];
+};
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1180,7 +1191,10 @@ static void task_numa_placement(struct task_struct *p)
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
+			long diff;
+
 			i = task_faults_idx(nid, priv);
+			diff = -p->numa_faults[i];
 
 			/* Decay existing window, copy faults since last scan */
 			p->numa_faults[i] >>= 1;
@@ -1188,6 +1202,11 @@ static void task_numa_placement(struct task_struct *p)
 			p->numa_faults_buffer[i] = 0;
 
 			faults += p->numa_faults[i];
+			diff += p->numa_faults[i];
+			if (p->numa_group) {
+				/* safe because we can only change our own group */
+				atomic_long_add(diff, &p->numa_group->faults[i]);
+			}
 		}
 
 		if (faults > max_faults) {
@@ -1205,6 +1224,130 @@ static void task_numa_placement(struct task_struct *p)
 	}
 }
 
+static inline int get_numa_group(struct numa_group *grp)
+{
+	return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+	if (atomic_dec_and_test(&grp->refcount))
+		kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+	struct numa_group *grp, *my_grp;
+	struct task_struct *tsk;
+	bool join = false;
+	int i;
+
+	if (unlikely(!p->numa_group)) {
+		unsigned int size = sizeof(struct numa_group) +
+			            2*nr_node_ids*sizeof(atomic_long_t);
+
+		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!grp)
+			return;
+
+		atomic_set(&grp->refcount, 1);
+		spin_lock_init(&grp->lock);
+		INIT_LIST_HEAD(&grp->task_list);
+
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+		list_add(&p->numa_entry, &grp->task_list);
+		grp->nr_tasks++;
+		rcu_assign_pointer(p->numa_group, grp);
+	}
+
+	rcu_read_lock();
+	tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+	if ((tsk->pid & LAST__PID_MASK) != pid)
+		goto unlock;
+
+	grp = rcu_dereference(tsk->numa_group);
+	if (!grp)
+		goto unlock;
+
+	my_grp = p->numa_group;
+	if (grp == my_grp)
+		goto unlock;
+
+	/*
+	 * Only join the other group if its bigger; if we're the bigger group,
+	 * the other task will join us.
+	 */
+	if (my_grp->nr_tasks > grp->nr_tasks)
+	    	goto unlock;
+
+	/*
+	 * Tie-break on the grp address.
+	 */
+	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+		goto unlock;
+
+	if (!get_numa_group(grp))
+		goto unlock;
+
+	join = true;
+
+unlock:
+	rcu_read_unlock();
+
+	if (!join)
+		return;
+
+	for (i = 0; i < 2*nr_node_ids; i++) {
+		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+	}
+
+	double_lock(&my_grp->lock, &grp->lock);
+
+	list_move(&p->numa_entry, &grp->task_list);
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+	struct numa_group *grp = p->numa_group;
+	int i;
+
+	kfree(p->numa_faults);
+
+	if (grp) {
+		for (i = 0; i < 2*nr_node_ids; i++)
+			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+		spin_lock(&grp->lock);
+		list_del(&p->numa_entry);
+		grp->nr_tasks--;
+		spin_unlock(&grp->lock);
+		rcu_assign_pointer(p->numa_group, NULL);
+		put_numa_group(grp);
+	}
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -1220,15 +1363,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	if (!p->mm)
 		return;
 
-	/*
-	 * First accesses are treated as private, otherwise consider accesses
-	 * to be private if the accessing pid has not changed
-	 */
-	if (!cpupid_pid_unset(last_cpupid))
-		priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
-	else
-		priv = 1;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1243,6 +1377,23 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 	}
 
 	/*
+	 * First accesses are treated as private, otherwise consider accesses
+	 * to be private if the accessing pid has not changed
+	 */
+	if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+		priv = 1;
+	} else {
+		int cpu, pid;
+
+		cpu = cpupid_to_cpu(last_cpupid);
+		pid = cpupid_to_pid(last_cpupid);
+
+		priv = (pid == (p->pid & LAST__PID_MASK));
+		if (!priv)
+			task_numa_group(p, cpu, pid);
+	}
+
+	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
 	 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b1ecd..4c6ec25 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -557,10 +557,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 #ifdef CONFIG_NUMA_BALANCING
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
-	kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
 #else /* CONFIG_NUMA_BALANCING */
 static inline void task_numa_free(struct task_struct *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 6b558a5..f779403 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2730,6 +2730,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		get_page(dirty_page);
 
 reuse:
+		/*
+		 * Clear the pages cpupid information as the existing
+		 * information potentially belongs to a now completely
+		 * unrelated process.
+		 */
+		if (old_page)
+			page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 42/50] sched: numa: Report a NUMA task group ID
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.
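
As a quick illustration (not part of the patch), a userspace program could
read the new field like this once the change is applied; only the "Ngid"
label comes from the hunk below, the rest of the sketch is hypothetical:

  #include <stdio.h>
  #include <string.h>

  /* Print the NUMA group ID ("Ngid") of the current process; the value
   * reads as 0 while the task has not been placed in any numa_group. */
  int main(void)
  {
          char line[256];
          FILE *f = fopen("/proc/self/status", "r");

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "Ngid:", 5))
                          fputs(line, stdout);
          fclose(f);
          return 0;
  }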

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/proc/array.c       | 2 ++
 include/linux/sched.h | 5 +++++
 kernel/sched/fair.c   | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd0f1b..1bd2077 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -183,6 +183,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
+		"Ngid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
@@ -190,6 +191,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
 		task_tgid_nr_ns(p, ns),
+		task_numa_group_id(p),
 		pid_nr_ns(pid, ns),
 		ppid, tpid,
 		from_kuid_munged(user_ns, cred->uid),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ea057a2..4fad1f17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1436,12 +1436,17 @@ struct task_struct {
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   bool migrated)
 {
 }
+static inline pid_t task_numa_group_id(struct task_struct *p)
+{
+	return 0;
+}
 static inline void set_numabalancing_state(bool enabled)
 {
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b80eaa2..1faf3ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,12 +893,18 @@ struct numa_group {
 
 	spinlock_t lock; /* nr_tasks, tasks */
 	int nr_tasks;
+	pid_t gid;
 	struct list_head task_list;
 
 	struct rcu_head rcu;
 	atomic_long_t faults[0];
 };
 
+pid_t task_numa_group_id(struct task_struct *p)
+{
+	return p->numa_group ? p->numa_group->gid : 0;
+}
+
 static inline int task_faults_idx(int nid, int priv)
 {
 	return 2 * nid + priv;
@@ -1262,6 +1268,7 @@ static void task_numa_group(struct task_struct *p, int cpu, int pid)
 		atomic_set(&grp->refcount, 1);
 		spin_lock_init(&grp->lock);
 		INIT_LIST_HEAD(&grp->task_list);
+		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 43/50] mm: numa: Do not group on RO pages
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

And here's a little something to make sure the whole world does not end
up in a single group.

While we don't migrate shared executable pages, we do still scan and
fault on them. And since everybody links to libc, everybody would end
up in the same group.

[riel@redhat.com: mapcount 1]
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  7 +++++--
 kernel/sched/fair.c   |  5 +++--
 mm/huge_memory.c      | 17 ++++++++++++++---
 mm/memory.c           | 30 ++++++++++++++++++++++++++----
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4fad1f17..15888f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1434,13 +1434,16 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#define TNF_MIGRATED	0x01
+#define TNF_NO_GROUP	0x02
+
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
-				   bool migrated)
+				   int flags)
 {
 }
 static inline pid_t task_numa_group_id(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1faf3ff..ecfce3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,9 +1358,10 @@ void task_numa_free(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 {
 	struct task_struct *p = current;
+	bool migrated = flags & TNF_MIGRATED;
 	int priv;
 
 	if (!numabalancing_enabled)
@@ -1396,7 +1397,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
 		pid = cpupid_to_pid(last_cpupid);
 
 		priv = (pid == (p->pid & LAST__PID_MASK));
-		if (!priv)
+		if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, cpu, pid);
 	}
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf903fc..5c339a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1297,6 +1297,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
+	int flags = 0;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1311,6 +1312,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pmd_write(pmd))
+		flags |= TNF_NO_GROUP;
+
+	/*
 	 * Acquire the page lock to serialise THP migrations but avoid dropping
 	 * page_table_lock if at all possible
 	 */
@@ -1350,10 +1359,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(&mm->page_table_lock);
 	migrated = migrate_misplaced_transhuge_page(mm, vma,
 				pmdp, pmd, addr, page, target_nid);
-	if (migrated)
+	if (migrated) {
 		page_nid = target_nid;
-	else
+		flags |= TNF_MIGRATED;
+	} else {
 		goto check_same;
+	}
 
 	goto out;
 
@@ -1377,7 +1388,7 @@ out:
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index f779403..1aa4187 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3558,6 +3558,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
+	int flags = 0;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3586,6 +3587,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
 
+	/*
+	 * Avoid grouping on DSO/COW pages in specific and RO pages
+	 * in general, RO pages shouldn't hurt as much anyway since
+	 * they can be in shared cache state.
+	 */
+	if (!pte_write(pte))
+		flags |= TNF_NO_GROUP;
+
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
@@ -3597,12 +3606,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
-	if (migrated)
+	if (migrated) {
 		page_nid = target_nid;
+		flags |= TNF_MIGRATED;
+	}
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, 1, migrated);
+		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
 }
 
@@ -3643,6 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int page_nid = -1;
 		int target_nid;
 		bool migrated = false;
+		int flags = 0;
 
 		if (!pte_present(pteval))
 			continue;
@@ -3662,20 +3674,30 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!page))
 			continue;
 
+		/*
+		 * Avoid grouping on DSO/COW pages in specific and RO pages
+		 * in general, RO pages shouldn't hurt as much anyway since
+		 * they can be in shared cache state.
+		 */
+		if (!pte_write(pteval))
+			flags |= TNF_NO_GROUP;
+
 		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
 			migrated = migrate_misplaced_page(page, vma, target_nid);
-			if (migrated)
+			if (migrated) {
 				page_nid = target_nid;
+				flags |= TNF_MIGRATED;
+			}
 		} else {
 			put_page(page);
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, migrated);
+			task_numa_fault(last_cpupid, page_nid, 1, flags);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 44/50] sched: numa: stay on the same node if CLONE_VM
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.
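
The inheritance rule is small enough to restate on its own; a hypothetical
standalone sketch (plain C, not kernel code) of what the __sched_fork()
hunk below decides, where -1 means "no preferred node yet":

  #define _GNU_SOURCE
  #include <sched.h>      /* CLONE_VM */
  #include <stdio.h>

  /* Preferred node a child starts with, given its parent's preference. */
  static int child_preferred_nid(unsigned long clone_flags, int parent_nid)
  {
          if (clone_flags & CLONE_VM)     /* new thread: keep parent's node */
                  return parent_nid;
          return -1;                      /* new process: no preference yet */
  }

  int main(void)
  {
          printf("thread:  %d\n", child_preferred_nid(CLONE_VM, 2));
          printf("process: %d\n", child_preferred_nid(0, 2));
          return 0;
  }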

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  2 +-
 kernel/fork.c         |  2 +-
 kernel/sched/core.c   | 14 +++++++++-----
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 15888f5..4f51ceb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2005,7 +2005,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index f693bdf..2bc7f88 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1309,7 +1309,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	sched_fork(clone_flags, p);
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3808860..7bf0827 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1699,7 +1699,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
  *
  * __sched_fork() is basic setup used by init_idle() too:
  */
-static void __sched_fork(struct task_struct *p)
+static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
 
@@ -1732,11 +1732,15 @@ static void __sched_fork(struct task_struct *p)
 		p->mm->numa_scan_seq = 0;
 	}
 
+	if (clone_flags & CLONE_VM)
+		p->numa_preferred_nid = current->numa_preferred_nid;
+	else
+		p->numa_preferred_nid = -1;
+
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = 1;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
-	p->numa_preferred_nid = -1;
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
@@ -1768,12 +1772,12 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+void sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
 
-	__sched_fork(p);
+	__sched_fork(clone_flags, p);
 	/*
 	 * We mark the process as running here. This guarantees that
 	 * nobody will actually run it, and a signal or other external
@@ -4304,7 +4308,7 @@ void init_idle(struct task_struct *idle, int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	__sched_fork(idle);
+	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 45/50] sched: numa: use group fault statistics in numa placement
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

This patch uses the fraction of faults on a particular node, for both the
task and its group, to figure out the best node on which to place the task.
If the task and group statistics disagree on what the preferred node should
be, then a full rescan selects the node with the best combined weight.
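
As a rough standalone illustration of the combined weight (the 1000 and
1200 scale factors come from the hunks below; the simplified helpers and
the fault counts here are made up and are not the kernel's task_weight()
or group_weight()):

  #include <stdio.h>

  /* Per-node share of the task's faults, scaled by 1000 as in the patch. */
  static unsigned long task_share(unsigned long node_faults,
                                  unsigned long total_faults)
  {
          return total_faults ? 1000 * node_faults / total_faults : 0;
  }

  /* Per-node share of the group's faults, scaled by 1200 as in the patch. */
  static unsigned long group_share(unsigned long node_faults,
                                   unsigned long total_faults)
  {
          return total_faults ? 1200 * node_faults / total_faults : 0;
  }

  int main(void)
  {
          /* Hypothetical numbers: the task has 30 of its 100 faults on
           * node 1, while its group has 700 of its 1000 faults there. */
          unsigned long w = task_share(30, 100) + group_share(700, 1000);

          /* 300 + 840 = 1140; the rescan picks the node with the
           * highest combined weight when task and group disagree. */
          printf("combined weight for node 1: %lu\n", w);
          return 0;
  }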

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |   1 +
 kernel/sched/fair.c   | 134 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 113 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f51ceb..46fb36a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1347,6 +1347,7 @@ struct task_struct {
 	 * The values remain static for the duration of a PTE scan
 	 */
 	unsigned long *numa_faults;
+	unsigned long total_numa_faults;
 
 	/*
 	 * numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfce3e..3a92c58 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,6 +897,7 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
+	atomic_long_t total_faults;
 	atomic_long_t faults[0];
 };
 
@@ -919,6 +920,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
 		p->numa_faults[task_faults_idx(nid, 1)];
 }
 
+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+	if (!p->numa_group)
+		return 0;
+
+	return atomic_long_read(&p->numa_group->faults[2*nid]) +
+	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node.  The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_faults)
+		return 0;
+
+	total_faults = p->total_numa_faults;
+
+	if (!total_faults)
+		return 0;
+
+	return 1000 * task_faults(p, nid) / total_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+	unsigned long total_faults;
+
+	if (!p->numa_group)
+		return 0;
+
+	total_faults = atomic_long_read(&p->numa_group->total_faults);
+
+	if (!total_faults)
+		return 0;
+
+	return 1200 * group_faults(p, nid) / total_faults;
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
@@ -1018,8 +1064,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_faults(cur, env->src_nid) -
-		       task_faults(cur, env->dst_nid);
+		imp += task_weight(cur, env->src_nid) +
+		       group_weight(cur, env->src_nid) -
+		       task_weight(cur, env->dst_nid) -
+		       group_weight(cur, env->dst_nid);
 	}
 
 	if (imp < env->best_imp)
@@ -1099,7 +1147,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
  	struct sched_domain *sd;
-	unsigned long faults;
+	unsigned long weight;
 	int nid, ret;
 	long imp;
 
@@ -1116,10 +1164,10 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 	rcu_read_unlock();
 
-	faults = task_faults(p, env.src_nid);
+	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_faults(env.p, env.dst_nid) - faults;
+	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/*
@@ -1133,8 +1181,8 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
-			/* Only consider nodes that recorded more faults */
-			imp = task_faults(env.p, nid) - faults;
+			/* Only consider nodes where both task and group benefit */
+			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
 			if (imp < 0)
 				continue;
 
@@ -1181,8 +1229,8 @@ static void numa_migrate_preferred(struct task_struct *p)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1;
-	unsigned long max_faults = 0;
+	int seq, nid, max_nid = -1, max_group_nid = -1;
+	unsigned long max_faults = 0, max_group_faults = 0;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1193,7 +1241,7 @@ static void task_numa_placement(struct task_struct *p)
 
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
-		unsigned long faults = 0;
+		unsigned long faults = 0, group_faults = 0;
 		int priv, i;
 
 		for (priv = 0; priv < 2; priv++) {
@@ -1209,9 +1257,12 @@ static void task_numa_placement(struct task_struct *p)
 
 			faults += p->numa_faults[i];
 			diff += p->numa_faults[i];
+			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
 				atomic_long_add(diff, &p->numa_group->faults[i]);
+				atomic_long_add(diff, &p->numa_group->total_faults);
+				group_faults += atomic_long_read(&p->numa_group->faults[i]);
 			}
 		}
 
@@ -1219,6 +1270,27 @@ static void task_numa_placement(struct task_struct *p)
 			max_faults = faults;
 			max_nid = nid;
 		}
+
+		if (group_faults > max_group_faults) {
+			max_group_faults = group_faults;
+			max_group_nid = nid;
+		}
+	}
+
+	/*
+	 * If the preferred task and group nids are different, 
+	 * iterate over the nodes again to find the best place.
+	 */
+	if (p->numa_group && max_nid != max_group_nid) {
+		unsigned long weight, max_weight = 0;
+
+		for_each_online_node(nid) {
+			weight = task_weight(p, nid) + group_weight(p, nid);
+			if (weight > max_weight) {
+				max_weight = weight;
+				max_nid = nid;
+			}
+		}
 	}
 
 	/* Preferred node as the node with the most faults */
@@ -1273,6 +1345,8 @@ static void task_numa_group(struct task_struct *p, int cpu, int pid)
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
 
+		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
 		rcu_assign_pointer(p->numa_group, grp);
@@ -1320,6 +1394,8 @@ unlock:
 		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
 		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
 	}
+	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
+	atomic_long_add(p->total_numa_faults, &grp->total_faults);
 
 	double_lock(&my_grp->lock, &grp->lock);
 
@@ -1340,12 +1416,12 @@ void task_numa_free(struct task_struct *p)
 	struct numa_group *grp = p->numa_group;
 	int i;
 
-	kfree(p->numa_faults);
-
 	if (grp) {
 		for (i = 0; i < 2*nr_node_ids; i++)
 			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
 
+		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+
 		spin_lock(&grp->lock);
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
@@ -1353,6 +1429,8 @@ void task_numa_free(struct task_struct *p)
 		rcu_assign_pointer(p->numa_group, NULL);
 		put_numa_group(grp);
 	}
+
+	kfree(p->numa_faults);
 }
 
 /*
@@ -1382,6 +1460,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 		BUG_ON(p->numa_faults_buffer);
 		p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+		p->total_numa_faults = 0;
 	}
 
 	/*
@@ -4527,12 +4606,17 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (dst_nid == p->numa_preferred_nid ||
-	    task_faults(p, dst_nid) > task_faults(p, src_nid))
+	/* Always encourage migration to the preferred node. */
+	if (dst_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is better. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
+			task_weight(p, src_nid) + group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4552,14 +4636,20 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
-	if (src_nid == dst_nid ||
-	    p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+	if (src_nid == dst_nid)
 		return false;
 
-	if (task_faults(p, dst_nid) < task_faults(p, src_nid))
- 		return true;
- 
- 	return false;
+	/* Migrating away from the preferred node is always bad. */
+	if (src_nid == p->numa_preferred_nid)
+		return true;
+
+	/* After the task has settled, check if the new node is worse. */
+	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
+			task_weight(p, src_nid) + group_weight(p, src_nid))
+		return true;
+
+	return false;
 }
 
 #else
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

Having multiple tasks in a group go through task_numa_placement
simultaneously can lead to a task picking a wrong node to run on, because
the group stats may be in the middle of an update. This patch avoids
parallel updates by holding the numa_group lock during placement
decisions.
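
The pattern is simply "take the group lock only if the task is grouped,
and remember whether it was taken"; in userspace pthread terms (purely
illustrative, with a made-up placement_update() helper) it amounts to:

  #include <pthread.h>
  #include <stddef.h>

  struct group_stats {
          pthread_mutex_t lock;
          /* group-wide fault statistics would live here */
  };

  /* Serialise placement decisions against other tasks in the same group. */
  static void placement_update(struct group_stats *grp)
  {
          pthread_mutex_t *group_lock = NULL;

          if (grp) {                      /* ungrouped tasks skip the lock */
                  group_lock = &grp->lock;
                  pthread_mutex_lock(group_lock);
          }

          /* ... read and update the group statistics consistently ... */

          if (group_lock)
                  pthread_mutex_unlock(group_lock);
  }

  int main(void)
  {
          struct group_stats g = { .lock = PTHREAD_MUTEX_INITIALIZER };

          placement_update(&g);    /* a grouped task */
          placement_update(NULL);  /* an ungrouped task */
          return 0;
  }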

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a92c58..4653f71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
 	unsigned long max_faults = 0, max_group_faults = 0;
+	spinlock_t *group_lock = NULL;
 
 	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	if (p->numa_scan_seq == seq)
@@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
 	p->numa_migrate_seq++;
 	p->numa_scan_period_max = task_scan_max(p);
 
+	/* If the task is part of a group prevent parallel updates to group stats */
+	if (p->numa_group) {
+		group_lock = &p->numa_group->lock;
+		spin_lock(group_lock);
+	}
+
 	/* Find the node with the highest number of faults */
 	for_each_online_node(nid) {
 		unsigned long faults = 0, group_faults = 0;
@@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	/*
-	 * If the preferred task and group nids are different, 
-	 * iterate over the nodes again to find the best place.
-	 */
-	if (p->numa_group && max_nid != max_group_nid) {
-		unsigned long weight, max_weight = 0;
-
-		for_each_online_node(nid) {
-			weight = task_weight(p, nid) + group_weight(p, nid);
-			if (weight > max_weight) {
-				max_weight = weight;
-				max_nid = nid;
+	if (p->numa_group) {
+		/*
+		 * If the preferred task and group nids are different, 
+		 * iterate over the nodes again to find the best place.
+		 */
+		if (max_nid != max_group_nid) {
+			unsigned long weight, max_weight = 0;
+
+			for_each_online_node(nid) {
+				weight = task_weight(p, nid) + group_weight(p, nid);
+				if (weight > max_weight) {
+					max_weight = weight;
+					max_nid = nid;
+				}
 			}
 		}
+
+		spin_unlock(group_lock);
 	}
 
 	/* Preferred node as the node with the most faults */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 47/50] sched: numa: add debugging
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-5giqjcqnc93a89q01ymtjxpr@git.kernel.org
---
 include/linux/sched.h |  6 ++++++
 kernel/sched/debug.c  | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |  5 ++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 46fb36a..ac08eb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1357,6 +1357,7 @@ struct task_struct {
 	unsigned long *numa_faults_buffer;
 
 	int numa_preferred_nid;
+	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
@@ -2577,6 +2578,11 @@ static inline unsigned int task_cpu(const struct task_struct *p)
 	return task_thread_info(p)->cpu;
 }
 
+static inline int task_node(const struct task_struct *p)
+{
+	return cpu_to_node(task_cpu(p));
+}
+
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
 
 #else
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e076bdd..49ab782 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/kallsyms.h>
 #include <linux/utsname.h>
+#include <linux/mempolicy.h>
 
 #include "sched.h"
 
@@ -137,6 +138,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+	SEQ_printf(m, " %d", cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
@@ -159,7 +163,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 	read_lock_irqsave(&tasklist_lock, flags);
 
 	do_each_thread(g, p) {
-		if (!p->on_rq || task_cpu(p) != rq_cpu)
+		if (task_cpu(p) != rq_cpu)
 			continue;
 
 		print_task(m, rq, p);
@@ -345,7 +349,7 @@ static void sched_debug_header(struct seq_file *m)
 	cpu_clk = local_clock();
 	local_irq_restore(flags);
 
-	SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
+	SEQ_printf(m, "Sched Debug Version: v0.11, %s %.*s\n",
 		init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version);
@@ -488,6 +492,56 @@ static int __init init_sched_debug_procfs(void)
 
 __initcall(init_sched_debug_procfs);
 
+#define __P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F)
+#define P(F) \
+	SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F)
+#define __PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F))
+#define PN(F) \
+	SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F))
+
+
+static void sched_show_numa(struct task_struct *p, struct seq_file *m)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	struct mempolicy *pol;
+	int node, i;
+
+	if (p->mm)
+		P(mm->numa_scan_seq);
+
+	task_lock(p);
+	pol = p->mempolicy;
+	if (pol && !(pol->flags & MPOL_F_MORON))
+		pol = NULL;
+	mpol_get(pol);
+	task_unlock(p);
+
+	SEQ_printf(m, "numa_migrations, %ld\n", xchg(&p->numa_pages_migrated, 0));
+
+	for_each_online_node(node) {
+		for (i = 0; i < 2; i++) {
+			unsigned long nr_faults = -1;
+			int cpu_current, home_node;
+
+			if (p->numa_faults)
+				nr_faults = p->numa_faults[2*node + i];
+
+			cpu_current = !i ? (task_node(p) == node) :
+				(pol && node_isset(node, pol->v.nodes));
+
+			home_node = (p->numa_preferred_nid == node);
+
+			SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+				i, node, cpu_current, home_node, nr_faults);
+		}
+	}
+
+	mpol_put(pol);
+#endif
+}
+
 void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 {
 	unsigned long nr_switches;
@@ -591,6 +645,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		SEQ_printf(m, "%-45s:%21Ld\n",
 			   "clock-delta", (long long)(t1-t0));
 	}
+
+	sched_show_numa(p, m);
 }
 
 void proc_sched_set_task(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4653f71..80906fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1138,7 +1138,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.p = p,
 
 		.src_cpu = task_cpu(p),
-		.src_nid = cpu_to_node(task_cpu(p)),
+		.src_nid = task_node(p),
 
 		.imbalance_pct = 112,
 
@@ -1510,6 +1510,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 	if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
 		numa_migrate_preferred(p);
 
+	if (migrated)
+		p->numa_pages_migrated += pages;
+
 	p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 48/50] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

This patch separately considers task and group affinities when searching
for swap candidates during task NUMA placement. If the tasks are not part
of any group, or are part of the same group, the task weights are
compared. Otherwise the group weights are compared.
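
Condensed, the selection rule this introduces in task_numa_compare() (see
the diff below for the full context and locking) is:

	/* Same group, or no groups involved: compare task weights */
	if (!cur->numa_group || !env->p->numa_group ||
	    cur->numa_group == env->p->numa_group)
		imp = taskimp + task_weight(cur, env->src_nid) -
		      task_weight(cur, env->dst_nid);
	else	/* Different groups: compare group weights instead */
		imp = groupimp + group_weight(cur, env->src_nid) -
		      group_weight(cur, env->dst_nid);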

Not-signed-off-by: Rik van Riel
---
 kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80906fa..fdb7923 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,13 +1039,15 @@ static void task_numa_assign(struct task_numa_env *env,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env, long imp)
+static void task_numa_compare(struct task_numa_env *env,
+			      long taskimp, long groupimp)
 {
 	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long dst_load, src_load;
 	long load;
+	long imp = (groupimp > 0) ? groupimp : taskimp;
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
@@ -1064,10 +1066,19 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
 		if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
 			goto unlock;
 
-		imp += task_weight(cur, env->src_nid) +
-		       group_weight(cur, env->src_nid) -
-		       task_weight(cur, env->dst_nid) -
-		       group_weight(cur, env->dst_nid);
+		/*
+		 * If dst and source tasks are in the same NUMA group, or not
+		 * in any group then look only at task weights otherwise give
+		 * priority to the group weights.
+		 */
+		if (!cur->numa_group || ! env->p->numa_group ||
+		    cur->numa_group == env->p->numa_group) {
+			imp = taskimp + task_weight(cur, env->src_nid) -
+			      task_weight(cur, env->dst_nid);
+		} else {
+			imp = groupimp + group_weight(cur, env->src_nid) -
+			       group_weight(cur, env->dst_nid);
+		}
 	}
 
 	if (imp < env->best_imp)
@@ -1117,7 +1128,8 @@ unlock:
 	rcu_read_unlock();
 }
 
-static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+static void task_numa_find_cpu(struct task_numa_env *env,
+				long taskimp, long groupimp)
 {
 	int cpu;
 
@@ -1127,7 +1139,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, long imp)
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, imp);
+		task_numa_compare(env, taskimp, groupimp);
 	}
 }
 
@@ -1147,9 +1159,9 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1
 	};
  	struct sched_domain *sd;
-	unsigned long weight;
+	unsigned long taskweight, groupweight;
 	int nid, ret;
-	long imp;
+	long taskimp, groupimp;
 
 	/*
 	 * Find the lowest common scheduling domain covering the nodes of both
@@ -1164,10 +1176,12 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 	rcu_read_unlock();
 
-	weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
+	taskweight = task_weight(p, env.src_nid);
+	groupweight = group_weight(p, env.src_nid);
 	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
+	taskimp = task_weight(p, env.dst_nid) - taskweight;
+	groupimp = group_weight(p, env.dst_nid) - groupweight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/*
@@ -1175,20 +1189,21 @@ static int task_numa_migrate(struct task_struct *p)
 	 * alternative node with relatively better statistics.
 	 */
 	if (env.dst_stats.has_capacity) {
-		task_numa_find_cpu(&env, imp);
+		task_numa_find_cpu(&env, taskimp, groupimp);
 	} else {
 		for_each_online_node(nid) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
 			/* Only consider nodes where both task and groups benefit */
-			imp = task_weight(p, nid) + group_weight(p, nid) - weight;
-			if (imp < 0)
+			taskimp = task_weight(p, nid) - taskweight;
+			groupimp = group_weight(p, nid) - groupweight;
+			if (taskimp < 0 && groupimp < 0)
 				continue;
 
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, imp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
@@ -4627,10 +4642,9 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
 	if (dst_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is better. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) >
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If both task and group weight improve, this move is a winner. */
+	if (task_weight(p, dst_nid) > task_weight(p, src_nid) &&
+	    group_weight(p, dst_nid) > group_weight(p, src_nid))
 		return true;
 
 	return false;
@@ -4657,10 +4671,9 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	if (src_nid == p->numa_preferred_nid)
 		return true;
 
-	/* After the task has settled, check if the new node is worse. */
-	if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
-			task_weight(p, dst_nid) + group_weight(p, dst_nid) <
-			task_weight(p, src_nid) + group_weight(p, src_nid))
+	/* If either task or group weight get worse, don't do it. */
+	if (task_weight(p, dst_nid) < task_weight(p, src_nid) ||
+	    group_weight(p, dst_nid) < group_weight(p, src_nid))
 		return true;
 
 	return false;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 49/50] sched: numa: fix task or group comparison
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

This patch should probably be folded into

commit 77e0ecbbc5cf0a84764be88b9de5ff13e4338163
Author: Rik van Riel <riel@redhat.com>
Date:   Tue Aug 27 21:52:47 2013 +0100

    sched: numa: Decide whether to favour task or group weights based on swap candidate relationships

This patch separately considers task and group affinities when
searching for swap candidates during NUMA placement. If tasks
are part of the same group, or no group at all, the task weights
are considered.

Some hysteresis is added to prevent tasks within one group from
getting bounced between NUMA nodes due to tiny differences.

If tasks are part of different groups, the code compares group
weights, in order to favor grouping task groups together.

The patch also changes the group weight multiplier to be the
same as the task weight multiplier, since the two are no longer
added up like before.
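
As a concrete example of the hysteresis: with imp -= imp/16, a within-group
improvement of 160 is scaled down to 150, so small differences in fault
weights no longer cause tasks in the same group to be swapped back and forth.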

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdb7923..ac7184d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -962,7 +962,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 	if (!total_faults)
 		return 0;
 
-	return 1200 * group_faults(p, nid) / total_faults;
+	return 1000 * group_faults(p, nid) / total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1068,16 +1068,34 @@ static void task_numa_compare(struct task_numa_env *env,
 
 		/*
 		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights otherwise give
-		 * priority to the group weights.
+		 * in any group then look only at task weights.
 		 */
-		if (!cur->numa_group || ! env->p->numa_group ||
-		    cur->numa_group == env->p->numa_group) {
+		if (cur->numa_group == env->p->numa_group) {
 			imp = taskimp + task_weight(cur, env->src_nid) -
 			      task_weight(cur, env->dst_nid);
+			/*
+			 * Add some hysteresis to prevent swapping the
+			 * tasks within a group over tiny differences.
+			 */
+			if (cur->numa_group)
+				imp -= imp/16;
 		} else {
-			imp = groupimp + group_weight(cur, env->src_nid) -
-			       group_weight(cur, env->dst_nid);
+			/*
+			 * Compare the group weights. If a task is all by
+			 * itself (not part of a group), use the task weight
+			 * instead.
+			 */
+			if (env->p->numa_group)
+				imp = groupimp;
+			else
+				imp = taskimp;
+
+			if (cur->numa_group)
+				imp += group_weight(cur, env->src_nid) -
+				       group_weight(cur, env->dst_nid);
+			else
+				imp += task_weight(cur, env->src_nid) -
+				       task_weight(cur, env->dst_nid);
 		}
 	}
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* [PATCH 50/50] sched: numa: Avoid migrating tasks that are placed on their preferred node
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-10  9:32   ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-10  9:32 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

(This changelog needs more work, it's currently inaccurate and it's not
clear at exactly what point rt > env->fbq_type is true for the logic to
kick in)

This patch classifies scheduler domains and runqueues into FBQ (cannot
guess what this expands to) types which are one of

regular: There are tasks running that do not care about their NUMA
	placement

remote: There are tasks running that care about their placement but are
	currently running on a node remote to their ideal placement

all: No distinction

To implement this the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of tasks running that
care about their placement (rq->nr_numa_running). The load balancer uses this
information to avoid migrating ideally placed NUMA tasks as long as better
options for load balancing exist.
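
Schematically, the per-runqueue classification (fbq_classify_rq() in the
diff below) reduces to:

	/* Invariant: nr_running >= nr_numa_running >= nr_preferred_running */
	if (rq->nr_running > rq->nr_numa_running)
		return regular;		/* some tasks do not care about placement */
	if (rq->nr_running > rq->nr_preferred_running)
		return remote;		/* NUMA tasks are running on the wrong node */
	return all;			/* every task is on its preferred node */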

Not-signed-off-by: Peter Zijlstra
---
 kernel/sched/core.c  |  29 ++++++++++++
 kernel/sched/fair.c  | 128 ++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |   5 ++
 3 files changed, 150 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bf0827..3fc31b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4485,6 +4485,35 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
 
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
+
+/*
+ * Requeue a task on a given node and accurately track the number of NUMA
+ * tasks on the runqueues
+ */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+	struct rq *rq;
+	unsigned long flags;
+	bool on_rq, running;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->numa_preferred_nid = nid;
+	p->numa_migrate_seq = 1;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
 #endif
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7184d..27bc89b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,18 @@ static unsigned int task_scan_max(struct task_struct *p)
  */
 unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
 
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running += (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p));
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_numa_running -= (p->numa_preferred_nid != -1);
+	rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
+}
+
 struct numa_group {
 	atomic_t refcount;
 
@@ -1229,6 +1241,8 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
+	sched_setnuma(p, env.dst_nid);
+
 	if (env.best_task == NULL) {
 		int ret = migrate_task_to(p, env.best_cpu);
 		return ret;
@@ -1340,8 +1354,7 @@ static void task_numa_placement(struct task_struct *p)
 	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
 		/* Update the preferred nid and migrate task if possible */
-		p->numa_preferred_nid = max_nid;
-		p->numa_migrate_seq = 1;
+		sched_setnuma(p, max_nid);
 		numa_migrate_preferred(p);
 	}
 }
@@ -1736,6 +1749,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1745,8 +1766,12 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_enqueue(rq, task_of(se));
+		list_add(&se->group_node, &rq->cfs_tasks);
+	}
 #endif
 	cfs_rq->nr_running++;
 }
@@ -1757,8 +1782,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -4553,6 +4580,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
+enum fbq_type { regular, remote, all };
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -4579,6 +4608,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	enum fbq_type		fbq_type;
 };
 
 /*
@@ -5044,6 +5075,10 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 };
 
 /*
@@ -5335,6 +5370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+		sgs->nr_numa_running += rq->nr_numa_running;
+		sgs->nr_preferred_running += rq->nr_preferred_running;
+#endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
@@ -5409,14 +5448,43 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running > sgs->nr_numa_running)
+		return regular;
+	if (sgs->sum_nr_running > sgs->nr_preferred_running)
+		return remote;
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	if (rq->nr_running > rq->nr_numa_running)
+		return regular;
+	if (rq->nr_running > rq->nr_preferred_running)
+		return remote;
+	return all;
+}
+#else
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+	return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+	return regular;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
  * @balance: Should we balance.
  * @sds: variable to hold the statistics for this sched_domain.
  */
-static inline void update_sd_lb_stats(struct lb_env *env,
-					struct sd_lb_stats *sds)
+static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -5466,6 +5534,9 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
+
+	if (env->sd->flags & SD_NUMA)
+		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 }
 
 /**
@@ -5768,15 +5839,47 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power,
-							   SCHED_POWER_SCALE);
-		unsigned long wl;
+		unsigned long power, capacity, wl;
+		enum fbq_type rt;
 
+		rq = cpu_rq(i);
+		rt = fbq_classify_rq(rq);
+
+#ifdef CONFIG_NUMA_BALANCING
+		trace_printk("group(%d:%pc) rq(%d): wl: %lu nr: %d nrn: %d nrp: %d gt:%d rt:%d\n",
+				env->sd->level, sched_group_cpus(group), i,
+				weighted_cpuload(i), rq->nr_running,
+				rq->nr_numa_running, rq->nr_preferred_running,
+				env->fbq_type, rt);
+#endif
+
+		/*
+		 * We classify groups/runqueues into three groups:
+		 *  - regular: there are !numa tasks
+		 *  - remote:  there are numa tasks that run on the 'wrong' node
+		 *  - all:     there is no distinction
+		 *
+		 * In order to avoid migrating ideally placed numa tasks,
+		 * ignore those when there's better options.
+		 *
+		 * If we ignore the actual busiest queue to migrate another
+		 * task, the next balance pass can still reduce the busiest
+		 * queue by moving tasks around inside the node.
+		 *
+		 * If we cannot move enough load due to this classification
+		 * the next pass will adjust the group classification and
+		 * allow migration of more tasks.
+		 *
+		 * Both cases only affect the total convergence complexity.
+		 */
+		if (rt > env->fbq_type)
+			continue;
+
+		power = power_of(i);
+		capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
 		if (!capacity)
 			capacity = fix_small_capacity(env->sd, group);
 
-		rq = cpu_rq(i);
 		wl = weighted_cpuload(i);
 
 		/*
@@ -5888,6 +5991,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.fbq_type	= all,
 	};
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c6ec25..b9bcea5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -407,6 +407,10 @@ struct rq {
 	 * remote CPUs use both these fields when doing load calculation.
 	 */
 	unsigned int nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned long last_load_update_tick;
@@ -555,6 +559,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
 extern void task_numa_free(struct task_struct *p);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream
  2013-09-10  9:31   ` Mel Gorman
@ 2013-09-11  0:58     ` Joonsoo Kim
  -1 siblings, 0 replies; 361+ messages in thread
From: Joonsoo Kim @ 2013-09-11  0:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:31:41AM +0100, Mel Gorman wrote:
> @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
>  
>  static int active_load_balance_cpu_stop(void *data);
>  
> +static int should_we_balance(struct lb_env *env)
> +{
> +	struct sched_group *sg = env->sd->groups;
> +	struct cpumask *sg_cpus, *sg_mask;
> +	int cpu, balance_cpu = -1;
> +
> +	/*
> +	 * In the newly idle case, we will allow all the cpu's
> +	 * to do the newly idle load balance.
> +	 */
> +	if (env->idle == CPU_NEWLY_IDLE)
> +		return 1;
> +
> +	sg_cpus = sched_group_cpus(sg);
> +	sg_mask = sched_group_mask(sg);
> +	/* Try to find first idle cpu */
> +	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> +		if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> +			continue;
> +
> +		balance_cpu = cpu;
> +		break;
> +	}
> +
> +	if (balance_cpu == -1)
> +		balance_cpu = group_balance_cpu(sg);
> +
> +	/*
> +	 * First idle cpu or the first cpu(busiest) in this sched group
> +	 * is eligible for doing load balancing at this and above domains.
> +	 */
> +	return balance_cpu != env->dst_cpu;
> +}
> +

Hello, Mel.

There is one mistake of mine in this hunk.
The last return statement in should_we_balance() should be
'return balance_cpu == env->dst_cpu'. The fix was submitted yesterday.

You can get more information in the thread below.
https://lkml.org/lkml/2013/9/10/1
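
That is, the end of should_we_balance() should read:

	/*
	 * First idle cpu or the first cpu(busiest) in this sched group
	 * is eligible for doing load balancing at this and above domains.
	 */
	return balance_cpu == env->dst_cpu;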

I think that this fix is somewhat important to the scheduler's behavior,
so it may be better to update your test results with this fix.
Sorry for the late notice.

Thanks.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7
  2013-09-10  9:31 ` Mel Gorman
                   ` (50 preceding siblings ...)
  (?)
@ 2013-09-11  2:03 ` Rik van Riel
  -1 siblings, 0 replies; 361+ messages in thread
From: Rik van Riel @ 2013-09-11  2:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]

On 09/10/2013 05:31 AM, Mel Gorman wrote:
> It has been a long time since V6 of this series and time for an update. Much
> of this is now stabilised with the most important addition being the inclusion
> of Peter and Rik's work on grouping tasks that share pages together.
> 
> This series has a number of goals. It reduces overhead of automatic balancing
> through scan rate reduction and the avoidance of TLB flushes. It selects a
> preferred node and moves tasks towards their memory as well as moving memory
> toward their task. It handles shared pages and groups related tasks together.

The attached two patches should fix the task grouping issues
we discussed on #mm earlier.

Now on to the load balancer. When specjbb takes up far fewer
CPUs than are available on a node, it is possible for
multiple specjbb processes to end up on the same NUMA node,
while the load balancer makes no attempt to move some of them
to completely idle nodes.

I have not yet figured out how to fix that behaviour...

-- 
All rights reversed

[-- Attachment #2: 0061-exec-leave-numa-group.patch --]
[-- Type: text/x-patch, Size: 2246 bytes --]

Subject: sched,numa: call task_numa_free from do_execve

It is possible for a task in a numa group to call exec, and
have the new (unrelated) executable inherit the numa group
association from its former self.

This has the potential to break numa grouping, and is trivial
to fix.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 fs/exec.c             | 1 +
 include/linux/sched.h | 4 ++++
 kernel/sched/sched.h  | 5 -----
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ffd7a81..a6da73b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1548,6 +1548,7 @@ static int do_execve_common(const char *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	acct_update_integrals(current);
+	task_numa_free(current);
 	free_bprm(bprm);
 	if (displaced)
 		put_files_struct(displaced);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 97df20f..44a7cc7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1453,6 +1453,7 @@ struct task_struct {
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
+extern void task_numa_free(struct task_struct *p);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1465,6 +1466,9 @@ static inline pid_t task_numa_group_id(struct task_struct *p)
 static inline void set_numabalancing_state(bool enabled)
 {
 }
+static inline void task_numa_free(struct task_struct *p)
+{
+}
 #endif
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fae56e..93fa176 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,11 +559,6 @@ static inline u64 rq_clock_task(struct rq *rq)
 extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *, struct task_struct *);
-extern void task_numa_free(struct task_struct *p);
-#else /* CONFIG_NUMA_BALANCING */
-static inline void task_numa_free(struct task_struct *p)
-{
-}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP

[-- Attachment #3: 0062-numa-join-group-carefully.patch --]
[-- Type: text/x-patch, Size: 6167 bytes --]

Subject: sched,numa: be more careful about joining numa groups

Due to the way the pid is truncated, and tasks are moved between
CPUs by the scheduler, it is possible for the current task_numa_fault
to group together tasks that do not actually share memory.

This patch adds a few easy sanity checks to task_numa_fault, only
joining tasks into a group if they share the same tsk->mm, or if the
fault was on a page with an elevated mapcount in a shared VMA.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h |  6 ++++--
 kernel/sched/fair.c   | 23 +++++++++++++++++------
 mm/huge_memory.c      |  4 +++-
 mm/memory.c           |  8 ++++++--
 4 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 44a7cc7..de942a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1450,13 +1450,15 @@ struct task_struct {
 #define TNF_NO_GROUP	0x02
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, int flags);
+extern void task_numa_fault(int last_node, int node, int pages, int flags,
+			    struct vm_area_struct *vma, int mapcount);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
-				   int flags)
+				   int flags, struct vm_area_struct *vma,
+				   int mapcount)
 {
 }
 static inline pid_t task_numa_group_id(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b3d877..22e859f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1323,7 +1323,8 @@ static void double_lock(spinlock_t *l1, spinlock_t *l2)
 	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 }
 
-static void task_numa_group(struct task_struct *p, int cpu, int pid)
+static void task_numa_group(struct task_struct *p, int cpu, int pid,
+			    struct vm_area_struct *vma, int mapcount)
 {
 	struct numa_group *grp, *my_grp;
 	struct task_struct *tsk;
@@ -1380,10 +1381,19 @@ static void task_numa_group(struct task_struct *p, int cpu, int pid)
 	if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
 		goto unlock;
 
-	if (!get_numa_group(grp))
-		goto unlock;
+	/* Always join threads in the same process. */
+	if (tsk->mm == current->mm)
+		join = true;
+
+	/*
+	 * Simple filter to avoid false positives due to PID collisions,
+	 * accesses on KSM shared pages, etc...
+	 */
+	if (mapcount > 1 && (vma->vm_flags & VM_SHARED))
+		join = true;
 
-	join = true;
+	if (join && !get_numa_group(grp))
+		join = false;
 
 unlock:
 	rcu_read_unlock();
@@ -1437,7 +1447,8 @@ void task_numa_free(struct task_struct *p)
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int last_cpupid, int node, int pages, int flags)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags,
+		     struct vm_area_struct *vma, int mapcount)
 {
 	struct task_struct *p = current;
 	bool migrated = flags & TNF_MIGRATED;
@@ -1478,7 +1489,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
 
 		priv = (pid == (p->pid & LAST__PID_MASK));
 		if (!priv && !(flags & TNF_NO_GROUP))
-			task_numa_group(p, cpu, pid);
+			task_numa_group(p, cpu, pid, vma, mapcount);
 	}
 
 	/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f883df..a175191 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,6 +1298,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool page_locked;
 	bool migrated = false;
 	int flags = 0;
+	int mapcount = 0;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1306,6 +1307,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = pmd_page(pmd);
 	BUG_ON(is_huge_zero_page(page));
 	page_nid = page_to_nid(page);
+	mapcount = page_mapcount(page);
 	last_cpupid = page_cpupid_last(page);
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	if (page_nid == this_nid)
@@ -1388,7 +1390,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags, vma, mapcount);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 2e1d43b..8cef83c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3545,6 +3545,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid;
 	bool migrated = false;
 	int flags = 0;
+	int mapcount = 0;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3583,6 +3584,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
+	mapcount = page_mapcount(page);
 	target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 	pte_unmap_unlock(ptep, ptl);
 	if (target_nid == -1) {
@@ -3599,7 +3601,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 out:
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, 1, flags);
+		task_numa_fault(last_cpupid, page_nid, 1, flags, vma, mapcount);
 	return 0;
 }
 
@@ -3641,6 +3643,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int target_nid;
 		bool migrated = false;
 		int flags = 0;
+		int mapcount;
 
 		if (!pte_present(pteval))
 			continue;
@@ -3670,6 +3673,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		last_cpupid = page_cpupid_last(page);
 		page_nid = page_to_nid(page);
+		mapcount = page_mapcount(page);
 		target_nid = numa_migrate_prep(page, vma, addr, page_nid);
 		pte_unmap_unlock(pte, ptl);
 		if (target_nid != -1) {
@@ -3683,7 +3687,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 
 		if (page_nid != -1)
-			task_numa_fault(last_cpupid, page_nid, 1, flags);
+			task_numa_fault(last_cpupid, page_nid, 1, flags, vma, mapcount);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}

^ permalink raw reply related	[flat|nested] 361+ messages in thread
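
Condensed from the hunk above, the join decision the patch introduces boils
down to the following check (the names are taken from the quoted diff; this
is a readability summary, not code beyond what the patch adds):

	/* condensed: when may current be pulled into tsk's numa_group? */
	join = (tsk->mm == current->mm) ||
	       (mapcount > 1 && (vma->vm_flags & VM_SHARED));

	/* and only if a reference on the group can still be taken */
	if (join && !get_numa_group(grp))
		join = false;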

* Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream
  2013-09-10  9:31   ` Mel Gorman
@ 2013-09-11  3:11     ` Hillf Danton
  -1 siblings, 0 replies; 361+ messages in thread
From: Hillf Danton @ 2013-09-11  3:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 5:31 PM, Mel Gorman <mgorman@suse.de> wrote:
> @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
>
>  static int active_load_balance_cpu_stop(void *data);
>
> +static int should_we_balance(struct lb_env *env)
> +{
> +       struct sched_group *sg = env->sd->groups;
> +       struct cpumask *sg_cpus, *sg_mask;
> +       int cpu, balance_cpu = -1;
> +
> +       /*
> +        * In the newly idle case, we will allow all the cpu's
> +        * to do the newly idle load balance.
> +        */
> +       if (env->idle == CPU_NEWLY_IDLE)
> +               return 1;
> +
> +       sg_cpus = sched_group_cpus(sg);
> +       sg_mask = sched_group_mask(sg);
> +       /* Try to find first idle cpu */
> +       for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> +               if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> +                       continue;
> +
> +               balance_cpu = cpu;
> +               break;
> +       }
> +
> +       if (balance_cpu == -1)
> +               balance_cpu = group_balance_cpu(sg);
> +
> +       /*
> +        * First idle cpu or the first cpu(busiest) in this sched group
> +        * is eligible for doing load balancing at this and above domains.
> +        */
> +       return balance_cpu != env->dst_cpu;

FYI: Here is a bug reported by Dave Chinner.
https://lkml.org/lkml/2013/9/10/1

And let's see if there are any changes in your SpecJBB results without it.

Hillf

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-12  2:10     ` Hillf Danton
  -1 siblings, 0 replies; 361+ messages in thread
From: Hillf Danton @ 2013-09-12  2:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

Hello Mel

On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <mgorman@suse.de> wrote:
> Currently automatic NUMA balancing is unable to distinguish between false
> shared versus private pages except by ignoring pages with an elevated
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose task is using them but that is ignored quite a lot of data.
>
> This patch kicks away the training wheels in preparation for adding support
> for identifying shared/private pages is now in place. The ordering is so
> that the impact of the shared/private detection can be easily measured. Note
> that the patch does not migrate shared, file-backed within vmas marked
> VM_EXEC as these are generally shared library pages. Migrating such pages
> is not beneficial as there is an expectation they are read-shared between
> caches and iTLB and iCache pressure is generally low.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
[...]
> @@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         int page_lru = page_is_file_cache(page);
>
>         /*
> -        * Don't migrate pages that are mapped in multiple processes.
> -        * TODO: Handle false sharing detection instead of this hammer
> -        */
> -       if (page_mapcount(page) != 1)
> -               goto out_dropref;
> -
Is there an rmap walk when migrating THP?

> -       /*
>          * Rate-limit the amount of data that is being migrated to a node.
>          * Optimal placement is no good if the memory bus is saturated and
>          * all the time is being spent migrating!

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-12 12:42     ` Hillf Danton
  -1 siblings, 0 replies; 361+ messages in thread
From: Hillf Danton @ 2013-09-12 12:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

Hello Mel

On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <mgorman@suse.de> wrote:
>
> +void task_numa_free(struct task_struct *p)
> +{
> +       struct numa_group *grp = p->numa_group;
> +       int i;
> +
> +       kfree(p->numa_faults);
> +
> +       if (grp) {
> +               for (i = 0; i < 2*nr_node_ids; i++)
> +                       atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
use after free, numa_faults ;/

> +               spin_lock(&grp->lock);
> +               list_del(&p->numa_entry);
> +               grp->nr_tasks--;
> +               spin_unlock(&grp->lock);
> +               rcu_assign_pointer(p->numa_group, NULL);
> +               put_numa_group(grp);
> +       }
> +}
> +

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-12 12:45     ` Hillf Danton
  -1 siblings, 0 replies; 361+ messages in thread
From: Hillf Danton @ 2013-09-12 12:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

Hello Mel

On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <mgorman@suse.de> wrote:
>
> +void task_numa_free(struct task_struct *p)
> +{
> +       struct numa_group *grp = p->numa_group;
> +       int i;
> +
> +       kfree(p->numa_faults);
> +
> +       if (grp) {
> +               for (i = 0; i < 2*nr_node_ids; i++)
> +                       atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
use after free :/

> +               spin_lock(&grp->lock);
> +               list_del(&p->numa_entry);
> +               grp->nr_tasks--;
> +               spin_unlock(&grp->lock);
> +               rcu_assign_pointer(p->numa_group, NULL);
> +               put_numa_group(grp);
> +       }
> +}
> +

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults
  2013-09-12 12:42     ` Hillf Danton
@ 2013-09-12 14:40       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-12 14:40 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Sep 12, 2013 at 08:42:18PM +0800, Hillf Danton wrote:
> Hello Mel
> 
> On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > +void task_numa_free(struct task_struct *p)
> > +{
> > +       struct numa_group *grp = p->numa_group;
> > +       int i;
> > +
> > +       kfree(p->numa_faults);
> > +
> > +       if (grp) {
> > +               for (i = 0; i < 2*nr_node_ids; i++)
> > +                       atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> > +
> use after free, numa_faults ;/
> 

It gets fixed in the patch "sched: numa: use group fault statistics in
numa placement" but I agree that it's the wrong place to fix it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread
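
The use-after-free Hillf points out is purely an ordering problem:
p->numa_faults is read inside the if (grp) block after it has already been
freed. A sketch of the reordered function, based only on the hunk quoted
above (as Mel notes, the actual fix happens in a later patch in the series):

	void task_numa_free(struct task_struct *p)
	{
		struct numa_group *grp = p->numa_group;
		int i;

		if (grp) {
			/* numa_faults is still valid at this point */
			for (i = 0; i < 2*nr_node_ids; i++)
				atomic_long_sub(p->numa_faults[i], &grp->faults[i]);

			spin_lock(&grp->lock);
			list_del(&p->numa_entry);
			grp->nr_tasks--;
			spin_unlock(&grp->lock);
			rcu_assign_pointer(p->numa_group, NULL);
			put_numa_group(grp);
		}

		/* only free the per-task fault array once nothing reads it */
		kfree(p->numa_faults);
	}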

* Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream
  2013-09-11  3:11     ` Hillf Danton
@ 2013-09-13  8:11       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-13  8:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Wed, Sep 11, 2013 at 11:11:03AM +0800, Hillf Danton wrote:
> On Tue, Sep 10, 2013 at 5:31 PM, Mel Gorman <mgorman@suse.de> wrote:
> > @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
> >
> >  static int active_load_balance_cpu_stop(void *data);
> >
> > +static int should_we_balance(struct lb_env *env)
> > +{
> > +       struct sched_group *sg = env->sd->groups;
> > +       struct cpumask *sg_cpus, *sg_mask;
> > +       int cpu, balance_cpu = -1;
> > +
> > +       /*
> > +        * In the newly idle case, we will allow all the cpu's
> > +        * to do the newly idle load balance.
> > +        */
> > +       if (env->idle == CPU_NEWLY_IDLE)
> > +               return 1;
> > +
> > +       sg_cpus = sched_group_cpus(sg);
> > +       sg_mask = sched_group_mask(sg);
> > +       /* Try to find first idle cpu */
> > +       for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> > +               if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> > +                       continue;
> > +
> > +               balance_cpu = cpu;
> > +               break;
> > +       }
> > +
> > +       if (balance_cpu == -1)
> > +               balance_cpu = group_balance_cpu(sg);
> > +
> > +       /*
> > +        * First idle cpu or the first cpu(busiest) in this sched group
> > +        * is eligible for doing load balancing at this and above domains.
> > +        */
> > +       return balance_cpu != env->dst_cpu;
> 
> FYI: Here is a bug reported by Dave Chinner.
> https://lkml.org/lkml/2013/9/10/1
> 
> And let's see if there are any changes in your SpecJBB results without it.
> 

Thanks for pointing that out. I've picked up the one-liner fix.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount
  2013-09-12  2:10     ` Hillf Danton
@ 2013-09-13  8:11       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-13  8:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Thu, Sep 12, 2013 at 10:10:13AM +0800, Hillf Danton wrote:
> Hello Mel
> 
> On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <mgorman@suse.de> wrote:
> > Currently automatic NUMA balancing is unable to distinguish between false
> > shared versus private pages except by ignoring pages with an elevated
> > page_mapcount entirely. This avoids shared pages bouncing between the
> > nodes whose task is using them but that is ignored quite a lot of data.
> >
> > This patch kicks away the training wheels in preparation for adding support
> > for identifying shared/private pages is now in place. The ordering is so
> > that the impact of the shared/private detection can be easily measured. Note
> > that the patch does not migrate shared, file-backed within vmas marked
> > VM_EXEC as these are generally shared library pages. Migrating such pages
> > is not beneficial as there is an expectation they are read-shared between
> > caches and iTLB and iCache pressure is generally low.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> [...]
> > @@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >         int page_lru = page_is_file_cache(page);
> >
> >         /*
> > -        * Don't migrate pages that are mapped in multiple processes.
> > -        * TODO: Handle false sharing detection instead of this hammer
> > -        */
> > -       if (page_mapcount(page) != 1)
> > -               goto out_dropref;
> > -
> Is there an rmap walk when migrating THP?
> 

Should not be necessary for THP.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7
  2013-09-10  9:31 ` Mel Gorman
@ 2013-09-14  2:57   ` Bob Liu
  -1 siblings, 0 replies; 361+ messages in thread
From: Bob Liu @ 2013-09-14  2:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

Hi Mel,

On 09/10/2013 05:31 PM, Mel Gorman wrote:
> It has been a long time since V6 of this series and time for an update. Much
> of this is now stabilised with the most important addition being the inclusion
> of Peter and Rik's work on grouping tasks that share pages together.
> 
> This series has a number of goals. It reduces overhead of automatic balancing
> through scan rate reduction and the avoidance of TLB flushes. It selects a
> preferred node and moves tasks towards their memory as well as moving memory
> toward their task. It handles shared pages and groups related tasks together.
> 

I have found that NUMA balancing can sometimes be broken after khugepaged
starts, because during a collapse khugepaged always allocates the huge
page from the node of the first normal page it scanned.

A simple use case is a user running an application interleaved across all
nodes with "numactl --interleave=all xxxx".
After khugepaged starts, most of that application's pages end up on
one specific node.

I have a simple patch that fixes this issue in the thread:
[PATCH 2/2] mm: thp: khugepaged: add policy for finding target node

I think this may be related to this topic; I don't know whether this
series also fixes the issue I mentioned.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 361+ messages in thread
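
To make the failure mode concrete: when khugepaged collapses 512 base pages
into one huge page, taking the node of the first page it scanned can undo an
interleaved placement. A policy along the lines Bob describes would pick the
node that already backs most of the pages being collapsed. The program below
is a standalone illustration of that idea; the page layout, node count and
helper name are invented for the example and it is not Bob's actual patch:

	#include <stdio.h>

	#define NR_NODES     4		/* hypothetical machine */
	#define HPAGE_PAGES  512	/* base pages per 2MB huge page */

	/* pick the node backing the most base pages in the collapse range */
	static int find_target_node(const int node_of_page[], int nr_pages)
	{
		int count[NR_NODES] = { 0 };
		int i, best = 0;

		for (i = 0; i < nr_pages; i++)
			count[node_of_page[i]]++;
		for (i = 1; i < NR_NODES; i++)
			if (count[i] > count[best])
				best = i;
		return best;
	}

	int main(void)
	{
		int node_of_page[HPAGE_PAGES];
		int i;

		/* made-up layout: first few pages on node 0, the bulk on node 2 */
		for (i = 0; i < HPAGE_PAGES; i++)
			node_of_page[i] = (i < 8) ? 0 : 2;

		printf("first-page policy picks node %d\n", node_of_page[0]);
		printf("majority policy picks node   %d\n",
		       find_target_node(node_of_page, HPAGE_PAGES));
		return 0;
	}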

* Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-10  9:31   ` Mel Gorman
@ 2013-09-16 12:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-16 12:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> large difference when estimating the cost of automatic NUMA balancing and
> can be misleading when comparing results that had collapsed versus split
> THP. This patch addresses the accounting issue.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/mprotect.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94722a4..2bbb648 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  				split_huge_page_pmd(vma, addr, pmd);
>  			else if (change_huge_pmd(vma, pmd, addr, newprot,
>  						 prot_numa)) {
> -				pages += HPAGE_PMD_NR;
> +				pages++;

But now you're not counting pages anymore..

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-16 12:36     ` Peter Zijlstra
@ 2013-09-16 13:39       ` Rik van Riel
  -1 siblings, 0 replies; 361+ messages in thread
From: Rik van Riel @ 2013-09-16 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
>> A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
>> large difference when estimating the cost of automatic NUMA balancing and
>> can be misleading when comparing results that had collapsed versus split
>> THP. This patch addresses the accounting issue.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>>  mm/mprotect.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 94722a4..2bbb648 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>  				split_huge_page_pmd(vma, addr, pmd);
>>  			else if (change_huge_pmd(vma, pmd, addr, newprot,
>>  						 prot_numa)) {
>> -				pages += HPAGE_PMD_NR;
>> +				pages++;
> 
> But now you're not counting pages anymore..

The migrate statistics still count pages. That makes sense, since the
amount of work scales with the amount of memory moved.

It is just the "number of faults" counters that actually count the
number of faults again, instead of the number of pages represented
by each fault.

IMHO this change makes sense.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 361+ messages in thread
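
A small standalone illustration of the difference this makes to the vmstat
numbers being compared, with a hypothetical count of PMD updates and the
usual 512 base pages per THP on x86-64:

	#include <stdio.h>

	#define HPAGE_PMD_NR 512	/* base pages per THP */

	int main(void)
	{
		long pmd_updates = 1000;	/* hypothetical THP PMDs given hinting protection */

		/* old accounting: each THP PMD update reported as 512 pages */
		printf("counted as pages:       %ld\n", pmd_updates * HPAGE_PMD_NR);
		/* new accounting: each THP PMD update reported once */
		printf("counted as PTE updates: %ld\n", pmd_updates);
		return 0;
	}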

* Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-16 13:39       ` Rik van Riel
@ 2013-09-16 14:54         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-16 14:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Mon, Sep 16, 2013 at 09:39:59AM -0400, Rik van Riel wrote:
> On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> > On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> >> A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> >> large difference when estimating the cost of automatic NUMA balancing and
> >> can be misleading when comparing results that had collapsed versus split
> >> THP. This patch addresses the accounting issue.
> >>
> >> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >> ---
> >>  mm/mprotect.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index 94722a4..2bbb648 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>  				split_huge_page_pmd(vma, addr, pmd);
> >>  			else if (change_huge_pmd(vma, pmd, addr, newprot,
> >>  						 prot_numa)) {
> >> -				pages += HPAGE_PMD_NR;
> >> +				pages++;
> > 
> > But now you're not counting pages anymore..
> 
> The migrate statistics still count pages. That makes sense, since the
> amount of work scales with the amount of memory moved.

Right.

> It is just the "number of faults" counters that actually count the
> number of faults again, instead of the number of pages represented
> by each fault.

So you're suggesting s/pages/faults/ or somesuch?

> IMHO this change makes sense.

I never said the change didn't make sense as such. Just that we're no
longer counting pages in change_*_range().

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned
  2013-09-10  9:31   ` Mel Gorman
@ 2013-09-16 15:18     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-16 15:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:31:54AM +0100, Mel Gorman wrote:
> @@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	 * If pages are properly placed (did not migrate) then scan slower.
>  	 * This is reset periodically in case of phase changes
>  	 */
> -        if (!migrated)
> -		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
> +        if (!migrated) {
> +		/* Initialise if necessary */
> +		if (!p->numa_scan_period_max)
> +			p->numa_scan_period_max = task_scan_max(p);
> +
> +		p->numa_scan_period = min(p->numa_scan_period_max,
>  			p->numa_scan_period + jiffies_to_msecs(10));

So the next patch changes the jiffies_to_msec() thing.. is that really
worth a whole separate patch?

Also, I really don't believe any of that is 'right', increasing the scan
period by a fixed amount for every !migrated page is just wrong.

Firstly; there's the migration throttle which basically guarantees that
most pages aren't migrated -- even when they ought to be, thus inflating
the period.

Secondly; assume a _huge_ process, so large that even a small fraction
of non-migrated pages will completely clip the scan period.

^ permalink raw reply	[flat|nested] 361+ messages in thread
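
A standalone back-of-the-envelope calculation of how quickly the fixed +10ms
increment clips the period; the starting period and the maximum are
hypothetical values, not what task_scan_max() actually computes:

	#include <stdio.h>

	int main(void)
	{
		int period = 1000;	/* hypothetical numa_scan_period, in ms */
		int period_max = 60000;	/* hypothetical task_scan_max() result, in ms */
		long faults = 0;

		/* each properly placed (non-migrated) fault adds 10ms, as in the hunk */
		while (period < period_max) {
			period += 10;
			faults++;
		}

		printf("non-migrated faults needed to reach the maximum: %ld\n", faults);
		return 0;
	}

For a task whose working set generates hundreds of thousands of hinting
faults per pass, a few thousand properly placed pages are therefore enough
to pin the period at its maximum.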

* Re: [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned
  2013-09-16 15:18     ` Peter Zijlstra
@ 2013-09-16 15:40       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-16 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Mon, Sep 16, 2013 at 05:18:22PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:54AM +0100, Mel Gorman wrote:
> > @@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> >  	 * If pages are properly placed (did not migrate) then scan slower.
> >  	 * This is reset periodically in case of phase changes
> >  	 */
> > -        if (!migrated)
> > -		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
> > +        if (!migrated) {
> > +		/* Initialise if necessary */
> > +		if (!p->numa_scan_period_max)
> > +			p->numa_scan_period_max = task_scan_max(p);
> > +
> > +		p->numa_scan_period = min(p->numa_scan_period_max,
> >  			p->numa_scan_period + jiffies_to_msecs(10));
> 
> So the next patch changes the jiffies_to_msec() thing.. is that really
> worth a whole separate patch?
> 

No, I can collapse them.

> Also, I really don't believe any of that is 'right', increasing the scan
> period by a fixed amount for every !migrated page is just wrong.
> 

At the moment Rik and I are both looking at adapting the scan rate based
on whether the faults trapped since the last scan window were local or
remote faults. It should be able to sensibly adapt the scan rate
independently of the RSS of the process.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread
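
One possible shape of the local/remote-driven adaptation Mel mentions,
written as a standalone sketch; the thresholds, the adjustment steps and the
helper name are invented for illustration and are not what the series ends
up implementing:

	/*
	 * Hypothetical helper: adjust the scan period from the locality of
	 * the hinting faults trapped in the last scan window, so the rate
	 * is independent of the task's RSS.
	 */
	static unsigned int adapt_scan_period(unsigned int period,
					      unsigned int period_min,
					      unsigned int period_max,
					      unsigned long local_faults,
					      unsigned long remote_faults)
	{
		unsigned long total = local_faults + remote_faults;

		if (!total)
			return period;			/* no new information */

		if (local_faults * 10 >= total * 8)	/* >= 80% local: settled */
			period += period / 4;		/* scan 25% less often */
		else if (remote_faults * 2 >= total)	/* >= 50% remote: unsettled */
			period -= period / 4;		/* scan 25% more often */

		if (period < period_min)
			period = period_min;
		if (period > period_max)
			period = period_max;
		return period;
	}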

* Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-16 14:54         ` Peter Zijlstra
@ 2013-09-16 16:11           ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-16 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Mon, Sep 16, 2013 at 04:54:38PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 16, 2013 at 09:39:59AM -0400, Rik van Riel wrote:
> > On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> > >> A THP PMD update is accounted for as 512 pages updated in vmstat.  This is
> > >> large difference when estimating the cost of automatic NUMA balancing and
> > >> can be misleading when comparing results that had collapsed versus split
> > >> THP. This patch addresses the accounting issue.
> > >>
> > >> Signed-off-by: Mel Gorman <mgorman@suse.de>
> > >> ---
> > >>  mm/mprotect.c | 2 +-
> > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> > >> index 94722a4..2bbb648 100644
> > >> --- a/mm/mprotect.c
> > >> +++ b/mm/mprotect.c
> > >> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > >>  				split_huge_page_pmd(vma, addr, pmd);
> > >>  			else if (change_huge_pmd(vma, pmd, addr, newprot,
> > >>  						 prot_numa)) {
> > >> -				pages += HPAGE_PMD_NR;
> > >> +				pages++;
> > > 
> > > But now you're not counting pages anymore..
> > 
> > The migrate statistics still count pages. That makes sense, since the
> > amount of work scales with the amount of memory moved.
> 
> Right.
> 
> > It is just the "number of faults" counters that actually count the
> > number of faults again, instead of the number of pages represented
> > by each fault.
> 
> So you're suggesting s/pages/faults/ or somesuch?
> 

It's really the number of ptes that are updated.
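(For scale: with 4K base pages and 2M THP, HPAGE_PMD_NR works out to
2M / 4K = 512, which is why a single THP update used to inflate the
counter so badly.)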

> > IMHO this change makes sense.
> 
> I never said the change didn't make sense as such. Just that we're no
> longer counting pages in change_*_range().

well, it's still a THP page. Is it worth renaming?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-09-10  9:31   ` Mel Gorman
@ 2013-09-16 16:35     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-16 16:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:31:57AM +0100, Mel Gorman wrote:
> NUMA PTE scanning is expensive both in terms of the scanning itself and
> the TLB flush if there are any updates. Currently non-present PTEs are
> accounted for as an update and incurring a TLB flush where it is only
> necessary for anonymous migration entries. This patch addresses the
> problem and should reduce TLB flushes.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/mprotect.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1f9b54b..1e9cef0 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				make_migration_entry_read(&entry);
>  				set_pte_at(mm, addr, pte,
>  					swp_entry_to_pte(entry));
> +
> +				pages++;
>  			}
> -			pages++;
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();

Should we fold this into patch 7 ?

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update
  2013-09-16 16:11           ` Mel Gorman
@ 2013-09-16 16:37             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-16 16:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Mon, Sep 16, 2013 at 05:11:50PM +0100, Mel Gorman wrote:
> > I never said the change didn't make sense as such. Just that we're no
> > longer counting pages in change_*_range().
> 
> well, it's still a THP page. Is it worth renaming?

Dunno, the pedant in me needed to raise the issue :-)

^ permalink raw reply	[flat|nested] 361+ messages in thread

* 答复: [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries
  2013-09-10  9:32   ` Mel Gorman
  (?)
@ 2013-09-17  2:02   ` 张天飞
  2013-09-17  8:05       ` Mel Gorman
  -1 siblings, 1 reply; 361+ messages in thread
From: 张天飞 @ 2013-09-17  2:02 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

index fd724bc..5d244d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,16 @@ void task_numa_work(struct callback_head *work)
 		if (!vma_migratable(vma))
 			continue;
 
+		/*
+		 * Shared library pages mapped by multiple processes are not
+		 * migrated as it is expected they are cache replicated. Avoid
+		 * hinting faults in read-only file-backed mappings or the vdso
+		 * as migrating the pages will be of marginal benefit.
+		 */
+		if (!vma->vm_mm ||
+		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+			continue;
+
 
=> May I ask a question: should we also consider that some VMAs cannot be scanned for automatic NUMA balancing, e.g. those with
(VM_DONTEXPAND | VM_RESERVED | VM_INSERTPAGE |
 VM_NONLINEAR | VM_MIXEDMAP | VM_SAO)?

^ permalink raw reply related	[flat|nested] 361+ messages in thread

* Re: 答复: [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries
  2013-09-17  2:02   ` 答复: " 张天飞
@ 2013-09-17  8:05       ` Mel Gorman
  0 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-17  8:05 UTC (permalink / raw)
  To: 张天飞
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Tue, Sep 17, 2013 at 10:02:22AM +0800, 张天飞 wrote:
> index fd724bc..5d244d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1227,6 +1227,16 @@ void task_numa_work(struct callback_head *work)
>  		if (!vma_migratable(vma))
>  			continue;
>  
> +		/*
> +		 * Shared library pages mapped by multiple processes are not
> +		 * migrated as it is expected they are cache replicated. Avoid
> +		 * hinting faults in read-only file-backed mappings or the vdso
> +		 * as migrating the pages will be of marginal benefit.
> +		 */
> +		if (!vma->vm_mm ||
> +		    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
> +			continue;
> +
>  
> => May I ask a question: should we also consider that some VMAs cannot be scanned for automatic NUMA balancing, e.g. those with
> (VM_DONTEXPAND | VM_RESERVED | VM_INSERTPAGE |
>  VM_NONLINEAR | VM_MIXEDMAP | VM_SAO)?

The vma_migratable() check covers most of the other VMAs we do not care
about.  I do not see the point of checking for some of the VMA flags you
mention. Please state which of the additional flags you think should be
checked and why.
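
Going from memory, vma_migratable() already amounts to roughly the
following, so most of the exotic mappings never even reach the new check
(treat the flag list as an approximation and look at mempolicy.h for the
real thing):

static inline int vma_migratable(struct vm_area_struct *vma)
{
	/* device, pfn-remapped and hugetlb mappings cannot be migrated */
	if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP))
		return 0;
	/*
	 * Plus a check that file-backed mappings can have their pages
	 * allocated in a zone that migration is able to use.
	 */
	return 1;
}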

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: 答复: [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries
  2013-09-17  8:05       ` Mel Gorman
  (?)
@ 2013-09-17  8:22       ` Figo.zhang
  -1 siblings, 0 replies; 361+ messages in thread
From: Figo.zhang @ 2013-09-17  8:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: 张天飞,
	Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML


2013/9/17 Mel Gorman <mgorman@suse.de>

> On Tue, Sep 17, 2013 at 10:02:22AM +0800, 张天飞 wrote:
> > index fd724bc..5d244d0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1227,6 +1227,16 @@ void task_numa_work(struct callback_head *work)
> >               if (!vma_migratable(vma))
> >                       continue;
> >
> > +             /*
> > +              * Shared library pages mapped by multiple processes are
> not
> > +              * migrated as it is expected they are cache replicated.
> Avoid
> > +              * hinting faults in read-only file-backed mappings or the
> vdso
> > +              * as migrating the pages will be of marginal benefit.
> > +              */
> > +             if (!vma->vm_mm ||
> > +                 (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE))
> == (VM_READ)))
> > +                     continue;
> > +
> >
> > => May I ask a question: should we also consider that some VMAs cannot
> > be scanned for automatic NUMA balancing, e.g. those with
> > (VM_DONTEXPAND | VM_RESERVED | VM_INSERTPAGE |
> >                                 VM_NONLINEAR | VM_MIXEDMAP | VM_SAO)?
>
> The vma_migratable() check covers most of the other VMAs we do not care
> about.  I do not see the point of checking for some of the VMA flags you
> mention. Please state which of the additional flags you think should be
> checked and why.
>

=> We should also filter out VM_MIXEDMAP VMAs, because change_pte_range()
only sets pte_mknuma on normal mapped pages, so scanning them is mostly
wasted work.

Best,
Figo.zhang




>
> --
> Mel Gorman
> SUSE Labs
>


^ permalink raw reply	[flat|nested] 361+ messages in thread

* [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-17 14:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-17 14:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Oleg Nesterov, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 17 16:17:11 CEST 2013

The cpu hotplug lock is a purely reader biased read-write lock.

The current implementation uses global state, change it so the reader
side uses per-cpu state in the uncontended fast-path.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h |   33 ++++++++++++++-
 kernel/cpu.c        |  108 ++++++++++++++++++++++++++--------------------------
 2 files changed, 87 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	this_cpu_inc(__cpuhp_refcount);
+	/*
+	 * Order the refcount inc against the writer read; pairs with the full
+	 * barrier in cpu_hotplug_begin().
+	 */
+	smp_mb();
+	if (unlikely(__cpuhp_writer))
+		__get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	barrier();
+	this_cpu_dec(__cpuhp_refcount);
+	if (unlikely(__cpuhp_writer))
+		__put_online_cpus();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
 
-void get_online_cpus(void)
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
 {
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
+	if (__cpuhp_writer == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
 
+again:
+	/*
+	 * Ensure a pending reading has a 0 refcount.
+	 *
+	 * Without this a new reader that comes in before cpu_hotplug_begin()
+	 * reads the refcount will deadlock.
+	 */
+	this_cpu_dec(__cpuhp_refcount);
+	wait_event(cpuhp_wq, !__cpuhp_writer);
+
+	this_cpu_inc(__cpuhp_refcount);
+	/*
+	 * See get_online_cpu().
+	 */
+	smp_mb();
+	if (unlikely(__cpuhp_writer))
+		goto again;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void put_online_cpus(void)
+void __put_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	unsigned int refcnt = 0;
+	int cpu;
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	if (__cpuhp_writer == current)
+		return;
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	for_each_possible_cpu(cpu)
+		refcnt += per_cpu(__cpuhp_refcount, cpu);
 
+	if (!refcnt)
+		wake_up_process(__cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	__cpuhp_writer = current;
 
 	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
+		unsigned int refcnt = 0;
+		int cpu;
+
+		/*
+		 * Order the setting of writer against the reading of refcount;
+		 * pairs with the full barrier in get_online_cpus().
+		 */
+
+		set_current_state(TASK_UNINTERRUPTIBLE);
+
+		for_each_possible_cpu(cpu)
+			refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+		if (!refcnt)
 			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
+
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	__cpuhp_writer = NULL;
+	wake_up_all(&cpuhp_wq);
 }
 
 /*

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 37/50] sched: Introduce migrate_swap()
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-17 14:32     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-17 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:32:17AM +0100, Mel Gorman wrote:
> TODO: I'm fairly sure we can get rid of the wake_cpu != -1 test by keeping
> wake_cpu to the actual task cpu; just couldn't be bothered to think through
> all the cases.

> + * XXX worry about hotplug

Combined with the {get,put}_online_cpus() optimization patch, the below
should address the two outstanding issues.

Completely untested for now.. will try and get it some runtime later.

Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |   37 ++++++++++++++++++++-----------------
 kernel/sched/sched.h |    1 +
 2 files changed, 21 insertions(+), 17 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,7 +1035,7 @@ static void __migrate_swap_task(struct t
 		/*
 		 * Task isn't running anymore; make it appear like we migrated
 		 * it before it went to sleep. This means on wakeup we make the
-		 * previous cpu or targer instead of where it really is.
+		 * previous cpu our target instead of where it really is.
 		 */
 		p->wake_cpu = cpu;
 	}
@@ -1080,11 +1080,16 @@ static int migrate_swap_stop(void *data)
 }
 
 /*
- * XXX worry about hotplug
+ * Cross migrate two tasks
  */
 int migrate_swap(struct task_struct *cur, struct task_struct *p)
 {
-	struct migration_swap_arg arg = {
+	struct migration_swap_arg arg;
+	int ret = -EINVAL;
+
+	get_online_cpus();
+
+       	arg = (struct migration_swap_arg){
 		.src_task = cur,
 		.src_cpu = task_cpu(cur),
 		.dst_task = p,
@@ -1092,15 +1097,22 @@ int migrate_swap(struct task_struct *cur
 	};
 
 	if (arg.src_cpu == arg.dst_cpu)
-		return -EINVAL;
+		goto out;
+
+	if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
+		goto out;
 
 	if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
-		return -EINVAL;
+		goto out;
 
 	if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
-		return -EINVAL;
+		goto out;
+
+	ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
 
-	return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+out:
+	put_online_cpus();
+	return ret;
 }
 
 struct migration_arg {
@@ -1608,12 +1620,7 @@ try_to_wake_up(struct task_struct *p, un
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
 
-	if (p->wake_cpu != -1) {	/* XXX make this condition go away */
-		cpu = p->wake_cpu;
-		p->wake_cpu = -1;
-	}
-
-	cpu = select_task_rq(p, cpu, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
@@ -1699,10 +1706,6 @@ static void __sched_fork(struct task_str
 {
 	p->on_rq			= 0;
 
-#ifdef CONFIG_SMP
-	p->wake_cpu			= -1;
-#endif
-
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,6 +737,7 @@ static inline void __set_task_cpu(struct
 	 */
 	smp_wmb();
 	task_thread_info(p)->cpu = cpu;
+	p->wake_cpu = cpu;
 #endif
 }
 

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-17 14:30     ` Peter Zijlstra
@ 2013-09-17 16:20       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-17 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Oleg Nesterov, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 17, 2013 at 04:30:03PM +0200, Peter Zijlstra wrote:
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Sep 17 16:17:11 CEST 2013
> 
> The cpu hotplug lock is a purely reader biased read-write lock.
> 
> The current implementation uses global state, change it so the reader
> side uses per-cpu state in the uncontended fast-path.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  include/linux/cpu.h |   33 ++++++++++++++-
>  kernel/cpu.c        |  108 ++++++++++++++++++++++++++--------------------------
>  2 files changed, 87 insertions(+), 54 deletions(-)
> 
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>  
>  struct device;
>  
> @@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;
>  
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	this_cpu_inc(__cpuhp_refcount);
> +	/*
> +	 * Order the refcount inc against the writer read; pairs with the full
> +	 * barrier in cpu_hotplug_begin().
> +	 */
> +	smp_mb();
> +	if (unlikely(__cpuhp_writer))
> +		__get_online_cpus();
> +}
> +

If the problem with get_online_cpus() is the shared global state then a
full barrier in the fast path is still going to hurt. Granted, it will hurt
a lot less and there should be no lock contention.

However, what barrier in cpu_hotplug_begin is the comment referring to? The
other barrier is in the slowpath __get_online_cpus. Did you mean to do
a rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?
I'm assuming you are currently using a full barrier to guarantee that an
update of __cpuhp_writer will be visible so that get_online_cpus() blocks,
but I'm not 100% sure because of the comments.

> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	barrier();

Why is this barrier necessary? I could not find anything stating whether an
inline function is an implicit compiler barrier but, whether it is or not,
it's not clear why the barrier is needed here at all.

> +	this_cpu_dec(__cpuhp_refcount);
> +	if (unlikely(__cpuhp_writer))
> +		__put_online_cpus();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;
>  
>  #ifdef CONFIG_HOTPLUG_CPU
>  
> -static struct {
> -	struct task_struct *active_writer;
> -	struct mutex lock; /* Synchronizes accesses to refcount, */
> -	/*
> -	 * Also blocks the new readers during
> -	 * an ongoing cpu hotplug operation.
> -	 */
> -	int refcount;
> -} cpu_hotplug = {
> -	.active_writer = NULL,
> -	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> -	.refcount = 0,
> -};
> +struct task_struct *__cpuhp_writer = NULL;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
>  
> -void get_online_cpus(void)
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void __get_online_cpus(void)
>  {
> -	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> +	if (__cpuhp_writer == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
> -	cpu_hotplug.refcount++;
> -	mutex_unlock(&cpu_hotplug.lock);
>  
> +again:
> +	/*
> +	 * Ensure a pending reading has a 0 refcount.
> +	 *
> +	 * Without this a new reader that comes in before cpu_hotplug_begin()
> +	 * reads the refcount will deadlock.
> +	 */
> +	this_cpu_dec(__cpuhp_refcount);
> +	wait_event(cpuhp_wq, !__cpuhp_writer);
> +
> +	this_cpu_inc(__cpuhp_refcount);
> +	/*
> +	 * See get_online_cpu().
> +	 */
> +	smp_mb();
> +	if (unlikely(__cpuhp_writer))
> +		goto again;
>  }

If CPU hotplug operations are very frequent (or a stupid stress test) then
it's possible for a new hotplug operation to start (updating __cpuhp_writer)
before a caller to __get_online_cpus can update the refcount. Potentially
a caller to __get_online_cpus gets starved although as it only affects a
CPU hotplug stress test it may not be a serious issue.

> -EXPORT_SYMBOL_GPL(get_online_cpus);
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>  
> -void put_online_cpus(void)
> +void __put_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> +	unsigned int refcnt = 0;
> +	int cpu;
>  
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	if (__cpuhp_writer == current)
> +		return;
>  
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +	for_each_possible_cpu(cpu)
> +		refcnt += per_cpu(__cpuhp_refcount, cpu);
>  

This can result in spurious wakeups if CPU N calls get_online_cpus after
its refcnt has been checked but I could not think of a case where it
matters.

> +	if (!refcnt)
> +		wake_up_process(__cpuhp_writer);
>  }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
>  
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
>   */
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	__cpuhp_writer = current;
>  
>  	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> +		unsigned int refcnt = 0;
> +		int cpu;
> +
> +		/*
> +		 * Order the setting of writer against the reading of refcount;
> +		 * pairs with the full barrier in get_online_cpus().
> +		 */
> +
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +
> +		for_each_possible_cpu(cpu)
> +			refcnt += per_cpu(__cpuhp_refcount, cpu);
> +

CPU 0					CPU 1
get_online_cpus
refcnt++
					__cpuhp_writer = current
					refcnt > 0
					schedule
__get_online_cpus slowpath
refcnt--
wait_event(!__cpuhp_writer)

What wakes up __cpuhp_writer to recheck the refcnts and see that they're
all 0?

> +		if (!refcnt)
>  			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> +
>  		schedule();
>  	}
> +	__set_current_state(TASK_RUNNING);
>  }
>  
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	__cpuhp_writer = NULL;
> +	wake_up_all(&cpuhp_wq);
>  }
>  
>  /*

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-17 16:20       ` Mel Gorman
@ 2013-09-17 16:45         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-17 16:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Oleg Nesterov, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 17, 2013 at 05:20:50PM +0100, Mel Gorman wrote:
> > +extern struct task_struct *__cpuhp_writer;
> > +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	this_cpu_inc(__cpuhp_refcount);
> > +	/*
> > +	 * Order the refcount inc against the writer read; pairs with the full
> > +	 * barrier in cpu_hotplug_begin().
> > +	 */
> > +	smp_mb();
> > +	if (unlikely(__cpuhp_writer))
> > +		__get_online_cpus();
> > +}
> > +
> 
> If the problem with get_online_cpus() is the shared global state then a
> full barrier in the fast path is still going to hurt. Granted, it will hurt
> a lot less and there should be no lock contention.

I went for a lot less; I wasn't smart enough to get rid of it entirely.
Also, since it's a lock op we should at least provide an ACQUIRE barrier.

> However, what barrier in cpu_hotplug_begin is the comment referring to? 

set_current_state() implies a full barrier and nicely separates the
write to __cpuhp_writer from the read of __cpuhp_refcount.

> The
> other barrier is in the slowpath __get_online_cpus. Did you mean to do
> a rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?

No, since we're ordering LOADs and STORES (see below) we must use full
barriers.

> I'm assuming you are currently using a full barrier to guarantee that an
> update if cpuhp_writer will be visible so get_online_cpus blocks but I'm
> not 100% sure because of the comments.

I'm ordering:

  CPU0 -- get_online_cpus()	CPU1 -- cpu_hotplug_begin()

  STORE __cpuhp_refcount        STORE __cpuhp_writer

  MB				MB

  LOAD __cpuhp_writer		LOAD __cpuhp_refcount

Such that neither can miss the state of the other and we get proper
mutual exclusion.
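
If you want to poke at that pairing outside the kernel, a minimal
userspace model with C11 atomics looks like the below; obviously not the
kernel code, just the same store-buffering shape:

/* sb.c -- build with: cc -O2 -pthread sb.c -o sb */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int refcount;	/* stands in for __cpuhp_refcount */
static atomic_int writer;	/* stands in for __cpuhp_writer   */

static void *reader(void *arg)	/* get_online_cpus() fast path */
{
	(void)arg;
	atomic_fetch_add_explicit(&refcount, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the smp_mb() */
	if (atomic_load_explicit(&writer, memory_order_relaxed))
		printf("reader sees writer -> slow path\n");
	return NULL;
}

static void *hotplug(void *arg)	/* cpu_hotplug_begin() */
{
	(void)arg;
	atomic_store_explicit(&writer, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* set_current_state() */
	if (atomic_load_explicit(&refcount, memory_order_relaxed))
		printf("writer sees reader -> waits\n");
	return NULL;
}

int main(void)
{
	pthread_t r, w;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, hotplug, NULL);
	pthread_join(r, NULL);
	pthread_join(w, NULL);

	/*
	 * With both fences present at least one of the two messages must
	 * be printed on every run: the reader and the writer can never
	 * both miss each other, which is the mutual exclusion above.
	 */
	return 0;
}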

> > +extern void __put_online_cpus(void);
> > +
> > +static inline void put_online_cpus(void)
> > +{
> > +	barrier();
> 
> Why is this barrier necessary? 

To ensure the compiler keeps all loads/stores that were issued before the
read-unlock ordered before it.

Arguably it should be a complete RELEASE barrier. I should've put an XXX
comment here but the brain gave out completely for the day.

> I could not find anything that stated if an
> inline function is an implicit compiler barrier but whether it is or not,
> it's not clear why it's necessary at all.

It is not, only actual function calls are an implied sync point for the
compiler.
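
(For reference, barrier() is a pure compiler barrier, something along the
lines of

	#define barrier() __asm__ __volatile__("" : : : "memory")

so it constrains the compiler only, not the CPU.)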

> > +	this_cpu_dec(__cpuhp_refcount);
> > +	if (unlikely(__cpuhp_writer))
> > +		__put_online_cpus();
> > +}
> > +

> > +struct task_struct *__cpuhp_writer = NULL;
> > +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> > +
> > +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> >  
> > +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> > +
> > +void __get_online_cpus(void)
> >  {
> > +	if (__cpuhp_writer == current)
> >  		return;
> >  
> > +again:
> > +	/*
> > +	 * Ensure a pending reading has a 0 refcount.
> > +	 *
> > +	 * Without this a new reader that comes in before cpu_hotplug_begin()
> > +	 * reads the refcount will deadlock.
> > +	 */
> > +	this_cpu_dec(__cpuhp_refcount);
> > +	wait_event(cpuhp_wq, !__cpuhp_writer);
> > +
> > +	this_cpu_inc(__cpuhp_refcount);
> > +	/*
> > +	 * See get_online_cpu().
> > +	 */
> > +	smp_mb();
> > +	if (unlikely(__cpuhp_writer))
> > +		goto again;
> >  }
> 
> If CPU hotplug operations are very frequent (or a stupid stress test) then
> it's possible for a new hotplug operation to start (updating __cpuhp_writer)
> before a caller to __get_online_cpus can update the refcount. Potentially
> a caller to __get_online_cpus gets starved although as it only affects a
> CPU hotplug stress test it may not be a serious issue.

Right.. If that ever becomes a problem we should fix it, but aside from
stress tests hotplug should be extremely rare.

Initially I kept the reference over the wait_event() but realized (as
per the comment) that that would deadlock cpu_hotplug_begin() for it
would never observe !refcount.

One solution for this problem is having refcount as an array of 2 and
flipping the index at the appropriate times.
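
Roughly, as a hypothetical userspace sketch of that idea in the style of SRCU
(hp_read_lock, hp_read_unlock, hp_writer_wait and the variables are made-up
names, not a drop-in patch; the window between a reader loading the index and
taking its reference still needs the usual extra care, e.g. waiting twice):

#include <stdatomic.h>

static atomic_int idx;			/* slot used by new readers */
static atomic_int refcount[2];		/* stand-in for the per-cpu sums */

static int hp_read_lock(void)
{
	int i = atomic_load(&idx);

	atomic_fetch_add(&refcount[i], 1);
	return i;			/* caller remembers its slot */
}

static void hp_read_unlock(int i)
{
	atomic_fetch_sub(&refcount[i], 1);
}

static void hp_writer_wait(void)
{
	int old = atomic_load(&idx);

	/* New readers now take references in the other slot ... */
	atomic_store(&idx, old ^ 1);

	/* ... so the writer only waits for the old slot to drain. */
	while (atomic_load(&refcount[old]))
		;
}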

> > +EXPORT_SYMBOL_GPL(__get_online_cpus);
> >  
> > +void __put_online_cpus(void)
> >  {
> > +	unsigned int refcnt = 0;
> > +	int cpu;
> >  
> > +	if (__cpuhp_writer == current)
> > +		return;
> >  
> > +	for_each_possible_cpu(cpu)
> > +		refcnt += per_cpu(__cpuhp_refcount, cpu);
> >  
> 
> This can result in spurious wakeups if CPU N calls get_online_cpus after
> its refcnt has been checked but I could not think of a case where it
> matters.

Right and right.. too many wakeups aren't a correctness issue. One
should try and minimize them for performance reasons though :-)

> > +	if (!refcnt)
> > +		wake_up_process(__cpuhp_writer);
> >  }


> >  /*
> >   * This ensures that the hotplug operation can begin only when the
> >   * refcount goes to zero.
> >   *
> >   * Since cpu_hotplug_begin() is always called after invoking
> >   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> >   */
> >  void cpu_hotplug_begin(void)
> >  {
> > +	__cpuhp_writer = current;
> >  
> >  	for (;;) {
> > +		unsigned int refcnt = 0;
> > +		int cpu;
> > +
> > +		/*
> > +		 * Order the setting of writer against the reading of refcount;
> > +		 * pairs with the full barrier in get_online_cpus().
> > +		 */
> > +
> > +		set_current_state(TASK_UNINTERRUPTIBLE);
> > +
> > +		for_each_possible_cpu(cpu)
> > +			refcnt += per_cpu(__cpuhp_refcount, cpu);
> > +
> 
> CPU 0					CPU 1
> get_online_cpus
> refcnt++
> 					__cpuhp_writer = current
> 					refcnt > 0
> 					schedule
> __get_online_cpus slowpath
> refcnt--
> wait_event(!__cpuhp_writer)
> 
> What wakes up __cpuhp_writer to recheck the refcnts and see that they're
> all 0?

The wakeup in __put_online_cpus() you just commented on?
put_online_cpus() will drop into the slow path __put_online_cpus() if
there's a writer, compute the refcount and perform the wakeup when
!refcount.

> > +		if (!refcnt)
> >  			break;
> > +
> >  		schedule();
> >  	}
> > +	__set_current_state(TASK_RUNNING);
> >  }

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry
  2013-09-16 16:35     ` Peter Zijlstra
@ 2013-09-17 17:00       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-17 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Mon, Sep 16, 2013 at 06:35:47PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:57AM +0100, Mel Gorman wrote:
> > NUMA PTE scanning is expensive both in terms of the scanning itself and
> > the TLB flush if there are any updates. Currently non-present PTEs are
> > accounted for as an update and incur a TLB flush although one is only
> > necessary for anonymous migration entries. This patch addresses the
> > problem and should reduce TLB flushes.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/mprotect.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 1f9b54b..1e9cef0 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  				make_migration_entry_read(&entry);
> >  				set_pte_at(mm, addr, pte,
> >  					swp_entry_to_pte(entry));
> > +
> > +				pages++;
> >  			}
> > -			pages++;
> >  		}
> >  	} while (pte++, addr += PAGE_SIZE, addr != end);
> >  	arch_leave_lazy_mmu_mode();
> 
> Should we fold this into patch 7 ?

Looking closer at it, I think folding it into the patch would overload
the purpose of patch 7 a little too much but I shuffled the series to
keep the patches together.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-17 16:45         ` Peter Zijlstra
@ 2013-09-18 15:49           ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-18 15:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Oleg Nesterov, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

New version, now with excessive comments.

I found a deadlock (where both reader and writer would go to sleep);
identified below as case 1b.

The implementation without the patch is reader biased; this implementation,
as Mel pointed out, is writer biased. I should try and fix this but I'm
stepping away from the computer now as I have the feeling I'll only
wreck stuff from now on.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation uses global state, change it so the reader
side uses per-cpu state in the contended fast path.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h |   29 ++++++++-
 kernel/cpu.c        |  159 ++++++++++++++++++++++++++++++++++------------------
 2 files changed, 134 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -175,8 +176,32 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	this_cpu_inc(__cpuhp_refcount);
+	smp_mb(); /* see comment near __get_online_cpus() */
+	if (unlikely(__cpuhp_writer))
+		__get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	this_cpu_dec(__cpuhp_refcount);
+	smp_mb(); /* see comment near __get_online_cpus() */
+	if (unlikely(__cpuhp_writer))
+		__put_online_cpus();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,143 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+/*
+ * We must order things like:
+ *
+ *  CPU0 -- read-lock		CPU1 -- write-lock
+ *
+ *  STORE __cpuhp_refcount	STORE __cpuhp_writer
+ *  MB				MB
+ *  LOAD __cpuhp_writer		LOAD __cpuhp_refcount
+ *
+ *
+ * This gives rise to the following permutations:
+ *
+ * a) all of R happened before W
+ * b) R starts but sees the W store -- therefore W must see the R store
+ *    W starts but sees the R store -- therefore R must see the W store
+ * c) all of W happens before R
+ *
+ * 1) RL vs WL:
+ *
+ * 1a) RL proceeds; WL observes refcount and goes to wait for !refcount.
+ * 1b) RL drops into the slow path; WL waits for !refcount.
+ * 1c) WL proceeds; RL drops into the slow path.
+ *
+ * 2) RL vs WU:
+ *
+ * 2a) RL drops into the slow path; WU clears writer and wakes RL
+ * 2b) RL proceeds; WU continues to wake others
+ * 2c) RL proceeds.
+ *
+ * 3) RU vs WL:
+ *
+ * 3a) RU proceeds; WL proceeds.
+ * 3b) RU drops to slow path; WL proceeds
+ * 3c) WL waits for !refcount; RU drops to slow path
+ *
+ * 4) RU vs WU:
+ *
+ * Impossible since R and W state are mutually exclusive.
+ *
+ * This leaves us to consider the R slow paths:
+ *
+ * RL
+ *
+ * 1b) we must wake W
+ * 2a) nothing of importance
+ *
+ * RU
+ *
+ * 3b) nothing of importance
+ * 3c) we must wake W
+ *
+ */
 
-void get_online_cpus(void)
+void __get_online_cpus(void)
 {
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
+	if (__cpuhp_writer == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
 
+again:
+	/*
+	 * Case 1b; we must decrement our refcount again otherwise WL will
+	 * never observe !refcount and stay blocked forever. Not good since
+	 * we're going to sleep too. Someone must be awake and do something.
+	 *
+	 * Skip recomputing the refcount, just wake the pending writer and
+	 * have him check it -- writers are rare.
+	 */
+	this_cpu_dec(__cpuhp_refcount);
+	wake_up_process(__cpuhp_writer); /* implies MB */
+
+	wait_event(cpuhp_wq, !__cpuhp_writer);
+
+	/* Basically re-do the fast-path. Except we can never be the writer. */
+	this_cpu_inc(__cpuhp_refcount);
+	smp_mb();
+	if (unlikely(__cpuhp_writer))
+		goto again;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void put_online_cpus(void)
+void __put_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	unsigned int refcnt = 0;
+	int cpu;
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	if (__cpuhp_writer == current)
+		return;
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	/* 3c */
+	for_each_possible_cpu(cpu)
+		refcnt += per_cpu(__cpuhp_refcount, cpu);
 
+	if (!refcnt)
+		wake_up_process(__cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	__cpuhp_writer = current;
 
 	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
+		unsigned int refcnt = 0;
+		int cpu;
+
+		set_current_state(TASK_UNINTERRUPTIBLE); /* implies MB */
+
+		for_each_possible_cpu(cpu)
+			refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+		if (!refcnt)
 			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
+
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	__cpuhp_writer = NULL;
+	wake_up_all(&cpuhp_wq); /* implies MB */
 }
 
 /*


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-18 15:49           ` Peter Zijlstra
@ 2013-09-19 14:32             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-19 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Oleg Nesterov, Paul McKenney,
	Thomas Gleixner, Steven Rostedt




Meh, I should stop poking at this..

This one lost all the comments again :/

It uses preempt_disable/preempt_enable vs synchronize_sched() to remove
the barriers from the fast path.

After that it waits for !refcount before setting state, which stops new
readers.

I used a per-cpu spinlock to keep the state check and refcount inc
atomic vs the setting of state.

So the slow path is still per-cpu and mostly uncontended even in the
pending writer case.

After setting state it again waits for !refcount -- someone could have
sneaked in between the last !refcount and setting state. But this time
we know refcount will stay 0.

The only thing I don't really like is the unconditional writer wake in
the read-unlock slowpath, but I couldn't come up with anything better.
Here at least we guarantee that there is a wakeup after the last dec --
although there might be far too many wakes.
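
In hypothetical userspace terms, the reader slow path and the writer pair up
roughly like below (pthread mutexes stand in for the per-cpu spinlocks and a
plain flag for cpuhp_state; NCTX, ctx_lock, ctx_state, ctx_refcount and the
function names are made up, and the preempt_disable()/synchronize_sched()
fast-path part has no cheap userspace analogue so it is left out):

#include <pthread.h>
#include <stdatomic.h>

#define NCTX 4				/* stand-in for possible CPUs */

static pthread_mutex_t ctx_lock[NCTX];	/* per-cpu cpuhp_lock */
static int ctx_state[NCTX];		/* per-cpu cpuhp_state */
static atomic_int ctx_refcount[NCTX];	/* per-cpu __cpuhp_refcount */

static void hp_init(void)
{
	for (int i = 0; i < NCTX; i++)
		pthread_mutex_init(&ctx_lock[i], NULL);
}

/* Reader slow path: the state check and the refcount increment are atomic
 * vs the writer setting state, because both take the same per-cpu lock.
 */
static void read_lock_slow(int self)
{
	for (;;) {
		pthread_mutex_lock(&ctx_lock[self]);
		if (!ctx_state[self]) {
			atomic_fetch_add(&ctx_refcount[self], 1);
			pthread_mutex_unlock(&ctx_lock[self]);
			return;
		}
		pthread_mutex_unlock(&ctx_lock[self]);
		/* the real code sleeps on cpuhp_wq here */
	}
}

static void writer_wait_refcount(void)
{
	for (;;) {
		int sum = 0;

		for (int i = 0; i < NCTX; i++)
			sum += atomic_load(&ctx_refcount[i]);
		if (!sum)
			return;
		/* the real code sleeps and is woken by the last reader */
	}
}

static void writer_begin(void)
{
	writer_wait_refcount();			/* reader preference */

	for (int i = 0; i < NCTX; i++) {	/* stop new readers */
		pthread_mutex_lock(&ctx_lock[i]);
		ctx_state[i] = 1;
		pthread_mutex_unlock(&ctx_lock[i]);
	}

	writer_wait_refcount();			/* catch readers that sneaked
						 * in before state was set;
						 * no new ones can raise the
						 * count now */
}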

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 17 16:17:11 CEST 2013

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h |   32 ++++++++++-
 kernel/cpu.c        |  151 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 116 insertions(+), 67 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -175,8 +176,35 @@ extern struct bus_type cpu_subsys;
 
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer || __cpuhp_writer == current))
+		this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus();
+	preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	preempt_disable();
+	this_cpu_dec(__cpuhp_refcount);
+	if (unlikely(__cpuhp_writer && __cpuhp_writer != current))
+		__put_online_cpus();
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,109 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
-
-void get_online_cpus(void)
-{
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
-
-}
-EXPORT_SYMBOL_GPL(get_online_cpus);
-
-void put_online_cpus(void)
-{
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
-
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
 
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(int, cpuhp_state);
+static DEFINE_PER_CPU(spinlock_t, cpuhp_lock);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
+{
+	spin_lock(__this_cpu_ptr(&cpuhp_lock));
+	for (;;) {
+		if (!__this_cpu_read(cpuhp_state)) {
+			__this_cpu_inc(__cpuhp_refcount);
+			break;
+		}
+
+		spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+		preempt_enable();
+
+		wait_event(cpuhp_wq, !__cpuhp_writer);
+
+		preempt_disable();
+		spin_lock(__this_cpu_ptr(&cpuhp_lock));
+	}
+	spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+	wake_up_process(__cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
+
+static void cpuhp_wait_refcount(void)
+{
+	for (;;) {
+		unsigned int refcnt = 0;
+		int cpu;
+
+		set_current_state(TASK_UNINTERRUPTIBLE);
+
+		for_each_possible_cpu(cpu)
+			refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+		if (!refcnt)
+			break;
+
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+}
+
+static void cpuhp_set_state(int state)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		spinlock_t *lock = &per_cpu(cpuhp_lock, cpu);
+
+		spin_lock(lock);
+		per_cpu(cpuhp_state, cpu) = state;
+		spin_unlock(lock);
+	}
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	lockdep_assert_held(&cpu_add_remove_lock);
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
+	__cpuhp_writer = current;
+
+	/* After this everybody will observe _writer and take the slow path. */
+	synchronize_sched();
+
+	/* Wait for no readers -- reader preference */
+	cpuhp_wait_refcount();
+
+	/* Stop new readers. */
+	cpuhp_set_state(1);
+
+	/* Wait for no readers */
+	cpuhp_wait_refcount();
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	__cpuhp_writer = NULL;
+
+	/* Allow new readers */
+	cpuhp_set_state(0);
+
+	wake_up_all(&cpuhp_wq);
 }
 
 /*

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement
  2013-09-10  9:32   ` Mel Gorman
@ 2013-09-20  9:55     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-20  9:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Tue, Sep 10, 2013 at 10:32:26AM +0100, Mel Gorman wrote:
> Having multiple tasks in a group go through task_numa_placement
> simultaneously can lead to a task picking a wrong node to run on, because
> the group stats may be in the middle of an update. This patch avoids
> parallel updates by holding the numa_group lock during placement
> decisions.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
>  1 file changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3a92c58..4653f71 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
>  {
>  	int seq, nid, max_nid = -1, max_group_nid = -1;
>  	unsigned long max_faults = 0, max_group_faults = 0;
> +	spinlock_t *group_lock = NULL;
>  
>  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
>  	if (p->numa_scan_seq == seq)
> @@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
>  	p->numa_migrate_seq++;
>  	p->numa_scan_period_max = task_scan_max(p);
>  
> +	/* If the task is part of a group prevent parallel updates to group stats */
> +	if (p->numa_group) {
> +		group_lock = &p->numa_group->lock;
> +		spin_lock(group_lock);
> +	}
> +
>  	/* Find the node with the highest number of faults */
>  	for_each_online_node(nid) {
>  		unsigned long faults = 0, group_faults = 0;
> @@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
>  		}
>  	}
>  
> +	if (p->numa_group) {
> +		/*
> +		 * If the preferred task and group nids are different, 
> +		 * iterate over the nodes again to find the best place.
> +		 */
> +		if (max_nid != max_group_nid) {
> +			unsigned long weight, max_weight = 0;
> +
> +			for_each_online_node(nid) {
> +				weight = task_weight(p, nid) + group_weight(p, nid);
> +				if (weight > max_weight) {
> +					max_weight = weight;
> +					max_nid = nid;
> +				}
>  			}
>  		}
> +
> +		spin_unlock(group_lock);
>  	}
>  
>  	/* Preferred node as the node with the most faults */

If you're going to hold locks you can also do away with all that
atomic_long_*() nonsense :-)

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement
  2013-09-20  9:55     ` Peter Zijlstra
@ 2013-09-20 12:31       ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-20 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Sep 20, 2013 at 11:55:26AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:32:26AM +0100, Mel Gorman wrote:
> > Having multiple tasks in a group go through task_numa_placement
> > simultaneously can lead to a task picking a wrong node to run on, because
> > the group stats may be in the middle of an update. This patch avoids
> > parallel updates by holding the numa_group lock during placement
> > decisions.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
> >  1 file changed, 23 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3a92c58..4653f71 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
> >  {
> >  	int seq, nid, max_nid = -1, max_group_nid = -1;
> >  	unsigned long max_faults = 0, max_group_faults = 0;
> > +	spinlock_t *group_lock = NULL;
> >  
> >  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> >  	if (p->numa_scan_seq == seq)
> > @@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
> >  	p->numa_migrate_seq++;
> >  	p->numa_scan_period_max = task_scan_max(p);
> >  
> > +	/* If the task is part of a group prevent parallel updates to group stats */
> > +	if (p->numa_group) {
> > +		group_lock = &p->numa_group->lock;
> > +		spin_lock(group_lock);
> > +	}
> > +
> >  	/* Find the node with the highest number of faults */
> >  	for_each_online_node(nid) {
> >  		unsigned long faults = 0, group_faults = 0;
> > @@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
> >  		}
> >  	}
> >  
> > +	if (p->numa_group) {
> > +		/*
> > +		 * If the preferred task and group nids are different, 
> > +		 * iterate over the nodes again to find the best place.
> > +		 */
> > +		if (max_nid != max_group_nid) {
> > +			unsigned long weight, max_weight = 0;
> > +
> > +			for_each_online_node(nid) {
> > +				weight = task_weight(p, nid) + group_weight(p, nid);
> > +				if (weight > max_weight) {
> > +					max_weight = weight;
> > +					max_nid = nid;
> > +				}
> >  			}
> >  		}
> > +
> > +		spin_unlock(group_lock);
> >  	}
> >  
> >  	/* Preferred node as the node with the most faults */
> 
> If you're going to hold locks you can also do away with all that
> atomic_long_*() nonsense :-)

Yep! Easily done, the patch is untested but should be straightforward.

---8<---
sched: numa: use longs for numa group fault stats

As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
the updates. numa_group faults stats type are still "long" to add a basic
sanity check for fault counts going negative.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a2963..c09687d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,8 +897,8 @@ struct numa_group {
 	struct list_head task_list;
 
 	struct rcu_head rcu;
-	atomic_long_t total_faults;
-	atomic_long_t faults[0];
+	long total_faults;
+	long faults[0];
 };
 
 pid_t task_numa_group_id(struct task_struct *p)
@@ -925,8 +925,7 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
 	if (!p->numa_group)
 		return 0;
 
-	return atomic_long_read(&p->numa_group->faults[2*nid]) +
-	       atomic_long_read(&p->numa_group->faults[2*nid+1]);
+	return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
 }
 
 /*
@@ -952,17 +951,10 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 
 static inline unsigned long group_weight(struct task_struct *p, int nid)
 {
-	unsigned long total_faults;
-
-	if (!p->numa_group)
-		return 0;
-
-	total_faults = atomic_long_read(&p->numa_group->total_faults);
-
-	if (!total_faults)
+	if (!p->numa_group || !p->numa_group->total_faults)
 		return 0;
 
-	return 1200 * group_faults(p, nid) / total_faults;
+	return 1200 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
 static unsigned long weighted_cpuload(const int cpu);
@@ -1267,9 +1259,9 @@ static void task_numa_placement(struct task_struct *p)
 			p->total_numa_faults += diff;
 			if (p->numa_group) {
 				/* safe because we can only change our own group */
-				atomic_long_add(diff, &p->numa_group->faults[i]);
-				atomic_long_add(diff, &p->numa_group->total_faults);
-				group_faults += atomic_long_read(&p->numa_group->faults[i]);
+				p->numa_group->faults[i] += diff;
+				p->numa_group->total_faults += diff;
+				group_faults += p->numa_group->faults[i];
 			}
 		}
 
@@ -1343,7 +1335,7 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 
 	if (unlikely(!p->numa_group)) {
 		unsigned int size = sizeof(struct numa_group) +
-			            2*nr_node_ids*sizeof(atomic_long_t);
+			            2*nr_node_ids*sizeof(long);
 
 		grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!grp)
@@ -1355,9 +1347,9 @@ static void task_numa_group(struct task_struct *p, int cpupid)
 		grp->gid = p->pid;
 
 		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+			grp->faults[i] = p->numa_faults[i];
 
-		atomic_long_set(&grp->total_faults, p->total_numa_faults);
+		grp->total_faults = p->total_numa_faults;
 
 		list_add(&p->numa_entry, &grp->task_list);
 		grp->nr_tasks++;
@@ -1402,14 +1394,15 @@ unlock:
 	if (!join)
 		return;
 
+	double_lock(&my_grp->lock, &grp->lock);
+
 	for (i = 0; i < 2*nr_node_ids; i++) {
-		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
-		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
+		WARN_ON_ONCE(my_grp->faults[i] < 0);
 	}
-	atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
-	atomic_long_add(p->total_numa_faults, &grp->total_faults);
-
-	double_lock(&my_grp->lock, &grp->lock);
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
 
 	list_move(&p->numa_entry, &grp->task_list);
 	my_grp->nr_tasks--;
@@ -1430,12 +1423,13 @@ void task_numa_free(struct task_struct *p)
 	void *numa_faults = p->numa_faults;
 
 	if (grp) {
-		for (i = 0; i < 2*nr_node_ids; i++)
-			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
-
-		atomic_long_sub(p->total_numa_faults, &grp->total_faults);
-
 		spin_lock(&grp->lock);
+		for (i = 0; i < 2*nr_node_ids; i++) {
+			grp->faults[i] -= p->numa_faults[i];
+			WARN_ON_ONCE(grp->faults[i] < 0);
+		}
+		grp->total_faults -= p->total_numa_faults;
+
 		list_del(&p->numa_entry);
 		grp->nr_tasks--;
 		spin_unlock(&grp->lock);

^ permalink raw reply related	[flat|nested] 361+ messages in thread

* Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement
  2013-09-20 12:31       ` Mel Gorman
@ 2013-09-20 12:36         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-20 12:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Sep 20, 2013 at 01:31:52PM +0100, Mel Gorman wrote:
>  static inline unsigned long group_weight(struct task_struct *p, int nid)
>  {
> +	if (!p->numa_group || !p->numa_group->total_faults)
>  		return 0;
>  
> +	return 1200 * group_faults(p, nid) / p->numa_group->total_faults;
>  }

Unrelated to this change; I recently thought we might want to change
these weight factors based on whether the task is predominantly private or
shared.

For predominantly shared tasks we would use the bigger weight for the
group; for private tasks, the bigger weight for the task.
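
For illustration, a minimal sketch of that idea against the placement loop
quoted earlier; task_predominantly_private() is a made-up helper (it would
compare the task's private and shared fault counts) and the factor of two
is arbitrary:

	for_each_online_node(nid) {
		unsigned long weight;

		if (task_predominantly_private(p))	/* hypothetical helper */
			weight = 2 * task_weight(p, nid) + group_weight(p, nid);
		else
			weight = task_weight(p, nid) + 2 * group_weight(p, nid);

		if (weight > max_weight) {
			max_weight = weight;
			max_nid = nid;
		}
	}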

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement
  2013-09-20 12:31       ` Mel Gorman
@ 2013-09-20 13:31         ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-20 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML

On Fri, Sep 20, 2013 at 01:31:51PM +0100, Mel Gorman wrote:
> @@ -1402,14 +1394,15 @@ unlock:
>  	if (!join)
>  		return;
>  
> +	double_lock(&my_grp->lock, &grp->lock);
> +
>  	for (i = 0; i < 2*nr_node_ids; i++) {
> -		atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
> -		atomic_long_add(p->numa_faults[i], &grp->faults[i]);
> +		my_grp->faults[i] -= p->numa_faults[i];
> +		grp->faults[i] -= p->numa_faults[i];
> +		WARN_ON_ONCE(grp->faults[i] < 0);
>  	}

That stupidity got fixed -- when moving the task's faults into the new
group the second line should of course add, not subtract, i.e.
grp->faults[i] += p->numa_faults[i].

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-19 14:32             ` Peter Zijlstra
@ 2013-09-21 16:34               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-21 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

Sorry for the delay, I was sick...

On 09/19, Peter Zijlstra wrote:
>
> I used a per-cpu spinlock to keep the state check and refcount inc
> atomic vs the setting of state.

I think this could be simpler, see below.

> So the slow path is still per-cpu and mostly uncontended even in the
> pending writer case.

Is it really important? I mean, per-cpu/uncontended even if the writer
is pending?

Otherwise we could do

	static DEFINE_PER_CPU(long, cpuhp_fast_ctr);
	static struct task_struct *cpuhp_writer;
	static DEFINE_MUTEX(cpuhp_slow_lock)
	static long cpuhp_slow_ctr;

	static bool update_fast_ctr(int inc)
	{
		bool success = true;

		preempt_disable();
		if (likely(!cpuhp_writer))
			__get_cpu_var(cpuhp_fast_ctr) += inc;
		else if (cpuhp_writer != current)
			success = false;
		preempt_enable();

		return success;
	}

	void get_online_cpus(void)
	{
		if (likely(update_fast_ctr(+1)))
			return;

		mutex_lock(&cpuhp_slow_lock);
		cpuhp_slow_ctr++;
		mutex_unlock(&cpuhp_slow_lock);
	}

	void put_online_cpus(void)
	{
		if (likely(update_fast_ctr(-1)))
			return;

		mutex_lock(&cpuhp_slow_lock);
		if (!--cpuhp_slow_ctr && cpuhp_writer)
			wake_up_process(cpuhp_writer);
		mutex_unlock(&cpuhp_slow_lock);
	}

	static long clear_fast_ctr(void)
	{
		long total = 0;
		int cpu;

		for_each_possible_cpu(cpu) {
			total += per_cpu(cpuhp_fast_ctr, cpu);
			per_cpu(cpuhp_fast_ctr, cpu) = 0;
		}

		return total;
	}

	static void cpu_hotplug_begin(void)
	{
		cpuhp_writer = current;
		synchronize_sched();

		/* Nobody except us can use cpuhp_fast_ctr */

		mutex_lock(&cpuhp_slow_lock);
		cpuhp_slow_ctr += clear_fast_ctr();

		while (cpuhp_slow_ctr) {
			__set_current_state(TASK_UNINTERRUPTIBLE);
			mutex_unlock(&cpuhp_slow_lock);
			schedule();
			mutex_lock(&cpuhp_slow_lock);
		}
	}

	static void cpu_hotplug_done(void)
	{
		cpuhp_writer = NULL;
		mutex_unlock(&cpuhp_slow_lock);
	}

I already sent this code in 2010, it needs some trivial updates.

But. We already have percpu_rw_semaphore, can't we reuse it? In fact
I thought about this from the very beginning. We just need
percpu_down_write_recursive_readers() which does

	bool xxx(brw)
	{
		if (down_trylock(&brw->rw_sem))
			return false;
		if (!atomic_read(&brw->slow_read_ctr))
			return true;
		up_write(&brw->rw_sem);
		return false;
	}

	wait_event(brw->write_waitq, xxx(brw));

instead of down_write() + wait_event(!atomic_read(&brw->slow_read_ctr)).

The only problem is the lockdep annotations in percpu_down_read(), but
this looks simple, we just need down_read_no_lockdep() (like __up_read).

Note also that percpu_down_write/percpu_up_write can be improved wrt
synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
1st one can be avoided if another percpu_down_write() comes "soon after"
percpu_up_write().


As for the patch itself, I am not sure.

> +static void cpuph_wait_refcount(void)
> +{
> +	for (;;) {
> +		unsigned int refcnt = 0;
> +		int cpu;
> +
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +
> +		for_each_possible_cpu(cpu)
> +			refcnt += per_cpu(__cpuhp_refcount, cpu);
> +
> +		if (!refcnt)
> +			break;
> +
> +		schedule();
> +	}
> +	__set_current_state(TASK_RUNNING);
> +}

It seems, this can succeed while it should not, see below.

>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	lockdep_assert_held(&cpu_add_remove_lock);
>
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> -	}
> +	__cpuhp_writer = current;
> +
> +	/* After this everybody will observe _writer and take the slow path. */
> +	synchronize_sched();

Yes, the reader should see _writer, but:

> +	/* Wait for no readers -- reader preference */
> +	cpuhp_wait_refcount();

but how we can ensure the writer sees the results of the reader's updates?

Suppose that we have 2 CPU's, __cpuhp_refcount[0] = 0, __cpuhp_refcount[1] = 1.
IOW, we have a single R reader which takes this lock on CPU_1 and sleeps.

Now,

	- The writer calls cpuph_wait_refcount()

	- cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
	  refcnt == 0.

	- another reader comes on CPU_0, increments __cpuhp_refcount[0].

	- this reader migrates to CPU_1 and does put_online_cpus(),
	  this decrements __cpuhp_refcount[1] which becomes zero.

	- cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
	  which is zero. refcnt == 0, return.

	- The writer does cpuhp_set_state(1).

	- The reader R (original reader) wakes up, calls get_online_cpus()
	  recursively, and sleeps in wait_event(!__cpuhp_writer).

Btw, I think that  __sb_start_write/etc is equally wrong. Perhaps it is
another potential user of percpu_rw_sem.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-21 16:34               ` Oleg Nesterov
@ 2013-09-21 19:13                 ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-21 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On 09/21, Oleg Nesterov wrote:
>
> As for the patch itself, I am not sure.

Forgot to mention... and with this patch cpu_hotplug_done() loses the
"release" semantics, not sure this is fine.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-21 16:34               ` Oleg Nesterov
@ 2013-09-23  9:29                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > So the slow path is still per-cpu and mostly uncontended even in the
> > pending writer case.
> 
> Is it really important? I mean, per-cpu/uncontended even if the writer
> is pending?

I think so, once we make {get,put}_online_cpus() really cheap they'll
get in more and more places, and the global count with pending writer
will make things crawl on bigger machines.

> Otherwise we could do

<snip>

> I already sent this code in 2010, it needs some trivial updates.

Yeah, I found that a few days ago.. but per the above I didn't like the
pending writer case.

> But. We already have percpu_rw_semaphore,

Oh urgh, forgot about that one. /me goes read.

/me curses loudly.. that thing has an _expedited() call in it, those
should die.

Also, it suffers the same problem. I think esp. for hotplug we should be
100% geared towards readers and pretty much damn the writers.

I'd dread to think what would happen if a 4k cpu machine were to land in
the slow path on that global mutex. Readers would never go away and
progress would make a glacier seem fast.

> Note also that percpu_down_write/percpu_up_write can be improved wrt
> synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> 1nd one can be avoided if another percpu_down_write() comes "soon after"
> percpu_down_up().

Write side be damned ;-)

It is damned anyway with a pure read bias and a large machine..

> As for the patch itself, I am not sure.
> 
> > +static void cpuph_wait_refcount(void)
> 
> It seems, this can succeed while it should not, see below.
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > +	lockdep_assert_held(&cpu_add_remove_lock);
> >
> > +	__cpuhp_writer = current;
> > +
> > +	/* After this everybody will observe _writer and take the slow path. */
> > +	synchronize_sched();
> 
> Yes, the reader should see _writer, but:
> 
> > +	/* Wait for no readers -- reader preference */
> > +	cpuhp_wait_refcount();
> 
> but how we can ensure the writer sees the results of the reader's updates?
> 
> Suppose that we have 2 CPU's, __cpuhp_refcount[0] = 0, __cpuhp_refcount[1] = 1.
> IOW, we have a single R reader which takes this lock on CPU_1 and sleeps.
> 
> Now,
> 
> 	- The writer calls cpuph_wait_refcount()
> 
> 	- cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> 	  refcnt == 0.
> 
> 	- another reader comes on CPU_0, increments __cpuhp_refcount[0].
> 
> 	- this reader migrates to CPU_1 and does put_online_cpus(),
> 	  this decrements __cpuhp_refcount[1] which becomes zero.
> 
> 	- cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> 	  which is zero. refcnt == 0, return.
> 
> 	- The writer does cpuhp_set_state(1).
> 
> 	- The reader R (original reader) wakes up, calls get_online_cpus()
> 	  recursively, and sleeps in wait_event(!__cpuhp_writer).

Ah indeed.. 

The best I can come up with is something like:

static unsigned int cpuhp_refcount(void)
{
	unsigned int refcount = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		refcount += per_cpu(__cpuhp_refcount, cpu);

	return refcount;
}

static void cpuhp_wait_refcount(void)
{
	for (;;) {
		unsigned int rc1, rc2;

		rc1 = cpuhp_refcount();
		set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
		rc2 = cpuhp_refcount();

		if (rc1 == rc2 && !rc1)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);
}

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-19 14:32             ` Peter Zijlstra
@ 2013-09-23 14:50               ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-23 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Thu, 19 Sep 2013 16:32:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> +		this_cpu_inc(__cpuhp_refcount);
> +	else
> +		__get_online_cpus();
> +	preempt_enable();
> +}


This isn't much different than srcu_read_lock(). What about doing
something like this:

static inline void get_online_cpus(void)
{
	might_sleep();

	srcu_read_lock(&cpuhp_srcu);
	if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
		srcu_read_unlock(&cpuhp_srcu);
		__get_online_cpus();
		current->online_cpus_held++;
	}
}

static inline void put_online_cpus(void)
{
	if (unlikely(current->online_cpus_held)) {
		current->online_cpus_held--;
		__put_online_cpus();
		return;
	}

	srcu_read_unlock(&cpuhp_srcu);
}

Then have the writer simply do:

	__cpuhp_write = current;
	synchronize_srcu(&cpuhp_srcu);

	<grab the mutex here>

-- Steve

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 14:50               ` Steven Rostedt
@ 2013-09-23 14:54                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23 14:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> On Thu, 19 Sep 2013 16:32:41 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	preempt_disable();
> > +	if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > +		this_cpu_inc(__cpuhp_refcount);
> > +	else
> > +		__get_online_cpus();
> > +	preempt_enable();
> > +}
> 
> 
> This isn't much different than srcu_read_lock(). What about doing
> something like this:
> 
> static inline void get_online_cpus(void)
> {
> 	might_sleep();
> 
> 	srcu_read_lock(&cpuhp_srcu);
> 	if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
> 		srcu_read_unlock(&cpuhp_srcu);
> 		__get_online_cpus();
> 		current->online_cpus_held++;
> 	}
> }

There's a full memory barrier in srcu_read_lock(), while there was no
such thing in the previous fast path.

Also, why current->online_cpus_held()? That would make the write side
O(nr_tasks) instead of O(nr_cpus).

> static inline void put_online_cpus(void)
> {
> 	if (unlikely(current->online_cpus_held)) {
> 		current->online_cpus_held--;
> 		__put_online_cpus();
> 		return;
> 	}
> 
> 	srcu_read_unlock(&cpuhp_srcu);
> }

Also, you might not have noticed, but srcu_read_{,un}lock() have an
extra idx thing to pass about. That doesn't fit with the hotplug api.
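
For reference, the bare SRCU read side looks like this -- the idx returned
by srcu_read_lock() has to be carried to the matching srcu_read_unlock(),
which is what doesn't map onto the void get_online_cpus()/put_online_cpus()
interface:

	int idx;

	idx = srcu_read_lock(&cpuhp_srcu);
	/* read-side critical section */
	srcu_read_unlock(&cpuhp_srcu, idx);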

> 
> Then have the writer simply do:
> 
> 	__cpuhp_write = current;
> 	synchronize_srcu(&cpuhp_srcu);
> 
> 	<grab the mutex here>

How does that do reader preference?

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 14:54                 ` Peter Zijlstra
@ 2013-09-23 15:13                   ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-23 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Mon, 23 Sep 2013 16:54:46 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> > On Thu, 19 Sep 2013 16:32:41 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > 
> > > +extern void __get_online_cpus(void);
> > > +
> > > +static inline void get_online_cpus(void)
> > > +{
> > > +	might_sleep();
> > > +
> > > +	preempt_disable();
> > > +	if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > > +		this_cpu_inc(__cpuhp_refcount);
> > > +	else
> > > +		__get_online_cpus();
> > > +	preempt_enable();
> > > +}
> > 
> > 
> > This isn't much different than srcu_read_lock(). What about doing
> > something like this:
> > 
> > static inline void get_online_cpus(void)
> > {
> > 	might_sleep();
> > 
> > 	srcu_read_lock(&cpuhp_srcu);
> > 	if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
> > 		srcu_read_unlock(&cpuhp_srcu);
> > 		__get_online_cpus();
> > 		current->online_cpus_held++;
> > 	}
> > }
> 
> There's a full memory barrier in srcu_read_lock(), while there was no
> such thing in the previous fast path.

Yeah, I mentioned this to Paul, and we talked about making
srcu_read_lock() work with no mb's. But currently, doesn't
get_online_cpus() just take a mutex? What's wrong with a mb() as it
still kicks ass over what is currently there today?

> 
> Also, why current->online_cpus_held()? That would make the write side
> O(nr_tasks) instead of O(nr_cpus).

?? I'm not sure I understand this. The online_cpus_held++ was there for
recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
many times as we did a __get_online_cpus(). I don't know where the
O(nr_tasks) comes from. The ref here was just to account for doing the
old "get_online_cpus" instead of a srcu_read_lock().

> 
> > static inline void put_online_cpus(void)
> > {
> > 	if (unlikely(current->online_cpus_held)) {
> > 		current->online_cpus_held--;
> > 		__put_online_cpus();
> > 		return;
> > 	}
> > 
> > 	srcu_read_unlock(&cpuhp_srcu);
> > }
> 
> Also, you might not have noticed but, srcu_read_{,un}lock() have an
> extra idx thing to pass about. That doesn't fit with the hotplug api.

I'll have to look at that, as I'm not exactly sure about the idx thing.

> 
> > 
> > Then have the writer simply do:
> > 
> > 	__cpuhp_write = current;
> > 	synchronize_srcu(&cpuhp_srcu);
> > 
> > 	<grab the mutex here>
> 
> How does that do reader preference?

Well, what I was trying to do was to let readers go very fast
(well, with a mb instead of a mutex), and then when the CPU hotplug
happens, it goes back to the current method.

That is, once we set __cpuhp_write, and then run synchronize_srcu(),
the system will be in a state that does what it does today (grabbing
mutexes, and upping refcounts).

I thought the whole point was to speed up the get_online_cpus() when no
hotplug is happening. This does that, and is rather simple. It only
gets slow when hotplug is in effect.

-- Steve



^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 15:13                   ` Steven Rostedt
@ 2013-09-23 15:22                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23 15:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.

Well, for that the thing Oleg proposed works just fine and the
preempt_disable() section vs synchronize_sched() is hardly magic.

But I'd really like to get the writer pending case fast too.

> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).

Still no point in using srcu for this; preempt_disable +
synchronize_sched() is similar and much faster -- its the rcu_sched
equivalent of what you propose.
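
That is, roughly, reusing the __cpuhp_writer variable from the patch quoted
earlier in the thread:

	/* reader */
	preempt_disable();
	/* fast path: check __cpuhp_writer, bump the per-cpu refcount */
	preempt_enable();

	/* writer */
	__cpuhp_writer = current;
	synchronize_sched();	/* instead of synchronize_srcu(&cpuhp_srcu) */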

> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

No, well, it also gets slow when a hotplug is pending, which can be
quite a while if we go sprinkle get_online_cpus() all over the place and
the machine is busy.

Once we start a hotplug attempt we must wait for all readers to quiesce
-- since the lock is full reader preference this can take an infinite
amount of time -- while we're waiting for this all 4k+ CPUs will be
bouncing the one mutex around on every get_online_cpus(); of which we'll
have many since that's the entire point of making them cheap, to use
more of them.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 15:13                   ` Steven Rostedt
@ 2013-09-23 15:50                     ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-23 15:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Oleg Nesterov, Thomas Gleixner

On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 16:54:46 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:

[ . . . ]

> ?? I'm not sure I understand this. The online_cpus_held++ was there for
> recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
> once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
> many times as we did a __get_online_cpus(). I don't know where the
> O(nr_tasks) comes from. The ref here was just to account for doing the
> old "get_online_cpus" instead of a srcu_read_lock().
> 
> > 
> > > static inline void put_online_cpus(void)
> > > {
> > > 	if (unlikely(current->online_cpus_held)) {
> > > 		current->online_cpus_held--;
> > > 		__put_online_cpus();
> > > 		return;
> > > 	}
> > > 
> > > 	srcu_read_unlock(&cpuhp_srcu);
> > > }
> > 
> > Also, you might not have noticed but, srcu_read_{,un}lock() have an
> > extra idx thing to pass about. That doesn't fit with the hotplug api.
> 
> I'll have to look a that, as I'm not exactly sure about the idx thing.

Not a problem, just stuff the idx into some per-task thing.  Either
task_struct or thread_info will work fine.
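
A minimal sketch of that, assuming a made-up cpuhp_srcu_idx field in
task_struct and ignoring the nesting/slow-path handling discussed above:

	static inline void get_online_cpus(void)
	{
		might_sleep();
		current->cpuhp_srcu_idx = srcu_read_lock(&cpuhp_srcu);
	}

	static inline void put_online_cpus(void)
	{
		srcu_read_unlock(&cpuhp_srcu, current->cpuhp_srcu_idx);
	}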

> > > 
> > > Then have the writer simply do:
> > > 
> > > 	__cpuhp_write = current;
> > > 	synchronize_srcu(&cpuhp_srcu);
> > > 
> > > 	<grab the mutex here>
> > 
> > How does that do reader preference?
> 
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.
> 
> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).
> 
> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

Or to put it another way, if the underlying slow-path mutex is
reader-preference, then the whole thing will be reader-preference.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 15:22                     ` Peter Zijlstra
@ 2013-09-23 15:59                       ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-23 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Mon, 23 Sep 2013 17:22:23 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Still no point in using srcu for this; preempt_disable +
> synchronize_sched() is similar and much faster -- its the rcu_sched
> equivalent of what you propose.

To be honest, I sent this out last week, but it somehow got trashed
somewhere between my laptop and my smtp server, back when the last
version of your patch still had the memory barrier ;-)

So yeah, a true synchronize_sched() is better.

-- Steve

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 15:50                     ` Paul E. McKenney
@ 2013-09-23 16:01                       ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23 16:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Oleg Nesterov, Thomas Gleixner

On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> Not a problem, just stuff the idx into some per-task thing.  Either
> task_struct or taskinfo will work fine.

Still not seeing the point of using srcu though..

srcu_read_lock() vs synchronize_srcu() buys the same thing here, but is
far more expensive than preempt_disable() vs synchronize_sched().

> Or to put it another way, if the underlying slow-path mutex is
> reader-preference, then the whole thing will be reader-preference.

Right, so 1) we have no such mutex so we're going to have to open-code
that anyway, and 2) like I just explained in the other email, I want the
pending writer case to be _fast_ as well.
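
The rcu_sched pattern being referred to is roughly the following sketch;
writer_pending, reader_count and reader_slowpath() are placeholder names,
and this is essentially the shape the patch posted later in the thread
takes:

static int writer_pending;
static DEFINE_PER_CPU(unsigned int, reader_count);
static void reader_slowpath(void);	/* assumed: drops preemption before sleeping */

static inline void reader_lock(void)
{
	preempt_disable();
	if (likely(!ACCESS_ONCE(writer_pending)))
		__this_cpu_inc(reader_count);	/* per-cpu, no shared cacheline */
	else
		reader_slowpath();		/* writer pending: fall back */
	preempt_enable();
}

static void writer_begin(void)
{
	ACCESS_ONCE(writer_pending) = 1;
	/* Any reader that did not see writer_pending has finished after this. */
	synchronize_sched();
	/* ... now collapse the per-cpu counts and wait for them to drain ... */
}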

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 15:59                       ` Steven Rostedt
@ 2013-09-23 16:02                         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23 16:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Oleg Nesterov,
	Paul McKenney, Thomas Gleixner

On Mon, Sep 23, 2013 at 11:59:08AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 17:22:23 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Still no point in using srcu for this; preempt_disable +
> > synchronize_sched() is similar and much faster -- its the rcu_sched
> > equivalent of what you propose.
> 
> To be honest, I sent this out last week and it somehow got trashed by
> my laptop and connecting to my smtp server. Where the last version of
> your patch still had the memory barrier ;-)

Ah, ok, yes in that case things start to make sense again.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 16:01                       ` Peter Zijlstra
@ 2013-09-23 17:04                         ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-23 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Oleg Nesterov, Thomas Gleixner

On Mon, Sep 23, 2013 at 06:01:30PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> > Not a problem, just stuff the idx into some per-task thing.  Either
> > task_struct or taskinfo will work fine.
> 
> Still not seeing the point of using srcu though..
> 
> srcu_read_lock() vs synchronize_srcu() is the same but far more
> expensive than preempt_disable() vs synchronize_sched().

Heh!  You want the old-style SRCU.  ;-)

> > Or to put it another way, if the underlying slow-path mutex is
> > reader-preference, then the whole thing will be reader-preference.
> 
> Right, so 1) we have no such mutex so we're going to have to open-code
> that anyway, and 2) like I just explained in the other email, I want the
> pending writer case to be _fast_ as well.

At some point I suspect that we will want some form of fairness, but in
the meantime, good point.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 17:04                         ` Paul E. McKenney
@ 2013-09-23 17:30                           ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-23 17:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Oleg Nesterov, Thomas Gleixner

On Mon, Sep 23, 2013 at 10:04:00AM -0700, Paul E. McKenney wrote:
> At some point I suspect that we will want some form of fairness, but in
> the meantime, good point.

I figured we could start a timer on hotplug to force quiesce the readers
after about 10 minutes or so ;-)

Should be a proper discouragement from (ab)using this hotplug stuff...

Muwhahaha

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23  9:29                 ` Peter Zijlstra
@ 2013-09-23 17:32                   ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-23 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On 09/23, Peter Zijlstra wrote:
>
> On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > > So the slow path is still per-cpu and mostly uncontended even in the
> > > pending writer case.
> >
> > Is it really important? I mean, per-cpu/uncontended even if the writer
> > is pending?
>
> I think so, once we make {get,put}_online_cpus() really cheap they'll
> get in more and more places, and the global count with pending writer
> will make things crawl on bigger machines.

Hmm. But the writers should be rare.

> > But. We already have percpu_rw_semaphore,
>
> Oh urgh, forgot about that one. /me goes read.
>
> /me curses loudly.. that thing has an _expedited() call in it, those
> should die.

Probably yes, the original reason for _expedited() has gone away.

> I'd dread to think what would happen if a 4k cpu machine were to land in
> the slow path on that global mutex. Readers would never go-away and
> progress would make a glacier seem fast.

Another problem is that write-lock can never succeed unless it
prevents the new readers, but this needs the per-task counter.

> > Note also that percpu_down_write/percpu_up_write can be improved wrt
> > synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> > 1nd one can be avoided if another percpu_down_write() comes "soon after"
> > percpu_down_up().
>
> Write side be damned ;-)

Suppose that a 4k cpu machine does disable_nonboot_cpus(), every
_cpu_down() does synchronize_sched()... OK, perhaps the locking can be
changed so that cpu_hotplug_begin/end is called only once in this case.

> > 	- The writer calls cpuph_wait_refcount()
> >
> > 	- cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> > 	  refcnt == 0.
> >
> > 	- another reader comes on CPU_0, increments __cpuhp_refcount[0].
> >
> > 	- this reader migrates to CPU_1 and does put_online_cpus(),
> > 	  this decrements __cpuhp_refcount[1] which becomes zero.
> >
> > 	- cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> > 	  which is zero. refcnt == 0, return.
>
> Ah indeed..
>
> The best I can come up with is something like:
>
> static unsigned int cpuhp_refcount(void)
> {
> 	unsigned int refcount = 0;
> 	int cpu;
>
> 	for_each_possible_cpu(cpu)
> 		refcount += per_cpu(__cpuhp_refcount, cpu);
> }
>
> static void cpuhp_wait_refcount(void)
> {
> 	for (;;) {
> 		unsigned int rc1, rc2;
>
> 		rc1 = cpuhp_refcount();
> 		set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
> 		rc2 = cpuhp_refcount();
>
> 		if (rc1 == rc2 && !rc1)

But this only makes the race above "theoretical ** 2"; both
cpuhp_refcount() calls can be equally fooled.

It looks like cpuhp_refcount() should take all per-cpu cpuhp_lock's
before it reads __cpuhp_refcount.
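
Concretely, that suggestion might look like the sketch below; the
per-cpu cpuhp_lock spinlock is assumed from the earlier revision of the
patch under review, and the return statement missing from the quoted
cpuhp_refcount() is added:

static unsigned int cpuhp_refcount(void)
{
	unsigned int refcount = 0;
	int cpu;

	/*
	 * Holding every per-cpu lock gives a stable snapshot: a reader
	 * that incremented on one CPU cannot concurrently decrement on
	 * another while we sum.
	 */
	for_each_possible_cpu(cpu)
		spin_lock(&per_cpu(cpuhp_lock, cpu));

	for_each_possible_cpu(cpu)
		refcount += per_cpu(__cpuhp_refcount, cpu);

	for_each_possible_cpu(cpu)
		spin_unlock(&per_cpu(cpuhp_lock, cpu));

	return refcount;
}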

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-19 14:32             ` Peter Zijlstra
@ 2013-09-23 17:50               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-23 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

And somehow I didn't notice that cpuhp_set_state() doesn't look right,

On 09/19, Peter Zijlstra wrote:
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	lockdep_assert_held(&cpu_add_remove_lock);
>  
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> -	}
> +	__cpuhp_writer = current;
> +
> +	/* After this everybody will observe _writer and take the slow path. */
> +	synchronize_sched();
> +
> +	/* Wait for no readers -- reader preference */
> +	cpuhp_wait_refcount();
> +
> +	/* Stop new readers. */
> +	cpuhp_set_state(1);

But this stops all readers, not only new ones. Even if cpuhp_wait_refcount()
were correct, a new reader can come in right before cpuhp_set_state(1) and
then call another recursive get_online_cpus() right after.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 17:50               ` Oleg Nesterov
@ 2013-09-24 12:38                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 12:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt


OK, so another attempt.

This one is actually fair in that it immediately forces a reader
quiescent state by explicitly implementing reader-reader recursion.

This does away with the potentially long pending writer case and can
thus use the simpler global state.

I don't really like this lock being fair, but alas.

Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
cpu_hotplug_done(). I think it's ok, but I keep confusing myself.

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	if (current->cpuhp_ref++) {
+		barrier();
+		return;
+	}
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus();
+	preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	barrier();
+	if (--current->cpuhp_ref)
+		return;
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus();
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	int		cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+static struct task_struct *cpuhp_writer_task = NULL;
 
-void get_online_cpus(void)
-{
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
 
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static atomic_t cpuhp_slowcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
 
-void put_online_cpus(void)
+#define cpuhp_writer_wake()						\
+	wake_up_process(cpuhp_writer_task)
+
+#define cpuhp_writer_wait(cond)						\
+do {									\
+	for (;;) {							\
+		set_current_state(TASK_UNINTERRUPTIBLE);		\
+		if (cond)						\
+			break;						\
+		schedule();						\
+	}								\
+	__set_current_state(TASK_RUNNING);				\
+} while (0)
+
+void __get_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
+	if (cpuhp_writer_task == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	atomic_inc(&cpuhp_waitcount);
+
+	/*
+	 * We either call schedule() in the wait, or we'll fall through
+	 * and reschedule on the preempt_enable() in get_online_cpus().
+	 */
+	preempt_enable_no_resched();
+	wait_event(cpuhp_wq, !__cpuhp_writer);
+	preempt_disable();
+
+	/*
+	 * It would be possible for cpu_hotplug_done() to complete before
+	 * the atomic_inc() above; in which case there is no writer waiting
+	 * and doing a wakeup would be BAD (tm).
+	 *
+	 * If however we still observe cpuhp_writer_task here we know
+	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
+	 */
+	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
+		cpuhp_writer_wake();
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+void __put_online_cpus(void)
+{
+	if (cpuhp_writer_task == current)
+		return;
 
+	if (atomic_dec_and_test(&cpuhp_slowcount))
+		cpuhp_writer_wake();
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	unsigned int count = 0;
+	int cpu;
+
+	lockdep_assert_held(&cpu_add_remove_lock);
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
+	__cpuhp_writer = 1;
+	cpuhp_writer_task = current;
+
+	/* After this everybody will observe writer and take the slow path. */
+	synchronize_sched();
+
+	/* Collapse the per-cpu refcount into slowcount */
+	for_each_possible_cpu(cpu) {
+		count += per_cpu(__cpuhp_refcount, cpu);
+		per_cpu(__cpuhp_refcount, cpu) = 0;
 	}
+	atomic_add(count, &cpuhp_slowcount);
+
+	/* Wait for all readers to go away */
+	cpuhp_writer_wait(!atomic_read(&cpuhp_slowcount));
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* Signal the writer is done */
+	__cpuhp_writer = 0;
+	wake_up_all(&cpuhp_wq);
+
+	/* Wait for any pending readers to be running */
+	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
+	cpuhp_writer_task = NULL;
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 12:38                 ` Peter Zijlstra
@ 2013-09-24 14:42                   ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-24 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 02:38:21PM +0200, Peter Zijlstra wrote:
> 
> OK, so another attempt.
> 
> This one is actually fair in that it immediately forces a reader
> quiescent state by explicitly implementing reader-reader recursion.
> 
> This does away with the potentially long pending writer case and can
> thus use the simpler global state.
> 
> I don't really like this lock being fair, but alas.
> 
> Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
> cpu_hotplug_done(). I think it's ok, but I keep confusing myself.

Cute!

Some commentary below.  Also one question about how a race leading to
a NULL-pointer dereference is avoided.

							Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
> 
>  struct device;
> 
> @@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	if (current->cpuhp_ref++) {
> +		barrier();
> +		return;
> +	}
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_inc(__cpuhp_refcount);
> +	else
> +		__get_online_cpus();
> +	preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	barrier();
> +	if (--current->cpuhp_ref)
> +		return;
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_dec(__cpuhp_refcount);
> +	else
> +		__put_online_cpus();
> +	preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> @@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else		/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int		cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> -static struct {
> -	struct task_struct *active_writer;
> -	struct mutex lock; /* Synchronizes accesses to refcount, */
> -	/*
> -	 * Also blocks the new readers during
> -	 * an ongoing cpu hotplug operation.
> -	 */
> -	int refcount;
> -} cpu_hotplug = {
> -	.active_writer = NULL,
> -	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> -	.refcount = 0,
> -};
> +static struct task_struct *cpuhp_writer_task = NULL;
> 
> -void get_online_cpus(void)
> -{
> -	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> -	cpu_hotplug.refcount++;
> -	mutex_unlock(&cpu_hotplug.lock);
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> 
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	p->cpuhp_ref = 0;
>  }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
> 
> -void put_online_cpus(void)
> +#define cpuhp_writer_wake()						\
> +	wake_up_process(cpuhp_writer_task)
> +
> +#define cpuhp_writer_wait(cond)						\
> +do {									\
> +	for (;;) {							\
> +		set_current_state(TASK_UNINTERRUPTIBLE);		\
> +		if (cond)						\
> +			break;						\
> +		schedule();						\
> +	}								\
> +	__set_current_state(TASK_RUNNING);				\
> +} while (0)

Why not wait_event()?  Presumably the above is a bit lighter weight,
but is that even something that can be measured?
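
For reference, the wait_event() variant would look roughly like this
sketch; cpuhp_writer_wq is an assumed extra waitqueue, and the
reader-side cpuhp_writer_wake() would become a wake_up() on it:

/* In cpu_hotplug_begin(), instead of cpuhp_writer_wait(...): */
wait_event(cpuhp_writer_wq, !atomic_read(&cpuhp_slowcount));

/* In cpu_hotplug_done(): */
wait_event(cpuhp_writer_wq, !atomic_read(&cpuhp_waitcount));

/* And cpuhp_writer_wake() becomes simply: */
wake_up(&cpuhp_writer_wq);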

> +void __get_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> +	if (cpuhp_writer_task == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
> 
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	atomic_inc(&cpuhp_waitcount);
> +
> +	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
> +	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_wq, !__cpuhp_writer);

Finally!  A good use for preempt_enable_no_resched().  ;-)

> +	preempt_disable();
> +
> +	/*
> +	 * It would be possible for cpu_hotplug_done() to complete before
> +	 * the atomic_inc() above; in which case there is no writer waiting
> +	 * and doing a wakeup would be BAD (tm).
> +	 *
> +	 * If however we still observe cpuhp_writer_task here we know
> +	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +	 */
> +	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)

OK, I'll bite...  What sequence of events results in the
atomic_dec_and_test() returning true but there being no
cpuhp_writer_task?

Ah, I see it...

o	Task A becomes the writer.

o	Task B tries to read, but stalls for whatever reason before
	the atomic_inc().

o	Task A completes its write-side operation.  It sees no readers
	blocked, so goes on its merry way.

o	Task B does its atomic_inc(), does its read, then sees
	atomic_dec_and_test() return true (the count is now zero), but
	cpuhp_writer_task is NULL, so it doesn't do the wakeup.

But what prevents the following sequence of events?

o	Task A becomes the writer.

o	Task B tries to read, but stalls for whatever reason before
	the atomic_inc().

o	Task A completes its write-side operation.  It sees no readers
	blocked, so goes on its merry way, but is delayed before it
	NULLs cpuhp_writer_task.

o	Task B does its atomic_inc(), does its read, then sees
	atomic_dec_and_test() return true.  However, it sees
	cpuhp_writer_task as non-NULL.

o	Then Task A NULLs cpuhp_writer_task.

o	Task B's call to cpuhp_writer_wake() sees a NULL pointer.

> +		cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> 
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +void __put_online_cpus(void)
> +{
> +	if (cpuhp_writer_task == current)
> +		return;
> 
> +	if (atomic_dec_and_test(&cpuhp_slowcount))
> +		cpuhp_writer_wake();
>  }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> 
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
>   */
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
> +
> +	lockdep_assert_held(&cpu_add_remove_lock);
> 
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> +	__cpuhp_writer = 1;
> +	cpuhp_writer_task = current;

At this point, the value of cpuhp_slowcount can go negative.  Can't see
that this causes a problem, given the atomic_add() below.

> +
> +	/* After this everybody will observe writer and take the slow path. */
> +	synchronize_sched();
> +
> +	/* Collapse the per-cpu refcount into slowcount */
> +	for_each_possible_cpu(cpu) {
> +		count += per_cpu(__cpuhp_refcount, cpu);
> +		per_cpu(__cpuhp_refcount, cpu) = 0;
>  	}

The above is safe because the readers are no longer changing their
__cpuhp_refcount values.

> +	atomic_add(count, &cpuhp_slowcount);
> +
> +	/* Wait for all readers to go away */
> +	cpuhp_writer_wait(!atomic_read(&cpuhp_slowcount));
>  }
> 
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done */
> +	__cpuhp_writer = 0;
> +	wake_up_all(&cpuhp_wq);
> +
> +	/* Wait for any pending readers to be running */
> +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> +	cpuhp_writer_task = NULL;
>  }
> 
>  /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-09-24 14:42                   ` Paul E. McKenney
  0 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-24 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 02:38:21PM +0200, Peter Zijlstra wrote:
> 
> OK, so another attempt.
> 
> This one is actually fair in that it immediately forces a reader
> quiescent state by explicitly implementing reader-reader recursion.
> 
> This does away with the potentially long pending writer case and can
> thus use the simpler global state.
> 
> I don't really like this lock being fair, but alas.
> 
> Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
> cpu_hotplug_done(). I think its ok, but I keep confusing myself.

Cute!

Some commentary below.  Also one question about how a race leading to
a NULL-pointer dereference is avoided.

							Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
> 
>  struct device;
> 
> @@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	if (current->cpuhp_ref++) {
> +		barrier();
> +		return;
> +	}
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_inc(__cpuhp_refcount);
> +	else
> +		__get_online_cpus();
> +	preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	barrier();
> +	if (--current->cpuhp_ref)
> +		return;
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_dec(__cpuhp_refcount);
> +	else
> +		__put_online_cpus();
> +	preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> @@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else		/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int		cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> -static struct {
> -	struct task_struct *active_writer;
> -	struct mutex lock; /* Synchronizes accesses to refcount, */
> -	/*
> -	 * Also blocks the new readers during
> -	 * an ongoing cpu hotplug operation.
> -	 */
> -	int refcount;
> -} cpu_hotplug = {
> -	.active_writer = NULL,
> -	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> -	.refcount = 0,
> -};
> +static struct task_struct *cpuhp_writer_task = NULL;
> 
> -void get_online_cpus(void)
> -{
> -	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> -	cpu_hotplug.refcount++;
> -	mutex_unlock(&cpu_hotplug.lock);
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> 
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	p->cpuhp_ref = 0;
>  }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
> 
> -void put_online_cpus(void)
> +#define cpuhp_writer_wake()						\
> +	wake_up_process(cpuhp_writer_task)
> +
> +#define cpuhp_writer_wait(cond)						\
> +do {									\
> +	for (;;) {							\
> +		set_current_state(TASK_UNINTERRUPTIBLE);		\
> +		if (cond)						\
> +			break;						\
> +		schedule();						\
> +	}								\
> +	__set_current_state(TASK_RUNNING);				\
> +} while (0)

Why not wait_event()?  Presumably the above is a bit lighter weight,
but is that even something that can be measured?

> +void __get_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> +	if (cpuhp_writer_task == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
> 
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	atomic_inc(&cpuhp_waitcount);
> +
> +	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
> +	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_wq, !__cpuhp_writer);

Finally!  A good use for preempt_enable_no_resched().  ;-)

> +	preempt_disable();
> +
> +	/*
> +	 * It would be possible for cpu_hotplug_done() to complete before
> +	 * the atomic_inc() above; in which case there is no writer waiting
> +	 * and doing a wakeup would be BAD (tm).
> +	 *
> +	 * If however we still observe cpuhp_writer_task here we know
> +	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +	 */
> +	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)

OK, I'll bite...  What sequence of events results in the
atomic_dec_and_test() returning true but there being no
cpuhp_writer_task?

Ah, I see it...

o	Task A becomes the writer.

o	Task B tries to read, but stalls for whatever reason before
	the atomic_inc().

o	Task A completes its write-side operation.  It sees no readers
	blocked, so goes on its merry way.

o	Task B does its atomic_inc(), does its read, then sees
	atomic_dec_and_test() return zero, but cpuhp_writer_task
	is NULL, so it doesn't do the wakeup.

But what prevents the following sequence of events?

o	Task A becomes the writer.

o	Task B tries to read, but stalls for whatever reason before
	the atomic_inc().

o	Task A completes its write-side operation.  It sees no readers
	blocked, so goes on its merry way, but is delayed before it
	NULLs cpuhp_writer_task.

o	Task B does its atomic_inc(), does its read, then sees
	atomic_dec_and_test() return zero.  However, it sees
	cpuhp_writer_task as non-NULL.

o	Then Task A NULLs cpuhp_writer_task.

o	Task B's call to cpuhp_writer_wake() sees a NULL pointer.

> +		cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> 
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +void __put_online_cpus(void)
> +{
> +	if (cpuhp_writer_task == current)
> +		return;
> 
> +	if (atomic_dec_and_test(&cpuhp_slowcount))
> +		cpuhp_writer_wake();
>  }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> 
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
>   */
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
> +
> +	lockdep_assert_held(&cpu_add_remove_lock);
> 
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> +	__cpuhp_writer = 1;
> +	cpuhp_writer_task = current;

At this point, the value of cpuhp_slowcount can go negative.  Can't see
that this causes a problem, given the atomic_add() below.

> +
> +	/* After this everybody will observe writer and take the slow path. */
> +	synchronize_sched();
> +
> +	/* Collapse the per-cpu refcount into slowcount */
> +	for_each_possible_cpu(cpu) {
> +		count += per_cpu(__cpuhp_refcount, cpu);
> +		per_cpu(__cpuhp_refcount, cpu) = 0;
>  	}

The above is safe because the readers are no longer changing their
__cpuhp_refcount values.

> +	atomic_add(count, &cpuhp_slowcount);
> +
> +	/* Wait for all readers to go away */
> +	cpuhp_writer_wait(!atomic_read(&cpuhp_slowcount));
>  }
> 
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done */
> +	cpuhp_writer = 0;
> +	wake_up_all(&cpuhp_wq);
> +
> +	/* Wait for any pending readers to be running */
> +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> +	cpuhp_writer_task = NULL;
>  }
> 
>  /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 12:38                 ` Peter Zijlstra
@ 2013-09-24 16:03                   ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-24 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On 09/24, Peter Zijlstra wrote:
>
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	if (current->cpuhp_ref++) {
> +		barrier();
> +		return;

I don't understand this barrier()... we are going to return if we already
hold the lock, do we really need it?

The same for put_online_cpus().

> +void __get_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> +	if (cpuhp_writer_task == current)
>  		return;

Probably it would be better to simply inc/dec ->cpuhp_ref in
cpu_hotplug_begin/end and remove this check here and in
__put_online_cpus().

This also means that the writer doing get/put_online_cpus() will
always use the fast path, and __cpuhp_writer can go away,
cpuhp_writer_task != NULL can be used instead.
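
IOW, something like this (only a sketch on top of your patch):

	void cpu_hotplug_begin(void)
	{
		cpuhp_writer_task = current;
		current->cpuhp_ref++;	/* writer's own get/put_online_cpus() now hit the fast path */
		...
	}

	void cpu_hotplug_done(void)
	{
		...
		current->cpuhp_ref--;
		cpuhp_writer_task = NULL;
	}

and then __get_online_cpus()/__put_online_cpus() can drop the
"== current" checks entirely.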

> +     atomic_inc(&cpuhp_waitcount);
> +
> +     /*
> +      * We either call schedule() in the wait, or we'll fall through
> +      * and reschedule on the preempt_enable() in get_online_cpus().
> +      */
> +     preempt_enable_no_resched();
> +     wait_event(cpuhp_wq, !__cpuhp_writer);
> +     preempt_disable();
> +
> +     /*
> +      * It would be possible for cpu_hotplug_done() to complete before
> +      * the atomic_inc() above; in which case there is no writer waiting
> +      * and doing a wakeup would be BAD (tm).
> +      *
> +      * If however we still observe cpuhp_writer_task here we know
> +      * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +      */
> +     if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> +             cpuhp_writer_wake();

cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
Not only can cpuhp_writer_wake() hit cpuhp_writer_task == NULL (we need
something like ACCESS_ONCE()); its task_struct can already have been
freed/reused if the writer exits.

And I don't really understand the logic... This slow path succeeds without
incrementing any counter (except current->cpuhp_ref)? How can the next writer
notice that it should wait for this reader?

>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done */
> +	cpuhp_writer = 0;
> +	wake_up_all(&cpuhp_wq);
> +
> +	/* Wait for any pending readers to be running */
> +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> +	cpuhp_writer_task = NULL;

We also need to ensure that the next reader sees all changes
done by the writer; IOW, this lacks "release" semantics.




But, Peter, the main question is: why is this better than
percpu_rw_semaphore performance-wise? (Assuming we add
task_struct->cpuhp_ref).

If the writer is pending, percpu_down_read() does

	down_read(&brw->rw_sem);
	atomic_inc(&brw->slow_read_ctr);
	__up_read(&brw->rw_sem);

is it really much worse than wait_event + atomic_dec_and_test?

And! Please note that with your implementation the new readers will
likely be blocked while the writer sleeps in synchronize_sched().
This doesn't happen with percpu_rw_semaphore.
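
For comparison, the whole thing could look roughly like this on top of
percpu_rw_semaphore (a sketch only; the ->cpuhp_ref recursion and the
cpu_hotplug_disable() bits are omitted):

	static struct percpu_rw_semaphore cpu_hotplug_rwsem;

	void get_online_cpus(void)   { percpu_down_read(&cpu_hotplug_rwsem); }
	void put_online_cpus(void)   { percpu_up_read(&cpu_hotplug_rwsem); }

	void cpu_hotplug_begin(void) { percpu_down_write(&cpu_hotplug_rwsem); }
	void cpu_hotplug_done(void)  { percpu_up_write(&cpu_hotplug_rwsem); }

plus a percpu_init_rwsem(&cpu_hotplug_rwsem) somewhere at boot.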

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 14:42                   ` Paul E. McKenney
@ 2013-09-24 16:09                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 16:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > +#define cpuhp_writer_wake()						\
> > +	wake_up_process(cpuhp_writer_task)
> > +
> > +#define cpuhp_writer_wait(cond)						\
> > +do {									\
> > +	for (;;) {							\
> > +		set_current_state(TASK_UNINTERRUPTIBLE);		\
> > +		if (cond)						\
> > +			break;						\
> > +		schedule();						\
> > +	}								\
> > +	__set_current_state(TASK_RUNNING);				\
> > +} while (0)
> 
> Why not wait_event()?  Presumably the above is a bit lighter weight,
> but is that even something that can be measured?

I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
could create a second waitqueue; that might also be a better solution
for the NULL thing below.

> > +	atomic_inc(&cpuhp_waitcount);
> > +
> > +	/*
> > +	 * We either call schedule() in the wait, or we'll fall through
> > +	 * and reschedule on the preempt_enable() in get_online_cpus().
> > +	 */
> > +	preempt_enable_no_resched();
> > +	wait_event(cpuhp_wq, !__cpuhp_writer);
> 
> Finally!  A good use for preempt_enable_no_resched().  ;-)

Hehe, there were a few others, but tglx removed most with the
schedule_preempt_disabled() primitive.

In fact, I considered a wait_event_preempt_disabled() but was too lazy.
That whole wait_event macro fest looks like it could use an iteration or
two of collapse anyhow.

> > +	preempt_disable();
> > +
> > +	/*
> > +	 * It would be possible for cpu_hotplug_done() to complete before
> > +	 * the atomic_inc() above; in which case there is no writer waiting
> > +	 * and doing a wakeup would be BAD (tm).
> > +	 *
> > +	 * If however we still observe cpuhp_writer_task here we know
> > +	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > +	 */
> > +	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> 
> OK, I'll bite...  What sequence of events results in the
> atomic_dec_and_test() returning true but there being no
> cpuhp_writer_task?
> 
> Ah, I see it...

<snip>

Indeed, and

> But what prevents the following sequence of events?

<snip>

> o	Task B's call to cpuhp_writer_wake() sees a NULL pointer.

Quite so... nothing. See, there was a reason I kept getting confused about
it.

> >  void cpu_hotplug_begin(void)
> >  {
> > +	unsigned int count = 0;
> > +	int cpu;
> > +
> > +	lockdep_assert_held(&cpu_add_remove_lock);
> > 
> > +	__cpuhp_writer = 1;
> > +	cpuhp_writer_task = current;
> 
> At this point, the value of cpuhp_slowcount can go negative.  Can't see
> that this causes a problem, given the atomic_add() below.

Agreed.

> > +
> > +	/* After this everybody will observe writer and take the slow path. */
> > +	synchronize_sched();
> > +
> > +	/* Collapse the per-cpu refcount into slowcount */
> > +	for_each_possible_cpu(cpu) {
> > +		count += per_cpu(__cpuhp_refcount, cpu);
> > +		per_cpu(__cpuhp_refcount, cpu) = 0;
> >  	}
> 
> The above is safe because the readers are no longer changing their
> __cpuhp_refcount values.

Yes, I'll expand the comment.

So how about something like this?

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader-in-reader recursion */
+	if (current->cpuhp_ref++) {
+		barrier();
+		return;
+	}
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus();
+	preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	barrier();
+	if (--current->cpuhp_ref)
+		return;
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus();
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	int		cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,100 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+struct task_struct *__cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
 
-void get_online_cpus(void)
-{
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static atomic_t cpuhp_slowcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
 
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
 
-void put_online_cpus(void)
+void __get_online_cpus(void)
 {
-	if (cpu_hotplug.active_writer == current)
+	/* Support reader-in-writer recursion */
+	if (__cpuhp_writer == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	atomic_inc(&cpuhp_waitcount);
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	/*
+	 * We either call schedule() in the wait, or we'll fall through
+	 * and reschedule on the preempt_enable() in get_online_cpus().
+	 */
+	preempt_enable_no_resched();
+	wait_event(cpuhp_readers, !__cpuhp_writer);
+	preempt_disable();
+
+	if (atomic_dec_and_test(&cpuhp_waitcount))
+		wake_up_all(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+	if (__cpuhp_writer == current)
+		return;
 
+	if (atomic_dec_and_test(&cpuhp_slowcount))
+		wake_up_all(&cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	unsigned int count = 0;
+	int cpu;
+
+	lockdep_assert_held(&cpu_add_remove_lock);
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
+	__cpuhp_writer = current;
+
+	/* 
+	 * After this everybody will observe writer and take the slow path.
+	 */
+	synchronize_sched();
+
+	/* 
+	 * Collapse the per-cpu refcount into slowcount. This is safe because
+	 * readers are now taking the slow path (per the above) which doesn't
+	 * touch __cpuhp_refcount.
+	 */
+	for_each_possible_cpu(cpu) {
+		count += per_cpu(__cpuhp_refcount, cpu);
+		per_cpu(__cpuhp_refcount, cpu) = 0;
 	}
+	atomic_add(count, &cpuhp_slowcount);
+
+	/* Wait for all readers to go away */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* Signal the writer is done */
+	cpuhp_writer = NULL;
+	wake_up_all(&cpuhp_readers);
+
+	/* 
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:09                     ` Peter Zijlstra
@ 2013-09-24 16:31                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-24 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/24, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> +	/* Support reader-in-writer recursion */
> +	if (__cpuhp_writer == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
>  
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	atomic_inc(&cpuhp_waitcount);
>  
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
> +	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_readers, !__cpuhp_writer);
> +	preempt_disable();
> +
> +	if (atomic_dec_and_test(&cpuhp_waitcount))
> +		wake_up_all(&cpuhp_writer);

Yes, this should fix the races with the exiting writer, but this still
doesn't look right AFAICS.

In particular let me repeat,

>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
> +
> +	lockdep_assert_held(&cpu_add_remove_lock);
>  
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> +	__cpuhp_writer = current;
> +
> +	/* 
> +	 * After this everybody will observe writer and take the slow path.
> +	 */
> +	synchronize_sched();

synchronize_sched() is slow. The new readers will likely notice
__cpuhp_writer != NULL much earlier and they will be blocked in
__get_online_cpus() while the writer sleeps before it actually
enters the critical section.

Or I completely misunderstood this all?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 12:38                 ` Peter Zijlstra
@ 2013-09-24 16:39                   ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-24 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner

On Tue, 24 Sep 2013 14:38:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> +#define cpuhp_writer_wait(cond)						\
> +do {									\
> +	for (;;) {							\
> +		set_current_state(TASK_UNINTERRUPTIBLE);		\
> +		if (cond)						\
> +			break;						\
> +		schedule();						\
> +	}								\
> +	__set_current_state(TASK_RUNNING);				\
> +} while (0)
> +
> +void __get_online_cpus(void)

The above really needs a comment about how it is meant to be used.
Otherwise, I can envision someone calling this thinking "oh, I can use this
when I'm in a preempt-disabled section", and then the comment below for the
preempt_enable_no_resched() will no longer be true.
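
Something along these lines, perhaps (wording is only a suggestion):

	/*
	 * Slow path of get_online_cpus(); may sleep.  Must only be called
	 * from get_online_cpus() itself: the preempt_enable_no_resched() /
	 * preempt_disable() pair below balances the preempt_disable() done
	 * by the caller, so calling this from some other preempt-disabled
	 * region would be a bug.
	 */
	void __get_online_cpus(void)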

-- Steve


>  {
> -	if (cpu_hotplug.active_writer == current)
> +	if (cpuhp_writer_task == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
>  
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	atomic_inc(&cpuhp_waitcount);
> +
> +	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
> +	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_wq, !__cpuhp_writer);
> +	preempt_disable();
> +
> +	/*
> +	 * It would be possible for cpu_hotplug_done() to complete before
> +	 * the atomic_inc() above; in which case there is no writer waiting
> +	 * and doing a wakeup would be BAD (tm).
> +	 *
> +	 * If however we still observe cpuhp_writer_task here we know
> +	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> +	 */
> +	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> +		cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>  

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:03                   ` Oleg Nesterov
@ 2013-09-24 16:43                     ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-24 16:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner

On Tue, 24 Sep 2013 18:03:59 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	if (current->cpuhp_ref++) {
> > +		barrier();
> > +		return;
> 
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?

I'm confused too. Unless gcc moves this after the release -- but the
release uses preempt_disable(), which is its own barrier.

If anything, it requires a comment.

-- Steve

> 
> The same for put_online_cpus().
> 
> > +void __get_online_cpus(void)
> >  {
> > -	if (cpu_hotplug.active_writer == current)
> > +	if (cpuhp_writer_task == current)
> >  		return;
> 

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:03                   ` Oleg Nesterov
@ 2013-09-24 16:49                     ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-24 16:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	if (current->cpuhp_ref++) {
> > +		barrier();
> > +		return;
> 
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
> 
> The same for put_online_cpus().

The barrier() is needed because of the possibility of inlining, right?

> > +void __get_online_cpus(void)
> >  {
> > -	if (cpu_hotplug.active_writer == current)
> > +	if (cpuhp_writer_task == current)
> >  		return;
> 
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().
> 
> This also means that the writer doing get/put_online_cpus() will
> always use the fast path, and __cpuhp_writer can go away,
> cpuhp_writer_task != NULL can be used instead.

I would need to see the code for this change to be sure.  ;-)

> > +     atomic_inc(&cpuhp_waitcount);
> > +
> > +     /*
> > +      * We either call schedule() in the wait, or we'll fall through
> > +      * and reschedule on the preempt_enable() in get_online_cpus().
> > +      */
> > +     preempt_enable_no_resched();
> > +     wait_event(cpuhp_wq, !__cpuhp_writer);
> > +     preempt_disable();
> > +
> > +     /*
> > +      * It would be possible for cpu_hotplug_done() to complete before
> > +      * the atomic_inc() above; in which case there is no writer waiting
> > +      * and doing a wakeup would be BAD (tm).
> > +      *
> > +      * If however we still observe cpuhp_writer_task here we know
> > +      * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > +      */
> > +     if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > +             cpuhp_writer_wake();
> 
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
> Not only can cpuhp_writer_wake() hit cpuhp_writer_task == NULL (we need
> something like ACCESS_ONCE()); its task_struct can already have been
> freed/reused if the writer exits.
> 
> And I don't really understand the logic... This slow path succeeds without
> incrementing any counter (except current->cpuhp_ref)? How can the next writer
> notice that it should wait for this reader?
> 
> >  void cpu_hotplug_done(void)
> >  {
> > -	cpu_hotplug.active_writer = NULL;
> > -	mutex_unlock(&cpu_hotplug.lock);
> > +	/* Signal the writer is done */
> > +	cpuhp_writer = 0;
> > +	wake_up_all(&cpuhp_wq);
> > +
> > +	/* Wait for any pending readers to be running */
> > +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > +	cpuhp_writer_task = NULL;
> 
> We also need to ensure that the next reader sees all changes
> done by the writer; IOW, this lacks "release" semantics.

Good point -- I was expecting wake_up_all() to provide the release
semantics, but code could be reordered into __wake_up()'s critical
section, especially in the case where there was nothing to wake
up, but where there were new readers starting concurrently with
cpu_hotplug_done().

> But, Peter, the main question is: why is this better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
> 
> If the writer is pending, percpu_down_read() does
> 
> 	down_read(&brw->rw_sem);
> 	atomic_inc(&brw->slow_read_ctr);
> 	__up_read(&brw->rw_sem);
> 
> is it really much worse than wait_event + atomic_dec_and_test?
> 
> And! Please note that with your implementation the new readers will
> likely be blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:03                   ` Oleg Nesterov
@ 2013-09-24 16:51                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 16:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	if (current->cpuhp_ref++) {
> > +		barrier();
> > +		return;
> 
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
> 
> The same for put_online_cpus().

to make {get,put}_online_cpus() always behave like per-cpu lock
sections.

I don't think it's ever 'correct' for loads/stores to escape the section,
even if not strictly harmful.
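
That is, even on the recursive fast path I want (variable name is just a
placeholder, sketch only):

	get_online_cpus();		/* recursion: only current->cpuhp_ref++ */
	x = data_protected_by_hotplug;	/* must stay between the ++ and -- */
	put_online_cpus();		/* recursion: only current->cpuhp_ref-- */

and without the barrier()s the compiler would, in principle, be free to
move that access outside the ++/-- pair.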

> > +void __get_online_cpus(void)
> >  {
> > -	if (cpu_hotplug.active_writer == current)
> > +	if (cpuhp_writer_task == current)
> >  		return;
> 
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().

Oh indeed!

> > +     if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > +             cpuhp_writer_wake();
> 
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...

Yeah it is. Paul already said.

> But, Peter, the main question is: why is this better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
> 
> If the writer is pending, percpu_down_read() does
> 
> 	down_read(&brw->rw_sem);
> 	atomic_inc(&brw->slow_read_ctr);
> 	__up_read(&brw->rw_sem);
> 
> is it really much worse than wait_event + atomic_dec_and_test?
> 
> And! Please note that with your implementation the new readers will
> likely be blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

Good points, both; no, I don't think there's a significant performance gap
there.

I'm still hoping we can come up with something better, though :/ I don't
particularly like either.



^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:49                     ` Paul E. McKenney
@ 2013-09-24 16:54                       ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 16:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > >  void cpu_hotplug_done(void)
> > >  {
> > > +	/* Signal the writer is done */
> > > +	cpuhp_writer = 0;
> > > +	wake_up_all(&cpuhp_wq);
> > > +
> > > +	/* Wait for any pending readers to be running */
> > > +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > +	cpuhp_writer_task = NULL;
> > 
> > We also need to ensure that the next reader should see all changes
> > done by the writer, iow this lacks "realease" semantics.
> 
> Good point -- I was expecting wake_up_all() to provide the release
> semantics, but code could be reordered into __wake_up()'s critical
> section, especially in the case where there was nothing to wake
> up, but where there were new readers starting concurrently with
> cpu_hotplug_done().

Doh, indeed. I missed this in Oleg's email, but yes I made that same
assumption about wake_up_all().

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:54                       ` Peter Zijlstra
@ 2013-09-24 17:02                         ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-24 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> > > > +	/* Signal the writer is done */
> > > > +	cpuhp_writer = 0;
> > > > +	wake_up_all(&cpuhp_wq);
> > > > +
> > > > +	/* Wait for any pending readers to be running */
> > > > +	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > > +	cpuhp_writer_task = NULL;
> > >
> > > We also need to ensure that the next reader sees all changes
> > > done by the writer, iow this lacks "release" semantics.
> >
> > Good point -- I was expecting wake_up_all() to provide the release
> > semantics, but code could be reordered into __wake_up()'s critical
> > section, especially in the case where there was nothing to wake
> > up, but where there were new readers starting concurrently with
> > cpu_hotplug_done().
>
> Doh, indeed. I missed this in Oleg's email, but yes I made that same
> assumption about wake_up_all().

Well, I think this is even worse... No matter what the writer does,
the new reader needs an mb() after it checks !__cpuhp_writer, or we
need another synchronize_sched() in cpu_hotplug_done(). This is
what percpu_rw_semaphore does (as a reminder, this can be turned
into call_rcu).
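
In reader-side terms the concern is roughly the following (illustrative
sketch of the fast path only, not the patch itself):

	preempt_disable();
	if (likely(!__cpuhp_writer)) {
		/*
		 * Needs smp_mb() here (or a second synchronize_sched() in
		 * cpu_hotplug_done()) so that everything the writer did
		 * before clearing __cpuhp_writer is visible before the
		 * critical section starts using it.
		 */
		smp_mb();
		__this_cpu_inc(__cpuhp_refcount);
	}
	preempt_enable();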

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:43                     ` Steven Rostedt
@ 2013-09-24 17:06                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-24 17:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner

On 09/24, Steven Rostedt wrote:
>
> On Tue, 24 Sep 2013 18:03:59 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 09/24, Peter Zijlstra wrote:
> > >
> > > +static inline void get_online_cpus(void)
> > > +{
> > > +	might_sleep();
> > > +
> > > +	if (current->cpuhp_ref++) {
> > > +		barrier();
> > > +		return;
> >
> > I don't understand this barrier()... we are going to return if we already
> > hold the lock, do we really need it?
>
> I'm confused too. Unless gcc moves this after the release, but the
> release uses preempt_disable() which is its own barrier.
>
> If anything, it requires a comment.

And I am still confused even after emails from Paul and Peter...

If gcc can actually do something wrong, then I suspect this barrier()
should be unconditional.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 17:06                       ` Oleg Nesterov
@ 2013-09-24 17:47                         ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-24 17:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> On 09/24, Steven Rostedt wrote:
> >
> > On Tue, 24 Sep 2013 18:03:59 +0200
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > On 09/24, Peter Zijlstra wrote:
> > > >
> > > > +static inline void get_online_cpus(void)
> > > > +{
> > > > +	might_sleep();
> > > > +
> > > > +	if (current->cpuhp_ref++) {
> > > > +		barrier();
> > > > +		return;
> > >
> > > I don't understand this barrier()... we are going to return if we already
> > > hold the lock, do we really need it?
> >
> > I'm confused too. Unless gcc moves this after the release, but the
> > release uses preempt_disable() which is its own barrier.
> >
> > If anything, it requires a comment.
> 
> And I am still confused even after emails from Paul and Peter...
> 
> If gcc can actually do something wrong, then I suspect this barrier()
> should be unconditional.

If you are saying that there should be a barrier() on all return paths
from get_online_cpus(), I agree.
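
I.e., something along these lines (sketch of the idea only, reusing the
names from Peter's patch):

	static inline void get_online_cpus(void)
	{
		might_sleep();

		if (!current->cpuhp_ref++) {
			preempt_disable();
			if (likely(!__cpuhp_writer))
				__this_cpu_inc(__cpuhp_refcount);
			else
				__get_online_cpus();
			preempt_enable();
		}

		barrier();	/* compiler barrier on every return path */
	}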

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 17:47                         ` Paul E. McKenney
@ 2013-09-24 18:00                           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-24 18:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On 09/24, Paul E. McKenney wrote:
>
> On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> >
> > If gcc can actually do something wrong, then I suspect this barrier()
> > should be unconditional.
>
> If you are saying that there should be a barrier() on all return paths
> from get_online_cpus(), I agree.

Paul, Peter, could you provide any (even completely artificial) example
to explain to me why we need this barrier()? I am puzzled. And
preempt_enable() already has a barrier...

	get_online_cpus();
	do_something();

Yes, we need to ensure gcc doesn't reorder this code so that
do_something() comes before get_online_cpus(). But it can't? At least
it should check current->cpuhp_ref != 0 first? And if it is non-zero
we do not really care, we are already in the critical section and
this ->cpuhp_ref only has meaning in put_online_cpus().

Confused...

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-23 17:32                   ` Oleg Nesterov
@ 2013-09-24 20:24                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 20:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Mon, Sep 23, 2013 at 07:32:03PM +0200, Oleg Nesterov wrote:
> > static void cpuhp_wait_refcount(void)
> > {
> > 	for (;;) {
> > 		unsigned int rc1, rc2;
> >
> > 		rc1 = cpuhp_refcount();
> > 		set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
> > 		rc2 = cpuhp_refcount();
> >
> > 		if (rc1 == rc2 && !rc1)
> 
> But this only makes the race above "theoretical ** 2". Both
> cpuhp_refcount()'s can be equally fooled.
> 
> Looks like, cpuhp_refcount() should take all per-cpu cpuhp_lock's
> before it reads __cpuhp_refcount.

Ah, so SRCU has a solution for this using a sequence count.

So now we drop from a no-memory-barrier fast path, into a memory-barrier
'slow' path, and from there into blocking.

Only once we block do we hit global state.

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader-in-reader recursion */
+	if (current->cpuhp_ref++) {
+		barrier();
+		return;
+	}
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus();
+	preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	barrier();
+	if (--current->cpuhp_ref)
+		return;
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus();
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	int		cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,148 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+	if (__cpuhp_writer == 1) {
+		/* See __srcu_read_lock() */
+		__this_cpu_inc(__cpuhp_refcount);
+		smp_mb();
+		__this_cpu_inc(cpuhp_seq);
+		return;
+	}
+
+	atomic_inc(&cpuhp_waitcount);
+
 	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
+	 * We either call schedule() in the wait, or we'll fall through
+	 * and reschedule on the preempt_enable() in get_online_cpus().
 	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+	preempt_enable_no_resched();
+	wait_event(cpuhp_readers, !__cpuhp_writer);
+	preempt_disable();
 
-void get_online_cpus(void)
+	/*
+	 * XXX list_empty_careful(&cpuhp_readers.task_list) ?
+	 */
+	if (atomic_dec_and_test(&cpuhp_waitcount))
+		wake_up_all(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
 {
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* See __srcu_read_unlock() */
+	smp_mb();
+	this_cpu_dec(__cpuhp_refcount);
 
+	/* Prod writer to recheck readers_active */
+	wake_up_all(&cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
-void put_online_cpus(void)
+static unsigned int cpuhp_seq(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	unsigned int seq = 0;
+	int cpu;
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	for_each_possible_cpu(cpu)
+		seq += per_cpu(cpuhp_seq, cpu);
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	return seq;
+}
+
+static unsigned int cpuhp_refcount(void)
+{
+	unsigned int refcount = 0;
+	int cpu;
 
+	for_each_possible_cpu(cpu)
+		refcount += per_cpu(__cpuhp_refcount, cpu);
+
+	return refcount;
+}
+
+/*
+ * See srcu_readers_active_idx_check()
+ */
+static bool cpuhp_readers_active_check(void)
+{
+	unsigned int seq = cpuhp_seq();
+
+	smp_mb();
+
+	if (cpuhp_refcount() != 0)
+		return false;
+
+	smp_mb();
+
+	return cpuhp_seq() == seq;
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	unsigned int count = 0;
+	int cpu;
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
+	lockdep_assert_held(&cpu_add_remove_lock);
+
+	/* allow reader-in-writer recursion */
+	current->cpuhp_ref++;
+
+	/* make readers take the slow path */
+	__cpuhp_writer = 1;
+
+	/* See percpu_down_write() */
+	synchronize_sched();
+
+	/* make readers block */
+	__cpuhp_writer = 2;
+
+	/* Wait for all readers to go away */
+	wait_event(cpuhp_writer, cpuhp_readers_active_check());
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* Signal the writer is done, no fast path yet */
+	__cpuhp_writer = 1;
+	wake_up_all(&cpuhp_readers);
+
+	/* See percpu_up_write() */
+	synchronize_sched();
+
+	/* Let em rip */
+	__cpuhp_writer = 0;
+	current->cpuhp_ref--;
+
+	/*
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 18:00                           ` Oleg Nesterov
@ 2013-09-24 20:35                             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 20:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Steven Rostedt, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> On 09/24, Paul E. McKenney wrote:
> >
> > On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> > >
> > > If gcc can actually do something wrong, then I suspect this barrier()
> > > should be unconditional.
> >
> > If you are saying that there should be a barrier() on all return paths
> > from get_online_cpus(), I agree.
> 
> Paul, Peter, could you provide any (even completely artificial) example
> to explain me why do we need this barrier() ? I am puzzled. And
> preempt_enable() already has barrier...
> 
> 	get_online_cpus();
> 	do_something();
> 
> Yes, we need to ensure gcc doesn't reorder this code so that
> do_something() comes before get_online_cpus(). But it can't? At least
> it should check current->cpuhp_ref != 0 first? And if it is non-zero
> we do not really care, we are already in the critical section and
> this ->cpuhp_ref has only meaning in put_online_cpus().
> 
> Confused...


So the reason I put it in was the inline; the inlining could possibly
make the generated code do:

  test  0, current->cpuhp_ref
  je	label1
  inc	current->cpuhp_ref

label2:
  do_something();

label1:
  inc	%gs:__preempt_count
  test	0, __cpuhp_writer
  jne	label3
  inc	%gs:__cpuhp_refcount
label5:
  dec	%gs:__preempt_count
  je	label4
  jmp	label2
label3:
  call	__get_online_cpus();
  jmp	label5
label4:
  call	____preempt_schedule();
  jmp	label2

In which case the recursive fast path doesn't have a barrier() between
taking the ref and starting do_something().

I wanted to make absolutely sure nothing of do_something() leaked above
the label2 point. The other paths all get a barrier() from the
preempt_count ops.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 20:24                     ` Peter Zijlstra
@ 2013-09-24 21:02                       ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-24 21:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 10:24:23PM +0200, Peter Zijlstra wrote:
> +void __get_online_cpus(void)
> +{
> +	if (__cpuhp_writer == 1) {
take_ref:
> +		/* See __srcu_read_lock() */
> +		__this_cpu_inc(__cpuhp_refcount);
> +		smp_mb();
> +		__this_cpu_inc(cpuhp_seq);
> +		return;
> +	}
> +
> +	atomic_inc(&cpuhp_waitcount);
> +
>  	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
>  	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_readers, !__cpuhp_writer);
> +	preempt_disable();
>  
> +	/*
> +	 * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> +	 */
> +	if (atomic_dec_and_test(&cpuhp_waitcount))
> +		wake_up_all(&cpuhp_writer);
	goto take_ref;
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);

It would probably be a good idea to increment __cpuhp_refcount after the
wait_event.
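
Putting those annotations together, the reworked helper would look roughly
like this (sketch only, untested):

	void __get_online_cpus(void)
	{
		if (__cpuhp_writer == 1) {
	take_ref:
			/* See __srcu_read_lock() */
			__this_cpu_inc(__cpuhp_refcount);
			smp_mb();
			__this_cpu_inc(cpuhp_seq);
			return;
		}

		atomic_inc(&cpuhp_waitcount);

		/*
		 * We either call schedule() in the wait, or we'll fall through
		 * and reschedule on the preempt_enable() in get_online_cpus().
		 */
		preempt_enable_no_resched();
		wait_event(cpuhp_readers, !__cpuhp_writer);
		preempt_disable();

		if (atomic_dec_and_test(&cpuhp_waitcount))
			wake_up_all(&cpuhp_writer);

		/* Only take the reference once the writer is gone. */
		goto take_ref;
	}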

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 16:09                     ` Peter Zijlstra
@ 2013-09-24 21:09                       ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-24 21:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Sep 24, 2013 at 06:09:59PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > > +#define cpuhp_writer_wake()						\
> > > +	wake_up_process(cpuhp_writer_task)
> > > +
> > > +#define cpuhp_writer_wait(cond)						\
> > > +do {									\
> > > +	for (;;) {							\
> > > +		set_current_state(TASK_UNINTERRUPTIBLE);		\
> > > +		if (cond)						\
> > > +			break;						\
> > > +		schedule();						\
> > > +	}								\
> > > +	__set_current_state(TASK_RUNNING);				\
> > > +} while (0)
> > 
> > Why not wait_event()?  Presumably the above is a bit lighter weight,
> > but is that even something that can be measured?
> 
> I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
> could create a second waitqueue; that might also be a better solution
> for the NULL thing below.

That would have the advantage of being a bit less racy.

> > > +	atomic_inc(&cpuhp_waitcount);
> > > +
> > > +	/*
> > > +	 * We either call schedule() in the wait, or we'll fall through
> > > +	 * and reschedule on the preempt_enable() in get_online_cpus().
> > > +	 */
> > > +	preempt_enable_no_resched();
> > > +	wait_event(cpuhp_wq, !__cpuhp_writer);
> > 
> > Finally!  A good use for preempt_enable_no_resched().  ;-)
> 
> Hehe, there were a few others, but tglx removed most with the
> schedule_preempt_disabled() primitive.

;-)

> In fact, I considered a wait_event_preempt_disabled() but was too lazy.
> That whole wait_event macro fest looks like it could use an iteration or
> two of collapse anyhow.

There are some serious layers there, aren't there?

> > > +	preempt_disable();
> > > +
> > > +	/*
> > > +	 * It would be possible for cpu_hotplug_done() to complete before
> > > +	 * the atomic_inc() above; in which case there is no writer waiting
> > > +	 * and doing a wakeup would be BAD (tm).
> > > +	 *
> > > +	 * If however we still observe cpuhp_writer_task here we know
> > > +	 * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > > +	 */
> > > +	if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > 
> > OK, I'll bite...  What sequence of events results in the
> > atomic_dec_and_test() returning true but there being no
> > cpuhp_writer_task?
> > 
> > Ah, I see it...
> 
> <snip>
> 
> Indeed, and
> 
> > But what prevents the following sequence of events?
> 
> <snip>
> 
> > o	Task B's call to cpuhp_writer_wake() sees a NULL pointer.
> 
Quite so... nothing. See, there was a reason I kept being confused about
it.
> 
> > >  void cpu_hotplug_begin(void)
> > >  {
> > > +	unsigned int count = 0;
> > > +	int cpu;
> > > +
> > > +	lockdep_assert_held(&cpu_add_remove_lock);
> > > 
> > > +	__cpuhp_writer = 1;
> > > +	cpuhp_writer_task = current;
> > 
> > At this point, the value of cpuhp_slowcount can go negative.  Can't see
> > that this causes a problem, given the atomic_add() below.
> 
> Agreed.
> 
> > > +
> > > +	/* After this everybody will observe writer and take the slow path. */
> > > +	synchronize_sched();
> > > +
> > > +	/* Collapse the per-cpu refcount into slowcount */
> > > +	for_each_possible_cpu(cpu) {
> > > +		count += per_cpu(__cpuhp_refcount, cpu);
> > > +		per_cpu(__cpuhp_refcount, cpu) = 0;
> > >  	}
> > 
> > The above is safe because the readers are no longer changing their
> > __cpuhp_refcount values.
> 
> Yes, I'll expand the comment.
> 
> So how about something like this?

A few memory barriers are required, if I am reading the code correctly.
Some of them, perhaps all of them, were already called out by Oleg.

							Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
> 
>  struct device;
> 
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	/* Support reader-in-reader recursion */
> +	if (current->cpuhp_ref++) {
> +		barrier();
> +		return;
> +	}
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_inc(__cpuhp_refcount);

As Oleg noted, need a barrier here for when a new reader runs concurrently
with a completing writer.

> +	else
> +		__get_online_cpus();
> +	preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	barrier();
> +	if (--current->cpuhp_ref)
> +		return;
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_dec(__cpuhp_refcount);

No barrier needed here because synchronize_sched() covers it.

> +	else
> +		__put_online_cpus();
> +	preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> @@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else		/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int		cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,100 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> -static struct {
> -	struct task_struct *active_writer;
> -	struct mutex lock; /* Synchronizes accesses to refcount, */
> -	/*
> -	 * Also blocks the new readers during
> -	 * an ongoing cpu hotplug operation.
> -	 */
> -	int refcount;
> -} cpu_hotplug = {
> -	.active_writer = NULL,
> -	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> -	.refcount = 0,
> -};
> +struct task_struct *__cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> 
> -void get_online_cpus(void)
> -{
> -	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> -	cpu_hotplug.refcount++;
> -	mutex_unlock(&cpu_hotplug.lock);
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> 
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	p->cpuhp_ref = 0;
>  }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
> 
> -void put_online_cpus(void)
> +void __get_online_cpus(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> +	/* Support reader-in-writer recursion */
> +	if (__cpuhp_writer == current)
>  		return;
> -	mutex_lock(&cpu_hotplug.lock);
> 
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	atomic_inc(&cpuhp_waitcount);
> 
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/*
> +	 * We either call schedule() in the wait, or we'll fall through
> +	 * and reschedule on the preempt_enable() in get_online_cpus().
> +	 */
> +	preempt_enable_no_resched();
> +	wait_event(cpuhp_readers, !__cpuhp_writer);
> +	preempt_disable();
> +
> +	if (atomic_dec_and_test(&cpuhp_waitcount))

This provides the needed memory barrier for concurrent write releases.

> +		wake_up_all(&cpuhp_writer);
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> +
> +void __put_online_cpus(void)
> +{
> +	if (__cpuhp_writer == current)
> +		return;
> 
> +	if (atomic_dec_and_test(&cpuhp_slowcount))

This provides the needed memory barrier for concurrent write acquisitions.

> +		wake_up_all(&cpuhp_writer);
>  }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> 
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
>   */
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
> +
> +	lockdep_assert_held(&cpu_add_remove_lock);
> 
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> +	__cpuhp_writer = current;
> +
> +	/* 
> +	 * After this everybody will observe writer and take the slow path.
> +	 */
> +	synchronize_sched();
> +
> +	/* 
> +	 * Collapse the per-cpu refcount into slowcount. This is safe because
> +	 * readers are now taking the slow path (per the above) which doesn't
> +	 * touch __cpuhp_refcount.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		count += per_cpu(__cpuhp_refcount, cpu);
> +		per_cpu(__cpuhp_refcount, cpu) = 0;
>  	}
> +	atomic_add(count, &cpuhp_slowcount);
> +
> +	/* Wait for all readers to go away */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));

Oddly enough, there appear to be cases where you need a memory barrier
here.  Suppose that all the readers finish after the atomic_add() above,
but before the wait_event().  Then wait_event() just checks the condition
without any memory barriers.  So an smp_mb() is needed here.

/me runs off to check RCU's use of wait_event()...

Found one missing.  And some places in need of comments.  And a few
places that could use an ACCESS_ONCE().

Back to the review...

>  }
> 
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done */

And I believe we need a memory barrier here to keep the write-side
critical section confined from the viewpoint of a reader that starts
just after the NULLing of cpuhp_writer.

Of course, being who I am, I cannot resist pointing out that you have
the same number of memory barriers as would use of SRCU, and that
synchronize_srcu() can be quite a bit faster than synchronize_sched()
in the case where there are no readers.  ;-)

> +	cpuhp_writer = NULL;
> +	wake_up_all(&cpuhp_readers);
> +
> +	/* 
> +	 * Wait for any pending readers to be running. This ensures readers
> +	 * after writer and avoids writers starving readers.
> +	 */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }
> 
>  /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

> 
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - *   writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - *   non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
>   */
>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
> +
> +	lockdep_assert_held(&cpu_add_remove_lock);
> 
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> +	__cpuhp_writer = current;
> +
> +	/* 
> +	 * After this everybody will observe writer and take the slow path.
> +	 */
> +	synchronize_sched();
> +
> +	/* 
> +	 * Collapse the per-cpu refcount into slowcount. This is safe because
> +	 * readers are now taking the slow path (per the above) which doesn't
> +	 * touch __cpuhp_refcount.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		count += per_cpu(__cpuhp_refcount, cpu);
> +		per_cpu(__cpuhp_refcount, cpu) = 0;
>  	}
> +	atomic_add(count, &cpuhp_slowcount);
> +
> +	/* Wait for all readers to go away */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));

Oddly enough, there appear to be cases where you need a memory barrier
here.  Suppose that all the readers finish after the atomic_add() above,
but before the wait_event().  Then wait_event() just checks the condition
without any memory barriers.  So smp_mb() needed here.

/me runs off to check RCU's use of wait_event()...

Found one missing.  And some places in need of comments.  And a few
places that could use an ACCESS_ONCE().

Back to the review...
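
To make that concrete, the change being asked for would be something like the
following (only a sketch of the point, not code from the posted patch):

	/* Wait for all readers to go away */
	wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));

	/*
	 * If the readers were already gone, wait_event() saw the condition
	 * true and returned without blocking, hence without any implied
	 * memory barrier; an explicit smp_mb() here orders the hotplug
	 * operation after the observed zero, pairing with the full barrier
	 * in the readers' atomic_dec_and_test().
	 */
	smp_mb();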

>  }
> 
>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done */

And I believe we need a memory barrier here to keep the write-side
critical section confined from the viewpoint of a reader that starts
just after the NULLing of cpuhp_writer.

Of course, being who I am, I cannot resist pointing out that you have
the same number of memory barriers as would use of SRCU, and that
synchronize_srcu() can be quite a bit faster than synchronize_sched()
in the case where there are no readers.  ;-)

> +	__cpuhp_writer = NULL;
> +	wake_up_all(&cpuhp_readers);
> +
> +	/* 
> +	 * Wait for any pending readers to be running. This ensures readers
> +	 * after writer and avoids writers starving readers.
> +	 */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }
> 
>  /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 20:35                             ` Peter Zijlstra
@ 2013-09-25 15:16                               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-25 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Steven Rostedt, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> >
> > Yes, we need to ensure gcc doesn't reorder this code so that
> > do_something() comes before get_online_cpus(). But it can't? At least
> > it should check current->cpuhp_ref != 0 first? And if it is non-zero
> > we do not really care, we are already in the critical section and
> > this ->cpuhp_ref has only meaning in put_online_cpus().
> >
> > Confused...
>
>
> So the reason I put it in was because of the inline; it could possibly
> make it do:

[...snip...]

> In which case the recursive fast path doesn't have a barrier() between
> taking the ref and starting do_something().

Yes, but my point was, this can only happen in recursive fast path.
And in this case (I think) we do not care, we are already in the critical
section.

current->cpuhp_ref doesn't matter at all until we call put_online_cpus().

Suppose that gcc knows for sure that current->cpuhp_ref != 0. Then I
think, for example,

	get_online_cpus();
	do_something();
	put_online_cpus();

converted to

	do_something();
	current->cpuhp_ref++;
	current->cpuhp_ref--;

is fine. do_something() should not depend on ->cpuhp_ref.
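
For concreteness, the obvious way gcc could know that in advance is a nested
reader, e.g. (illustration only):

	get_online_cpus();	/* outer: takes the per-cpu reference */

	get_online_cpus();	/* inner: only current->cpuhp_ref++ */
	do_something();		/* already inside the outer section */
	put_online_cpus();	/* inner: only current->cpuhp_ref-- */

	put_online_cpus();	/* outer: drops the per-cpu reference */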

OK, please forget. I guess I will never understand this ;)

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 15:16                               ` Oleg Nesterov
@ 2013-09-25 15:35                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-25 15:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Steven Rostedt, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
> Yes, but my point was, this can only happen in recursive fast path.

Right, I understood.

> And in this case (I think) we do not care, we are already in the critical
> section.

I tend to agree, however paranoia..

> OK, please forget. I guess I will never understand this ;)

It might just be I'm less certain about there not being any avenue of
mischief.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-24 20:24                     ` Peter Zijlstra
@ 2013-09-25 15:55                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-25 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On 09/24, Peter Zijlstra wrote:
>
> So now we drop from a no memory barriers fast path, into a memory
> barrier 'slow' path into blocking.

Cough... can't understand the above ;) In fact I can't understand
the patch... see below. But in any case, afaics the fast path
needs mb() unless you add another synchronize_sched() into
cpu_hotplug_done().

> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	/* Support reader-in-reader recursion */
> +	if (current->cpuhp_ref++) {
> +		barrier();
> +		return;
> +	}
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_inc(__cpuhp_refcount);

mb() to ensure the reader can't miss, say, a STORE done inside
the cpu_hotplug_begin/end section.

put_online_cpus() needs mb() as well.

> +void __get_online_cpus(void)
> +{
> +	if (__cpuhp_writer == 1) {
> +		/* See __srcu_read_lock() */
> +		__this_cpu_inc(__cpuhp_refcount);
> +		smp_mb();
> +		__this_cpu_inc(cpuhp_seq);
> +		return;
> +	}

OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
the "stable" numbers. Looks suspicious... but lets assume this
works.

However, I do not see how "__cpuhp_writer == 1" can work, please
see below.

> +	/*
> +	 * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> +	 */
> +	if (atomic_dec_and_test(&cpuhp_waitcount))
> +		wake_up_all(&cpuhp_writer);

Same problem as in previous version. __get_online_cpus() succeeds
without incrementing __cpuhp_refcount. "goto start" can't help
afaics.

>  void cpu_hotplug_begin(void)
>  {
> -	cpu_hotplug.active_writer = current;
> +	unsigned int count = 0;
> +	int cpu;
>  
> -	for (;;) {
> -		mutex_lock(&cpu_hotplug.lock);
> -		if (likely(!cpu_hotplug.refcount))
> -			break;
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&cpu_hotplug.lock);
> -		schedule();
> -	}
> +	lockdep_assert_held(&cpu_add_remove_lock);
> +
> +	/* allow reader-in-writer recursion */
> +	current->cpuhp_ref++;
> +
> +	/* make readers take the slow path */
> +	__cpuhp_writer = 1;
> +
> +	/* See percpu_down_write() */
> +	synchronize_sched();

Suppose there are no readers at this point,

> +
> +	/* make readers block */
> +	__cpuhp_writer = 2;
> +
> +	/* Wait for all readers to go away */
> +	wait_event(cpuhp_writer, cpuhp_readers_active_check());

So wait_event() "quickly" returns.

Now. Why the new reader should see __cpuhp_writer = 2 ? It can
still see it == 1, and take that "if (__cpuhp_writer == 1)" path
above.
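
As a rough timeline of that interleaving:

	/*
	 *	writer					new reader
	 *	------					----------
	 *	__cpuhp_writer = 1;
	 *	synchronize_sched();   (no readers yet)
	 *	__cpuhp_writer = 2;
	 *	wait_event() sees zero refcounts,
	 *	returns immediately
	 *						still sees __cpuhp_writer == 1,
	 *						takes the "== 1" branch,
	 *						increments __cpuhp_refcount and
	 *						runs concurrently with the
	 *						hotplug operation
	 */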

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 15:35                                 ` Peter Zijlstra
@ 2013-09-25 16:33                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-25 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Steven Rostedt, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner

On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
>
> > And in this case (I think) we do not care, we are already in the critical
> > section.
>
> I tend to agree, however paranoia..

Ah, in this case I tend to agree. better be paranoid ;)

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 15:55                       ` Oleg Nesterov
@ 2013-09-25 16:59                         ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-25 16:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
> 
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

For whatever it is worth, I too don't see how it works without read-side
memory barriers.

							Thanx, Paul

> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	/* Support reader-in-reader recursion */
> > +	if (current->cpuhp_ref++) {
> > +		barrier();
> > +		return;
> > +	}
> > +
> > +	preempt_disable();
> > +	if (likely(!__cpuhp_writer))
> > +		__this_cpu_inc(__cpuhp_refcount);
> 
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
> 
> put_online_cpus() needs mb() as well.
> 
> > +void __get_online_cpus(void)
> > +{
> > +	if (__cpuhp_writer == 1) {
> > +		/* See __srcu_read_lock() */
> > +		__this_cpu_inc(__cpuhp_refcount);
> > +		smp_mb();
> > +		__this_cpu_inc(cpuhp_seq);
> > +		return;
> > +	}
> 
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.
> 
> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
> 
> > +	/*
> > +	 * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> > +	 */
> > +	if (atomic_dec_and_test(&cpuhp_waitcount))
> > +		wake_up_all(&cpuhp_writer);
> 
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > -	cpu_hotplug.active_writer = current;
> > +	unsigned int count = 0;
> > +	int cpu;
> >  
> > -	for (;;) {
> > -		mutex_lock(&cpu_hotplug.lock);
> > -		if (likely(!cpu_hotplug.refcount))
> > -			break;
> > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > -		mutex_unlock(&cpu_hotplug.lock);
> > -		schedule();
> > -	}
> > +	lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > +	/* allow reader-in-writer recursion */
> > +	current->cpuhp_ref++;
> > +
> > +	/* make readers take the slow path */
> > +	__cpuhp_writer = 1;
> > +
> > +	/* See percpu_down_write() */
> > +	synchronize_sched();
> 
> Suppose there are no readers at this point,
> 
> > +
> > +	/* make readers block */
> > +	__cpuhp_writer = 2;
> > +
> > +	/* Wait for all readers to go away */
> > +	wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> So wait_event() "quickly" returns.
> 
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 15:55                       ` Oleg Nesterov
@ 2013-09-25 17:43                         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-25 17:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
> 
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

Sure we can add more ;-) But I went with percpu_up_write(), it too does
the sync_sched() before clearing the fast path state.

> > +static inline void get_online_cpus(void)
> > +{
> > +	might_sleep();
> > +
> > +	/* Support reader-in-reader recursion */
> > +	if (current->cpuhp_ref++) {
> > +		barrier();
> > +		return;
> > +	}
> > +
> > +	preempt_disable();
> > +	if (likely(!__cpuhp_writer))
> > +		__this_cpu_inc(__cpuhp_refcount);
> 
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
> 
> put_online_cpus() needs mb() as well.

OK, I'm not getting this; why isn't the sync_sched sufficient to get out
of this fast path without barriers?

> > +void __get_online_cpus(void)
> > +{
> > +	if (__cpuhp_writer == 1) {
> > +		/* See __srcu_read_lock() */
> > +		__this_cpu_inc(__cpuhp_refcount);
> > +		smp_mb();
> > +		__this_cpu_inc(cpuhp_seq);
> > +		return;
> > +	}
> 
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.

I 'borrowed' it from SRCU, so if it's broken here it's broken there too I
suppose.

> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
> 
> > +	if (atomic_dec_and_test(&cpuhp_waitcount))
> > +		wake_up_all(&cpuhp_writer);
> 
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.

I added a goto into the cond-block, not before the cond; but see the
version below.

> >  void cpu_hotplug_begin(void)
> >  {
> > +	unsigned int count = 0;
> > +	int cpu;
> >  
> > +	lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > +	/* allow reader-in-writer recursion */
> > +	current->cpuhp_ref++;
> > +
> > +	/* make readers take the slow path */
> > +	__cpuhp_writer = 1;
> > +
> > +	/* See percpu_down_write() */
> > +	synchronize_sched();
> 
> Suppose there are no readers at this point,
> 
> > +
> > +	/* make readers block */
> > +	__cpuhp_writer = 2;
> > +
> > +	/* Wait for all readers to go away */
> > +	wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> So wait_event() "quickly" returns.
> 
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.

OK, .. I see the hole, no immediate way to fix it -- too tired atm.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 17:43                         ` Peter Zijlstra
@ 2013-09-25 17:50                           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-25 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
>
> > > +static inline void get_online_cpus(void)
> > > +{
> > > +	might_sleep();
> > > +
> > > +	/* Support reader-in-reader recursion */
> > > +	if (current->cpuhp_ref++) {
> > > +		barrier();
> > > +		return;
> > > +	}
> > > +
> > > +	preempt_disable();
> > > +	if (likely(!__cpuhp_writer))
> > > +		__this_cpu_inc(__cpuhp_refcount);
> >
> > mb() to ensure the reader can't miss, say, a STORE done inside
> > the cpu_hotplug_begin/end section.
> >
> > put_online_cpus() needs mb() as well.
>
> OK, I'm not getting this; why isn't the sync_sched sufficient to get out
> of this fast path without barriers?

Aah, sorry, I didn't notice this version has another synchronize_sched()
in cpu_hotplug_done().

Then I need to recheck again...

No. Too tired too ;) damn LSB test failures...

> > > +	if (atomic_dec_and_test(&cpuhp_waitcount))
> > > +		wake_up_all(&cpuhp_writer);
> >
> > Same problem as in previous version. __get_online_cpus() succeeds
> > without incrementing __cpuhp_refcount. "goto start" can't help
> > afaics.
>
> I added a goto into the cond-block, not before the cond; but see the
> version below.

"into the cond-block" doesn't look right too, at first glance. This
always succeeds, but by this time another writer can already hold
the lock.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 17:50                           ` Oleg Nesterov
@ 2013-09-25 18:40                             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-25 18:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt

On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> No. Too tired too ;) damn LSB test failures...


ok; I cobbled this together.. I might think better of it tomorrow, but
for now I think I closed the hole before wait_event(readers_active())
you pointed out -- of course I might have created new holes :/

For easy reading, the +-only version.

---
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader-in-reader recursion */
+	if (current->cpuhp_ref++) {
+		barrier();
+		return;
+	}
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_inc(__cpuhp_refcount);
+	else
+		__get_online_cpus();
+	preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	barrier();
+	if (--current->cpuhp_ref)
+		return;
+
+	preempt_disable();
+	if (likely(!__cpuhp_writer))
+		__this_cpu_dec(__cpuhp_refcount);
+	else
+		__put_online_cpus();
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	int		cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
+++ b/kernel/cpu.c
@@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
+}
 
+void __get_online_cpus(void)
 {
+again:
+	/* See __srcu_read_lock() */
+	__this_cpu_inc(__cpuhp_refcount);
+	smp_mb(); /* A matches B, E */
+	__this_cpu_inc(cpuhp_seq);
+
+	if (unlikely(__cpuhp_writer == readers_block)) {
+		__put_online_cpus();
+
+		atomic_inc(&cpuhp_waitcount);
+
+		/*
+		 * We either call schedule() in the wait, or we'll fall through
+		 * and reschedule on the preempt_enable() in get_online_cpus().
+		 */
+		preempt_enable_no_resched();
+		__wait_event(cpuhp_readers, __cpuhp_writer != readers_block);
+		preempt_disable();
 
+		if (atomic_dec_and_test(&cpuhp_waitcount))
+			wake_up_all(&cpuhp_writer);
+
+		goto again;
+	}
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+	/* See __srcu_read_unlock() */
+	smp_mb(); /* C matches D */
+	this_cpu_dec(__cpuhp_refcount);
+
+	/* Prod writer to recheck readers_active */
+	wake_up_all(&cpuhp_writer);
 }
+EXPORT_SYMBOL_GPL(__put_online_cpus);
 
+#define per_cpu_sum(var)						\
+({ 									\
+ 	typeof(var) __sum = 0;						\
+ 	int cpu;							\
+ 	for_each_possible_cpu(cpu)					\
+ 		__sum += per_cpu(var, cpu);				\
+ 	__sum;								\
+})
+
+/*
+ * See srcu_readers_active_idx_check()
+ */
+static bool cpuhp_readers_active_check(void)
 {
+	unsigned int seq = per_cpu_sum(cpuhp_seq);
+
+	smp_mb(); /* B matches A */
 
+	if (per_cpu_sum(__cpuhp_refcount) != 0)
+		return false;
 
+	smp_mb(); /* D matches C */
 
+	return per_cpu_sum(cpuhp_seq) == seq;
 }
 
 /*
  * This ensures that the hotplug operation can begin only when the
  * refcount goes to zero.
  *
  * Since cpu_hotplug_begin() is always called after invoking
  * cpu_maps_update_begin(), we can be sure that only one writer is active.
  */
 void cpu_hotplug_begin(void)
 {
+	unsigned int count = 0;
+	int cpu;
 
+	lockdep_assert_held(&cpu_add_remove_lock);
+
+	/* allow reader-in-writer recursion */
+	current->cpuhp_ref++;
+
+	/* make readers take the slow path */
+	__cpuhp_writer = readers_slow;
+
+	/* See percpu_down_write() */
+	synchronize_sched();
+
+	/* make readers block */
+	__cpuhp_writer = readers_block;
+
+	smp_mb(); /* E matches A */
+
+	/* Wait for all readers to go away */
+	wait_event(cpuhp_writer, cpuhp_readers_active_check());
 }
 
 void cpu_hotplug_done(void)
 {
+	/* Signal the writer is done, no fast path yet */
+	__cpuhp_writer = readers_slow;
+	wake_up_all(&cpuhp_readers);
+
+	/* See percpu_up_write() */
+	synchronize_sched();
+
+	/* Let 'em rip */
+	__cpuhp_writer = readers_fast;
+	current->cpuhp_ref--;
+
+	/*
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING
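
For reference, a quick sketch of the reader/writer states used above (same
names as in the patch):

	/*
	 * __cpuhp_writer:
	 *   readers_fast  - readers use the per-cpu fast path, no barriers
	 *   readers_slow  - readers go through __get_online_cpus() but do not block
	 *   readers_block - readers sleep in __wait_event(cpuhp_readers, ...)
	 *
	 * cpu_hotplug_begin(): fast -> slow, synchronize_sched(), slow -> block,
	 *                      then wait for cpuhp_readers_active_check()
	 * cpu_hotplug_done():  block -> slow, wake readers, synchronize_sched(),
	 *                      slow -> fast, then wait for cpuhp_waitcount to drain
	 */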

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 18:40                             ` Peter Zijlstra
@ 2013-09-25 21:22                               ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-25 21:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Wed, Sep 25, 2013 at 08:40:15PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> > No. Too tired too ;) damn LSB test failures...
> 
> 
> ok; I cobbled this together.. I might think better of it tomorrow, but
> for now I think I closed the hole before wait_event(readers_active())
> you pointed out -- of course I might have created new holes :/
> 
> For easy reading the + only version.

A couple of nits and some commentary, but if there are races, they are
quite subtle.  ;-)

							Thanx, Paul

> ---
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
> 
>  struct device;
> 
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	/* Support reader-in-reader recursion */
> +	if (current->cpuhp_ref++) {
> +		barrier();

Oleg was right, this barrier() can go.  The value was >=1 and remains
>=1, so reordering causes no harm.  (See below.)

> +		return;
> +	}
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_inc(__cpuhp_refcount);

The barrier required here is provided by synchronize_sched(), and
all the code is contained by the barrier()s in preempt_disable() and
preempt_enable().

> +	else
> +		__get_online_cpus();

And a memory barrier is unconditionally executed by __get_online_cpus().

> +	preempt_enable();

The barrier() in preempt_enable() prevents the compiler from bleeding
the critical section out.

> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	barrier();

This barrier() can also be dispensed with.

> +	if (--current->cpuhp_ref)

If we leave here, the value was >=1 and remains >=1, so reordering does
no harm.

> +		return;
> +
> +	preempt_disable();

The barrier() in preempt_disable() prevents the compiler from bleeding
the critical section out.

> +	if (likely(!__cpuhp_writer))
> +		__this_cpu_dec(__cpuhp_refcount);

The barrier here is supplied by synchronize_sched().

> +	else
> +		__put_online_cpus();

And a memory barrier is unconditionally executed by __put_online_cpus().

> +	preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> @@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else		/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int		cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> +++ b/kernel/cpu.c
> @@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> +enum { readers_fast = 0, readers_slow, readers_block };
> +
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
> +static atomic_t cpuhp_waitcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	p->cpuhp_ref = 0;
> +}
> 
> +void __get_online_cpus(void)
>  {
> +again:
> +	/* See __srcu_read_lock() */
> +	__this_cpu_inc(__cpuhp_refcount);
> +	smp_mb(); /* A matches B, E */
> +	__this_cpu_inc(cpuhp_seq);
> +
> +	if (unlikely(__cpuhp_writer == readers_block)) {
> +		__put_online_cpus();

Suppose we got delayed here for some time.  The writer might complete,
and be awakened by the blocked readers (we have not incremented our
counter yet).  We would then drop through, do the atomic_dec_and_test()
and deliver a spurious wake_up_all() at some random time in the future.

Which should be OK because __wait_event() looks to handle spurious
wake_up()s.
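
For reference, __wait_event() is (roughly, and simplified from the
actual macro) the following loop, which is why an extra wake_up_all()
just costs one more trip around the condition check; wq and condition
stand in for the macro arguments:

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	finish_wait(&wq, &wait);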

> +		atomic_inc(&cpuhp_waitcount);
> +
> +		/*
> +		 * We either call schedule() in the wait, or we'll fall through
> +		 * and reschedule on the preempt_enable() in get_online_cpus().
> +		 */
> +		preempt_enable_no_resched();
> +		__wait_event(cpuhp_readers, __cpuhp_writer != readers_block);
> +		preempt_disable();
> 
> +		if (atomic_dec_and_test(&cpuhp_waitcount))
> +			wake_up_all(&cpuhp_writer);

There can be only one writer, so why the wake_up_all()?

> +
> +		goto again;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> +
> +void __put_online_cpus(void)
> +{
> +	/* See __srcu_read_unlock() */
> +	smp_mb(); /* C matches D */

In other words, if they see our decrement (presumably to aggregate zero,
as that is the only time it matters) they will also see our critical section.

> +	this_cpu_dec(__cpuhp_refcount);
> +
> +	/* Prod writer to recheck readers_active */
> +	wake_up_all(&cpuhp_writer);
>  }
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> 
> +#define per_cpu_sum(var)						\
> +({ 									\
> + 	typeof(var) __sum = 0;						\
> + 	int cpu;							\
> + 	for_each_possible_cpu(cpu)					\
> + 		__sum += per_cpu(var, cpu);				\
> + 	__sum;								\
> +})
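
As an aside, a hypothetical standalone use of the per_cpu_sum() helper
above (foo_count is made up purely for illustration):

	static DEFINE_PER_CPU(unsigned int, foo_count);

	static unsigned int foo_total(void)
	{
		/* Sums the counter over all possible CPUs; the sum itself
		 * is not taken atomically. */
		return per_cpu_sum(foo_count);
	}
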
> +
> +/*
> + * See srcu_readers_active_idx_check()
> + */
> +static bool cpuhp_readers_active_check(void)
>  {
> +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> +
> +	smp_mb(); /* B matches A */

In other words, if we see __get_online_cpus() cpuhp_seq increment, we
are guaranteed to also see its __cpuhp_refcount increment.

> +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> +		return false;
> 
> +	smp_mb(); /* D matches C */
> 
> +	return per_cpu_sum(cpuhp_seq) == seq;

On equality, we know that there could not be any "sneak path" pairs
where we see a decrement but not the corresponding increment for a
given reader.  If we saw its decrement, the memory barriers guarantee
that we now see its cpuhp_seq increment.

>  }
> 
>  /*
>   * This ensures that the hotplug operation can begin only when the
>   * refcount goes to zero.
>   *
>   * Since cpu_hotplug_begin() is always called after invoking
>   * cpu_maps_update_begin(), we can be sure that only one writer is active.
>   */
>  void cpu_hotplug_begin(void)
>  {
> +	unsigned int count = 0;
> +	int cpu;
> 
> +	lockdep_assert_held(&cpu_add_remove_lock);
> +
> +	/* allow reader-in-writer recursion */
> +	current->cpuhp_ref++;
> +
> +	/* make readers take the slow path */
> +	__cpuhp_writer = readers_slow;
> +
> +	/* See percpu_down_write() */
> +	synchronize_sched();

At this point, we know that all readers take the slow path.

> +	/* make readers block */
> +	__cpuhp_writer = readers_block;
> +
> +	smp_mb(); /* E matches A */

If they don't see our write of readers_block to __cpuhp_writer, then
we are guaranteed to see their __cpuhp_refcount increment, and therefore
will wait for them.

> +	/* Wait for all readers to go away */
> +	wait_event(cpuhp_writer, cpuhp_readers_active_check());
>  }
> 
>  void cpu_hotplug_done(void)
>  {
> +	/* Signal the writer is done, no fast path yet */
> +	__cpuhp_writer = readers_slow;
> +	wake_up_all(&cpuhp_readers);

OK, the wait_event()/wake_up_all() prevents the races where the
readers are delayed between fetching __cpuhp_writer and blocking.

> +	/* See percpu_up_write() */
> +	synchronize_sched();

At this point, readers no longer attempt to block.

You avoid falling into the usual acquire-release-mismatch trap by using
__cpuhp_refcount on both the fastpath and the slowpath, so that it is OK
to acquire on the fastpath and release on the slowpath (and vice versa).
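
To make that concrete, a hypothetical reader timeline (illustrative
only) where the lock takes the fast path and the unlock the slow path;
both paths account into __cpuhp_refcount, so nothing is lost:

	static void mixed_path_reader(void)	/* made-up example */
	{
		get_online_cpus();	/* no writer yet: __this_cpu_inc(__cpuhp_refcount) */

		/* ... cpu_hotplug_begin() starts on another CPU ... */

		put_online_cpus();	/* slow path: __put_online_cpus() does smp_mb(),
					 * this_cpu_dec(), wake_up_all(&cpuhp_writer) */
	}

The decrement may even land on a different CPU than the increment; the
writer only ever looks at the per-CPU sum, so that is fine too.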

> +	/* Let 'em rip */
> +	__cpuhp_writer = readers_fast;
> +	current->cpuhp_ref--;
> +
> +	/*
> +	 * Wait for any pending readers to be running. This ensures readers
> +	 * after writer and avoids writers starving readers.
> +	 */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }
> 
>  /*
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-25 21:22                               ` Paul E. McKenney
@ 2013-09-26 11:10                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-26 11:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> A couple of nits and some commentary, but if there are races, they are
> quite subtle.  ;-)

*whee*..

I made one little change in the logic; I moved the waitcount increment
to before the __put_online_cpus() call, such that the writer will have
to wait for us to wake up before trying again -- not for us to actually
have acquired the read lock; for that we'd need to mess up
__get_online_cpus() a bit more.
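
In other words the slow-path entry now orders things like so (shorthand
for the hunk in the patch below):

	atomic_inc(&cpuhp_waitcount);	/* writer can see someone is waiting... */
	__put_online_cpus();		/* ...before our reference disappears */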

Complete patch below.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation of get_online_cpus() is global in nature
and thus not suited for any kind of common usage.

Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.

The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.

However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.

Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion.

Many comments were contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting in poking holes in my attempts.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cpu.h   |   58 +++++++++++++
 include/linux/sched.h |    3 
 kernel/cpu.c          |  209 +++++++++++++++++++++++++++++++++++---------------
 kernel/sched/core.c   |    2 
 4 files changed, 208 insertions(+), 64 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
+#include <linux/percpu.h>
 
 struct device;
 
@@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
 #ifdef CONFIG_HOTPLUG_CPU
 /* Stop CPUs going up and down. */
 
+extern void cpu_hotplug_init_task(struct task_struct *p);
+
 extern void cpu_hotplug_begin(void);
 extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+	might_sleep();
+
+	/* Support reader recursion */
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (current->cpuhp_ref++)
+		return;
+
+	preempt_disable();
+	if (likely(!__cpuhp_state)) {
+		/* The barrier here is supplied by synchronize_sched(). */
+		__this_cpu_inc(__cpuhp_refcount);
+	} else {
+		__get_online_cpus(); /* Unconditional memory barrier. */
+	}
+	preempt_enable();
+	/*
+	 * The barrier() from preempt_enable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+	/* The value was >= 1 and remains so, reordering causes no harm. */
+	if (--current->cpuhp_ref)
+		return;
+
+	/*
+	 * The barrier() in preempt_disable() prevents the compiler from
+	 * bleeding the critical section out.
+	 */
+	preempt_disable();
+	if (likely(!__cpuhp_state)) {
+		/* The barrier here is supplied by synchronize_sched().  */
+		__this_cpu_dec(__cpuhp_refcount);
+	} else {
+		__put_online_cpus(); /* Unconditional memory barrier. */
+	}
+	preempt_enable();
+}
+
 extern void cpu_hotplug_disable(void);
 extern void cpu_hotplug_enable(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
@@ -200,6 +252,8 @@ static inline void cpu_hotplug_driver_un
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
 static inline void cpu_hotplug_begin(void) {}
 static inline void cpu_hotplug_done(void) {}
 #define get_online_cpus()	do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	int		cpuhp_ref;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,173 @@ static int cpu_hotplug_disabled;
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static struct {
-	struct task_struct *active_writer;
-	struct mutex lock; /* Synchronizes accesses to refcount, */
-	/*
-	 * Also blocks the new readers during
-	 * an ongoing cpu hotplug operation.
-	 */
-	int refcount;
-} cpu_hotplug = {
-	.active_writer = NULL,
-	.lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
-	.refcount = 0,
-};
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_state;
+EXPORT_SYMBOL_GPL(__cpuhp_state);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+	p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+again:
+	/* See __srcu_read_lock() */
+	__this_cpu_inc(__cpuhp_refcount);
+	smp_mb(); /* A matches B, E */
+	__this_cpu_inc(cpuhp_seq);
+
+	if (unlikely(__cpuhp_state == readers_block)) {
+		/*
+		 * Make sure an outgoing writer sees the waitcount to ensure
+		 * we make progress.
+		 */
+		atomic_inc(&cpuhp_waitcount);
+		__put_online_cpus();
+
+		/*
+		 * We either call schedule() in the wait, or we'll fall through
+		 * and reschedule on the preempt_enable() in get_online_cpus().
+		 */
+		preempt_enable_no_resched();
+		__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+		preempt_disable();
+
+		if (atomic_dec_and_test(&cpuhp_waitcount))
+			wake_up_all(&cpuhp_writer);
+
+		goto again;
+	}
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
 
-void get_online_cpus(void)
+void __put_online_cpus(void)
 {
-	might_sleep();
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
-	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* See __srcu_read_unlock() */
+	smp_mb(); /* C matches D */
+	/*
+	 * In other words, if they see our decrement (presumably to aggregate
+	 * zero, as that is the only time it matters) they will also see our
+	 * critical section.
+	 */
+	this_cpu_dec(__cpuhp_refcount);
 
+	/* Prod writer to recheck readers_active */
+	wake_up_all(&cpuhp_writer);
 }
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
+
+#define per_cpu_sum(var)						\
+({ 									\
+ 	typeof(var) __sum = 0;						\
+ 	int cpu;							\
+ 	for_each_possible_cpu(cpu)					\
+ 		__sum += per_cpu(var, cpu);				\
+ 	__sum;								\
+})
 
-void put_online_cpus(void)
+/*
+ * See srcu_readers_active_idx_check() for a rather more detailed explanation.
+ */
+static bool cpuhp_readers_active_check(void)
 {
-	if (cpu_hotplug.active_writer == current)
-		return;
-	mutex_lock(&cpu_hotplug.lock);
+	unsigned int seq = per_cpu_sum(cpuhp_seq);
+
+	smp_mb(); /* B matches A */
+
+	/*
+	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
+	 * we are guaranteed to also see its __cpuhp_refcount increment.
+	 */
 
-	if (WARN_ON(!cpu_hotplug.refcount))
-		cpu_hotplug.refcount++; /* try to fix things up */
+	if (per_cpu_sum(__cpuhp_refcount) != 0)
+		return false;
 
-	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
-		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	smp_mb(); /* D matches C */
 
+	/*
+	 * On equality, we know that there could not be any "sneak path" pairs
+	 * where we see a decrement but not the corresponding increment for a
+	 * given reader. If we saw its decrement, the memory barriers guarantee
+	 * that we now see its cpuhp_seq increment.
+	 */
+
+	return per_cpu_sum(cpuhp_seq) == seq;
 }
-EXPORT_SYMBOL_GPL(put_online_cpus);
 
 /*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- *   writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- *   non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
+ * This will notify new readers to block and wait for all active readers to
+ * complete.
  */
 void cpu_hotplug_begin(void)
 {
-	cpu_hotplug.active_writer = current;
+	/*
+	 * Since cpu_hotplug_begin() is always called after invoking
+	 * cpu_maps_update_begin(), we can be sure that only one writer is
+	 * active.
+	 */
+	lockdep_assert_held(&cpu_add_remove_lock);
 
-	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
-		if (likely(!cpu_hotplug.refcount))
-			break;
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
-		schedule();
-	}
+	/* Allow reader-in-writer recursion. */
+	current->cpuhp_ref++;
+
+	/* Notify readers to take the slow path. */
+	__cpuhp_state = readers_slow;
+
+	/* See percpu_down_write(); guarantees all readers take the slow path */
+	synchronize_sched();
+
+	/*
+	 * Notify new readers to block; up until now, and thus throughout the
+	 * longish synchronize_sched() above, new readers could still come in.
+	 */
+	__cpuhp_state = readers_block;
+
+	smp_mb(); /* E matches A */
+
+	/*
+	 * If they don't see our write of readers_block to __cpuhp_state,
+	 * then we are guaranteed to see their __cpuhp_refcount increment, and
+	 * therefore will wait for them.
+	 */
+
+	/* Wait for all now active readers to complete. */
+	wait_event(cpuhp_writer, cpuhp_readers_active_check());
 }
 
 void cpu_hotplug_done(void)
 {
-	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	/* Signal the writer is done, no fast path yet. */
+	__cpuhp_state = readers_slow;
+	wake_up_all(&cpuhp_readers);
+
+	/*
+	 * The wait_event()/wake_up_all() prevents the race where the readers
+	 * are delayed between fetching __cpuhp_state and blocking.
+	 */
+
+	/* See percpu_up_write(); readers will no longer attempt to block. */
+	synchronize_sched();
+
+	/* Let 'em rip */
+	__cpuhp_state = readers_fast;
+	current->cpuhp_ref--;
+
+	/*
+	 * Wait for any pending readers to be running. This ensures readers
+	 * after writer and avoids writers starving readers.
+	 */
+	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+
+	cpu_hotplug_init_task(p);
 }
 
 #ifdef CONFIG_NUMA_BALANCING

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
       [not found]                                 ` <20130926155321.GA4342@redhat.com>
@ 2013-09-26 16:13                                     ` Peter Zijlstra
  0 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-26 16:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >  void cpu_hotplug_done(void)
> >  {
> > -	cpu_hotplug.active_writer = NULL;
> > -	mutex_unlock(&cpu_hotplug.lock);
> > +	/* Signal the writer is done, no fast path yet. */
> > +	__cpuhp_state = readers_slow;
> > +	wake_up_all(&cpuhp_readers);
> > +
> > +	/*
> > +	 * The wait_event()/wake_up_all() prevents the race where the readers
> > +	 * are delayed between fetching __cpuhp_state and blocking.
> > +	 */
> > +
> > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > +	synchronize_sched();
> 
> Shouldn't you move wake_up_all(&cpuhp_readers) down after
> synchronize_sched() (or add another one) ? To ensure that a reader can't
> see state = BLOCK after wakeup().

Well, if they are blocked, the wake_up_all() will do an actual
try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

The woken task will get a MB from passing through the context switch to
make it actually run. And therefore; like Paul's comment says; it cannot
observe the previous BLOCK state but must indeed see the just issued
SLOW state.

Right?

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 16:13                                     ` Peter Zijlstra
@ 2013-09-26 16:14                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-26 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/26, Peter Zijlstra wrote:
>
> On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:
> > >  void cpu_hotplug_done(void)
> > >  {
> > > -	cpu_hotplug.active_writer = NULL;
> > > -	mutex_unlock(&cpu_hotplug.lock);
> > > +	/* Signal the writer is done, no fast path yet. */
> > > +	__cpuhp_state = readers_slow;
> > > +	wake_up_all(&cpuhp_readers);
> > > +
> > > +	/*
> > > +	 * The wait_event()/wake_up_all() prevents the race where the readers
> > > +	 * are delayed between fetching __cpuhp_state and blocking.
> > > +	 */
> > > +
> > > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > > +	synchronize_sched();
> >
> > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > see state = BLOCK after wakeup().
>
> Well, if they are blocked, the wake_up_all() will do an actual
> try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

Yes. Everything is fine with the already blocked readers.

I meant the new reader, which can still see state = BLOCK after we
do wakeup(); but I didn't notice that it goes through __wait_event(),
which takes the lock unconditionally, so it must see the change after that.

> Right?

Yes, I was wrong, thanks.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 16:14                                       ` Oleg Nesterov
@ 2013-09-26 16:40                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-26 16:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Thu, Sep 26, 2013 at 06:14:26PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> > > > -	cpu_hotplug.active_writer = NULL;
> > > > -	mutex_unlock(&cpu_hotplug.lock);
> > > > +	/* Signal the writer is done, no fast path yet. */
> > > > +	__cpuhp_state = readers_slow;
> > > > +	wake_up_all(&cpuhp_readers);
> > > > +
> > > > +	/*
> > > > +	 * The wait_event()/wake_up_all() prevents the race where the readers
> > > > +	 * are delayed between fetching __cpuhp_state and blocking.
> > > > +	 */
> > > > +
> > > > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > > > +	synchronize_sched();
> > >
> > > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > > see state = BLOCK after wakeup().
> >
> > Well, if they are blocked, the wake_up_all() will do an actual
> > try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().
> 
> Yes. Everything is fine with the already blocked readers.
> 
> I meant the new reader, which can still see state = BLOCK after we
> do wakeup(); but I didn't notice that it goes through __wait_event(),
> which takes the lock unconditionally, so it must see the change after that.

Ah, because both __wake_up() and __wait_event()->prepare_to_wait() take
q->lock. Thereby matching the __wake_up() RELEASE to the __wait_event()
ACQUIRE, creating the full barrier.
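
A userspace analogue of that pairing, purely for illustration (plain
pthreads instead of the waitqueue code; q_lock/q_cond/state are made
up): anything the waker stores before releasing the lock is visible to
the waiter once it has taken the same lock and re-checked the
condition.

	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
	static int state;	/* 0 ~ readers_block, 1 ~ readers_slow */

	static void *waker(void *arg)
	{
		pthread_mutex_lock(&q_lock);
		state = 1;			/* store made before the wakeup...  */
		pthread_cond_broadcast(&q_cond);
		pthread_mutex_unlock(&q_lock);	/* ...published by the RELEASE here */
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, waker, NULL);

		pthread_mutex_lock(&q_lock);	/* ACQUIRE pairs with the unlock above */
		while (!state)			/* re-check under the lock, as __wait_event() does */
			pthread_cond_wait(&q_cond, &q_lock);
		pthread_mutex_unlock(&q_lock);

		pthread_join(t, NULL);
		printf("observed state = %d\n", state);
		return 0;
	}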


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 11:10                                 ` Peter Zijlstra
@ 2013-09-26 16:58                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-26 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

Peter,

Sorry. Unlikely I will be able to read this patch today. So let me
ask another potentially wrong question without any thinking.

On 09/26, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
> +{
> +again:
> +	/* See __srcu_read_lock() */
> +	__this_cpu_inc(__cpuhp_refcount);
> +	smp_mb(); /* A matches B, E */
> +	__this_cpu_inc(cpuhp_seq);
> +
> +	if (unlikely(__cpuhp_state == readers_block)) {

OK. Either we should see state = BLOCK or the writer should notice the
change in __cpuhp_refcount/seq. (although I'd like to recheck this
cpuhp_seq logic ;)

> +		atomic_inc(&cpuhp_waitcount);
> +		__put_online_cpus();

OK, this does wake(cpuhp_writer).

>  void cpu_hotplug_begin(void)
>  {
> ...
> +	/*
> +	 * Notify new readers to block; up until now, and thus throughout the
> +	 * longish synchronize_sched() above, new readers could still come in.
> +	 */
> +	__cpuhp_state = readers_block;
> +
> +	smp_mb(); /* E matches A */
> +
> +	/*
> +	 * If they don't see our write of readers_block to __cpuhp_state,
> +	 * then we are guaranteed to see their __cpuhp_refcount increment, and
> +	 * therefore will wait for them.
> +	 */
> +
> +	/* Wait for all now active readers to complete. */
> +	wait_event(cpuhp_writer, cpuhp_readers_active_check());

But. doesn't this mean that we need __wait_event() here as well?

Isn't it possible that the reader sees BLOCK but the writer does _not_
see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
"either", not "both".

Don't we need to ensure that we can't check cpuhp_readers_active_check()
after wake(cpuhp_writer) was already called by the reader and before we
take the same lock?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 16:58                                   ` Oleg Nesterov
@ 2013-09-26 17:50                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-26 17:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Thu, Sep 26, 2013 at 06:58:40PM +0200, Oleg Nesterov wrote:
> Peter,
> 
> Sorry. Unlikely I will be able to read this patch today. So let me
> ask another potentially wrong question without any thinking.
> 
> On 09/26, Peter Zijlstra wrote:
> >
> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +	/* See __srcu_read_lock() */
> > +	__this_cpu_inc(__cpuhp_refcount);
> > +	smp_mb(); /* A matches B, E */
> > +	__this_cpu_inc(cpuhp_seq);
> > +
> > +	if (unlikely(__cpuhp_state == readers_block)) {
> 
> OK. Either we should see state = BLOCK or the writer should notice the
> change in __cpuhp_refcount/seq. (although I'd like to recheck this
> cpuhp_seq logic ;)
> 
> > +		atomic_inc(&cpuhp_waitcount);
> > +		__put_online_cpus();
> 
> OK, this does wake(cpuhp_writer).
> 
> >  void cpu_hotplug_begin(void)
> >  {
> > ...
> > +	/*
> > +	 * Notify new readers to block; up until now, and thus throughout the
> > +	 * longish synchronize_sched() above, new readers could still come in.
> > +	 */
> > +	__cpuhp_state = readers_block;
> > +
> > +	smp_mb(); /* E matches A */
> > +
> > +	/*
> > +	 * If they don't see our write of readers_block to __cpuhp_state,
> > +	 * then we are guaranteed to see their __cpuhp_refcount increment, and
> > +	 * therefore will wait for them.
> > +	 */
> > +
> > +	/* Wait for all now active readers to complete. */
> > +	wait_event(cpuhp_writer, cpuhp_readers_active_check());
> 
> But. doesn't this mean that we need __wait_event() here as well?
> 
> Isn't it possible that the reader sees BLOCK but the writer does _not_
> see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
> "either", not "both".

But if the reader does see BLOCK it will no longer be an active reader,
and thus the writer doesn't need to observe and wait for it.

> Don't we need to ensure that we can't check cpuhp_readers_active_check()
> after wake(cpuhp_writer) was already called by the reader and before we
> take the same lock?

I'm too tired to fully grasp what you're asking here; but given the
previous answer I think not.
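
For concreteness, the "either/or" guarantee in play here is the classic
store-buffering pattern: the reader does ref++; smp_mb(); load state, the
writer does state = readers_block; smp_mb(); read the refcounts, and the two
full barriers forbid the outcome where *neither* side sees the other's store.
A minimal userspace C11 model of just that pattern -- the names, the single
counter and the iteration count are illustrative stand-ins, not the patch's
code:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <assert.h>
  #include <stdio.h>

  static _Atomic int ref, state;  /* stand-ins for __cpuhp_refcount / __cpuhp_state */
  static int r_state, w_ref;      /* what each side observed */

  static void *reader(void *arg)
  {
          (void)arg;
          atomic_store_explicit(&ref, 1, memory_order_relaxed);   /* "ref++" */
          atomic_thread_fence(memory_order_seq_cst);              /* smp_mb(); A */
          r_state = atomic_load_explicit(&state, memory_order_relaxed);
          return NULL;
  }

  static void *writer(void *arg)
  {
          (void)arg;
          atomic_store_explicit(&state, 1, memory_order_relaxed); /* state = readers_block */
          atomic_thread_fence(memory_order_seq_cst);              /* smp_mb(); E */
          w_ref = atomic_load_explicit(&ref, memory_order_relaxed);
          return NULL;
  }

  int main(void)
  {
          for (int i = 0; i < 100000; i++) {
                  pthread_t r, w;

                  atomic_store(&ref, 0);
                  atomic_store(&state, 0);
                  pthread_create(&r, NULL, reader, NULL);
                  pthread_create(&w, NULL, writer, NULL);
                  pthread_join(r, NULL);
                  pthread_join(w, NULL);
                  /* Forbidden: reader missed BLOCK *and* writer missed the ref. */
                  assert(r_state == 1 || w_ref == 1);
          }
          puts("no iteration where both sides missed each other");
          return 0;
  }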


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 17:50                                     ` Peter Zijlstra
@ 2013-09-27 18:15                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-27 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/26, Peter Zijlstra wrote:
>
> But if the reader does see BLOCK it will not be an active reader any
> more; and thus the writer doesn't need to observe and wait for it.

I meant they both can block, but please ignore. Today I simply can't
understand what I was thinking about yesterday.


I tried hard to find any hole in this version but failed, I believe it
is correct.

But, could you help me to understand some details?

> +void __get_online_cpus(void)
> +{
> +again:
> +	/* See __srcu_read_lock() */
> +	__this_cpu_inc(__cpuhp_refcount);
> +	smp_mb(); /* A matches B, E */
> +	__this_cpu_inc(cpuhp_seq);
> +
> +	if (unlikely(__cpuhp_state == readers_block)) {

Note that there is no barrier() after inc(seq) and __cpuhp_state
check, this inc() can be "postponed" till ...

> +void __put_online_cpus(void)
>  {
> -	might_sleep();
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> -	cpu_hotplug.refcount++;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* See __srcu_read_unlock() */
> +	smp_mb(); /* C matches D */

... this mb() in __put_online_cpus().

And this is fine! The question is, perhaps it would be more "natural"
and understandable to shift this_cpu_inc(cpuhp_seq) into
__put_online_cpus().

We need to ensure 2 things:

1. The reader should notice state = BLOCK or the writer should see
   inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
   __get_online_cpus() and in cpu_hotplug_begin().

   We do not care if the writer misses some inc(__cpuhp_refcount)
   in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
   state = readers_block (and inc(cpuhp_seq) can't help anyway).

2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
   from __put_online_cpus() (note that the writer can miss the
   corresponding inc() if it was done on another CPU, so this dec()
   can lead to sum() == 0), it should also notice the change in cpuhp_seq.

   Fortunately, this can only happen if the reader migrates, in
   this case schedule() provides a barrier, the writer can't miss
   the change in cpuhp_seq.

IOW. Unless I missed something, cpuhp_seq is actually needed to
serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
the /* D matches C */ in cpuhp_readers_active_check(), and this
is not immediately clear if you look at __get_online_cpus().

I do not suggest changing this code, but please tell me if my
understanding is not correct.
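
Both points lean on per_cpu_sum() being a plain, non-atomic fold over the
per-CPU slots, read one CPU at a time; nothing freezes the counters while the
writer walks them, which is why the barriers and cpuhp_seq have to do all the
work.  The helper's definition is not shown in the hunks quoted here; a purely
illustrative userspace stand-in (fixed CPU count, the model_ prefixed names
are made up) might look like:

  #include <stdatomic.h>
  #include <stdio.h>

  #define NR_MODEL_CPUS 4         /* illustrative; the real code walks all possible CPUs */

  /* one slot per "CPU", mirroring a DEFINE_PER_CPU counter */
  static _Atomic int model_refcount[NR_MODEL_CPUS];

  /* Fold the slots one at a time; the counters can change under us while we
   * walk the array, and only the ordering rules discussed above make the
   * result meaningful. */
  static int model_per_cpu_sum(_Atomic int *slot)
  {
          int sum = 0;

          for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++)
                  sum += atomic_load_explicit(&slot[cpu], memory_order_relaxed);
          return sum;
  }

  int main(void)
  {
          /* A reader that locks on CPU 0, migrates, and unlocks on CPU 1
           * leaves the individual slots unbalanced; only the sum is zero. */
          atomic_fetch_add(&model_refcount[0], 1);        /* __get_online_cpus() on CPU 0 */
          atomic_fetch_sub(&model_refcount[1], 1);        /* __put_online_cpus() on CPU 1 */

          printf("slot0=%d slot1=%d sum=%d\n",
                 atomic_load(&model_refcount[0]),
                 atomic_load(&model_refcount[1]),
                 model_per_cpu_sum(model_refcount));      /* prints 1 -1 0 */
          return 0;
  }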

> +static bool cpuhp_readers_active_check(void)
>  {
> -	if (cpu_hotplug.active_writer == current)
> -		return;
> -	mutex_lock(&cpu_hotplug.lock);
> +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> +
> +	smp_mb(); /* B matches A */
> +
> +	/*
> +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> +	 */
>  
> -	if (WARN_ON(!cpu_hotplug.refcount))
> -		cpu_hotplug.refcount++; /* try to fix things up */
> +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> +		return false;
>  
> -	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> -		wake_up_process(cpu_hotplug.active_writer);
> -	mutex_unlock(&cpu_hotplug.lock);
> +	smp_mb(); /* D matches C */

It seems that both barriers could be smp_rmb() ? I am not sure the comments
from srcu_readers_active_idx_check() can explain mb(), note that
__srcu_read_lock() always succeeds unlike get_cpus_online().

>  void cpu_hotplug_done(void)
>  {
> -	cpu_hotplug.active_writer = NULL;
> -	mutex_unlock(&cpu_hotplug.lock);
> +	/* Signal the writer is done, no fast path yet. */
> +	__cpuhp_state = readers_slow;
> +	wake_up_all(&cpuhp_readers);
> +
> +	/*
> +	 * The wait_event()/wake_up_all() prevents the race where the readers
> +	 * are delayed between fetching __cpuhp_state and blocking.
> +	 */
> +
> +	/* See percpu_up_write(); readers will no longer attempt to block. */
> +	synchronize_sched();
> +
> +	/* Let 'em rip */
> +	__cpuhp_state = readers_fast;
> +	current->cpuhp_ref--;
> +
> +	/*
> +	 * Wait for any pending readers to be running. This ensures readers
> +	 * after writer and avoids writers starving readers.
> +	 */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }

OK, to some degree I can understand "avoids writers starving readers"
part (although the next writer should do synchronize_sched() first),
but could you explain "ensures readers after writer" ?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-27 18:15                                       ` Oleg Nesterov
@ 2013-09-27 20:41                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-27 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the reader does see BLOCK it will not be an active reader any
> > more; and thus the writer doesn't need to observe and wait for it.
> 
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.

I think we all know that state all too well ;-)

> I tried hard to find any hole in this version but failed, I believe it
> is correct.

Yay!

> But, could you help me to understand some details?

I'll try, but I'm not too bright atm myself :-)

> > +void __get_online_cpus(void)
> > +{
> > +again:
> > +	/* See __srcu_read_lock() */
> > +	__this_cpu_inc(__cpuhp_refcount);
> > +	smp_mb(); /* A matches B, E */
> > +	__this_cpu_inc(cpuhp_seq);
> > +
> > +	if (unlikely(__cpuhp_state == readers_block)) {
> 
> Note that there is no barrier() after inc(seq) and __cpuhp_state
> check, this inc() can be "postponed" till ...
> 
> > +void __put_online_cpus(void)
> >  {
> > +	/* See __srcu_read_unlock() */
> > +	smp_mb(); /* C matches D */
> 
> ... this mb() in __put_online_cpus().
> 
> And this is fine! The question is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().

Possibly; I never got further than that the required order is:

  ref++
  MB
  seq++
  MB
  ref--

It doesn't matter if the seq++ is in the lock or unlock primitive. I
never considered one place more natural than the other.
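
Spelled out as code, with a single counter slot instead of per-CPU variables
and purely illustrative names, a userspace C11 sketch of that required order
is simply:

  #include <stdatomic.h>

  static _Atomic unsigned long model_ref; /* ~ this CPU's __cpuhp_refcount slot */
  static _Atomic unsigned long model_seq; /* ~ this CPU's cpuhp_seq slot        */

  static void model_read_lock(void)
  {
          atomic_fetch_add_explicit(&model_ref, 1, memory_order_relaxed); /* ref++  */
          atomic_thread_fence(memory_order_seq_cst);                      /* MB (A) */
          atomic_fetch_add_explicit(&model_seq, 1, memory_order_relaxed); /* seq++  */
  }

  static void model_read_unlock(void)
  {
          atomic_thread_fence(memory_order_seq_cst);                      /* MB (C) */
          atomic_fetch_sub_explicit(&model_ref, 1, memory_order_relaxed); /* ref--  */
  }

  int main(void)
  {
          model_read_lock();      /* would bracket a get_online_cpus() section */
          model_read_unlock();
          return 0;
  }

Moving the seq++ out of model_read_lock() and into model_read_unlock() ahead
of the fence there -- which is roughly what Oleg suggests -- leaves the same
ref++, MB, seq++, MB, ref-- order in place.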

> We need to ensure 2 things:
> 
> 1. The reader should notice state = BLOCK or the writer should see
>    inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
>    __get_online_cpus() and in cpu_hotplug_begin().
> 
>    We do not care if the writer misses some inc(__cpuhp_refcount)
>    in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
>    state = readers_block (and inc(cpuhp_seq) can't help anyway).

Agreed.

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>    from __put_online_cpus() (note that the writer can miss the
>    corresponding inc() if it was done on another CPU, so this dec()
>    can lead to sum() == 0), it should also notice the change in cpuhp_seq.
> 
>    Fortunately, this can only happen if the reader migrates, in
>    this case schedule() provides a barrier, the writer can't miss
>    the change in cpuhp_seq.

Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.

> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> the /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
> 
> I do not suggest changing this code, but please tell me if my
> understanding is not correct.

I think you're entirely right.

> > +static bool cpuhp_readers_active_check(void)
> >  {
> > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > +	smp_mb(); /* B matches A */
> > +
> > +	/*
> > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > +	 */
> >  
> > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > +		return false;
> >  
> > +	smp_mb(); /* D matches C */
> 
> It seems that both barriers could be smp_rmb() ? I am not sure the comments
> from srcu_readers_active_idx_check() can explain mb(), note that
> __srcu_read_lock() always succeeds unlike get_cpus_online().

I see what you mean; cpuhp_readers_active_check() is all purely reads;
there are no writes to order.

Paul; is there any argument for the MB here as opposed to RMB; and if
not should we change both these and SRCU?

> >  void cpu_hotplug_done(void)
> >  {
> > +	/* Signal the writer is done, no fast path yet. */
> > +	__cpuhp_state = readers_slow;
> > +	wake_up_all(&cpuhp_readers);
> > +
> > +	/*
> > +	 * The wait_event()/wake_up_all() prevents the race where the readers
> > +	 * are delayed between fetching __cpuhp_state and blocking.
> > +	 */
> > +
> > +	/* See percpu_up_write(); readers will no longer attempt to block. */
> > +	synchronize_sched();
> > +
> > +	/* Let 'em rip */
> > +	__cpuhp_state = readers_fast;
> > +	current->cpuhp_ref--;
> > +
> > +	/*
> > +	 * Wait for any pending readers to be running. This ensures readers
> > +	 * after writer and avoids writers starving readers.
> > +	 */
> > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> >  }
> 
> OK, to some degree I can understand "avoids writers starving readers"
> part (although the next writer should do synchronize_sched() first),
> but could you explain "ensures readers after writer" ?

Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers. If for some
reason A doesn't schedule to inc ref until B again executes
cpu_hotplug_begin() and state is once again BLOCK, A will not have made
any progress.

The waitcount increment before __put_online_cpus() ensures
cpu_hotplug_done() sees the !0 waitcount and waits until our reader runs
far enough to at least pass the dec_and_test().

And once past the dec_and_test(), preemption is disabled and the
synchronize_sched() in a new cpu_hotplug_begin() will suffice to guarantee
we'll have acquired a reference and are an active reader.
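
A userspace sketch of that handshake, with a mutex and condition variables
standing in for the wait_event()/wake_up() machinery; the names echo the
patch (waitcount, readers_block, ...) but the bodies are illustrative only,
and the reference counting itself is elided:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  enum { readers_fast, readers_slow, readers_block };

  static _Atomic int state = readers_block;       /* pretend a writer is in progress */
  static _Atomic int waitcount;                   /* ~ cpuhp_waitcount */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t readers = PTHREAD_COND_INITIALIZER;       /* ~ cpuhp_readers */
  static pthread_cond_t writer  = PTHREAD_COND_INITIALIZER;       /* ~ cpuhp_writer  */

  /* Slow path of a reader that saw readers_block: count itself, drop its
   * reference (elided), sleep until the writer is done, then carry on. */
  static void *blocked_reader(void *arg)
  {
          (void)arg;
          atomic_fetch_add(&waitcount, 1);

          pthread_mutex_lock(&lock);
          while (atomic_load(&state) == readers_block)
                  pthread_cond_wait(&readers, &lock);
          pthread_mutex_unlock(&lock);

          /* ...would now re-take the reference and run as an active reader... */

          if (atomic_fetch_sub(&waitcount, 1) == 1) {
                  pthread_mutex_lock(&lock);
                  pthread_cond_signal(&writer);   /* last pending reader: release the writer */
                  pthread_mutex_unlock(&lock);
          }
          return NULL;
  }

  /* Tail of cpu_hotplug_done(): unblock the readers, then wait for every
   * pending reader to get going, so a back-to-back writer cannot starve them. */
  static void model_hotplug_done(void)
  {
          pthread_mutex_lock(&lock);
          atomic_store(&state, readers_slow);
          pthread_cond_broadcast(&readers);       /* wake_up_all(&cpuhp_readers) */
          pthread_mutex_unlock(&lock);

          /* (the real code does a synchronize_sched() here) */
          atomic_store(&state, readers_fast);

          pthread_mutex_lock(&lock);              /* wait_event(cpuhp_writer, !waitcount) */
          while (atomic_load(&waitcount) != 0)
                  pthread_cond_wait(&writer, &lock);
          pthread_mutex_unlock(&lock);
  }

  int main(void)
  {
          pthread_t t;

          pthread_create(&t, NULL, blocked_reader, NULL);
          model_hotplug_done();
          pthread_join(t, NULL);
          puts("model handshake complete");
          return 0;
  }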



^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-27 20:41                                         ` Peter Zijlstra
@ 2013-09-28 12:48                                           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-28 12:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/27, Peter Zijlstra wrote:
>
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
>
> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > +	smp_mb(); /* B matches A */
> > > +
> > > +	/*
> > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +	 */
> > >
> > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > +		return false;
> > >
> > > +	smp_mb(); /* D matches C */
> >
> > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(),

To avoid the confusion, I meant "those comments can't explain mb()s here,
in cpuhp_readers_active_check()".

> > note that
> > __srcu_read_lock() always succeeds unlike get_cpus_online().

And this is where cpu-hotplug and synchronize_srcu() differ, see below.

> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
>
> Paul; is there any argument for the MB here as opposed to RMB;

Yes, Paul, please ;)

> and if
> not should we change both these and SRCU?

I guess that SRCU is more "complex" in this respect. IIUC,
cpuhp_readers_active_check() needs "more" barriers because if
synchronize_srcu() succeeds it needs to synchronize with the new readers
which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
srcu never blocks the readers, srcu_read_*() always succeeds.



Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
->c and ->seq. A plain this_cpu_inc() should be fine?

And since it disables preemption, why it can't use __this_cpu_inc() to inc
->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does srcu_read_lock()
it should also do _unlock.

But this is minor/offtopic.

> > >  void cpu_hotplug_done(void)
> > >  {
...
> > > +	/*
> > > +	 * Wait for any pending readers to be running. This ensures readers
> > > +	 * after writer and avoids writers starving readers.
> > > +	 */
> > > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > >  }
> >
> > OK, to some degree I can understand "avoids writers starving readers"
> > part (although the next writer should do synchronize_sched() first),
> > but could you explain "ensures readers after writer" ?
>
> Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> does cpu_hotplug_done() and wakes all pending readers. If for some
> reason A doesn't schedule to inc ref until B again executes
> cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> any progress.

Yes, yes, thanks, this is clear. But this explains "writers starving readers".
And let me repeat, if B again executes cpu_hotplug_begin() it will do
another synchronize_sched() before it sets BLOCK, so I am not sure we
need this "in practice".

I was confused by "ensures readers after writer", I thought this means
we need the additional synchronization with the readers which are going
to increment cpuhp_waitcount, say, some sort of barriers.

Please note that this wait_event() adds a problem... it doesn't allow
to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
in this case. We can solve this, but this wait_event() complicates
the problem.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-28 12:48                                           ` Oleg Nesterov
@ 2013-09-28 14:47                                             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-28 14:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> > > >  void cpu_hotplug_done(void)
> > > >  {
> ...
> > > > +	/*
> > > > +	 * Wait for any pending readers to be running. This ensures readers
> > > > +	 * after writer and avoids writers starving readers.
> > > > +	 */
> > > > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > >  }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
> 
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
> 
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barriers.

Ah no; I just wanted to guarantee that any pending readers did get a
chance to run. And yes due to the two sync_sched() calls it seems
somewhat unlikely in practice.

> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.

That seems like a particularly easy fix; something like so?

---
 include/linux/cpu.h |    1 
 kernel/cpu.c        |   84 ++++++++++++++++++++++++++++++++++------------------
 2 files changed, 56 insertions(+), 29 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -109,6 +109,7 @@ enum {
 #define CPU_DOWN_FAILED_FROZEN	(CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
 #define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)
 #define CPU_DYING_FROZEN	(CPU_DYING | CPU_TASKS_FROZEN)
+#define CPU_POST_DEAD_FROZEN	(CPU_POST_DEAD | CPU_TASKS_FROZEN)
 #define CPU_STARTING_FROZEN	(CPU_STARTING | CPU_TASKS_FROZEN)
 
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -364,8 +364,7 @@ static int __ref take_cpu_down(void *_pa
 	return 0;
 }
 
-/* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+static int __ref __cpu_down(unsigned int cpu, int tasks_frozen)
 {
 	int err, nr_calls = 0;
 	void *hcpu = (void *)(long)cpu;
@@ -375,21 +374,13 @@ static int __ref _cpu_down(unsigned int
 		.hcpu = hcpu,
 	};
 
-	if (num_online_cpus() == 1)
-		return -EBUSY;
-
-	if (!cpu_online(cpu))
-		return -EINVAL;
-
-	cpu_hotplug_begin();
-
 	err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
 	if (err) {
 		nr_calls--;
 		__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
 		printk("%s: attempt to take down CPU %u failed\n",
 				__func__, cpu);
-		goto out_release;
+		return err;
 	}
 	smpboot_park_threads(cpu);
 
@@ -398,7 +389,7 @@ static int __ref _cpu_down(unsigned int
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		smpboot_unpark_threads(cpu);
 		cpu_notify_nofail(CPU_DOWN_FAILED | mod, hcpu);
-		goto out_release;
+		return err;
 	}
 	BUG_ON(cpu_online(cpu));
 
@@ -420,10 +411,27 @@ static int __ref _cpu_down(unsigned int
 
 	check_for_tasks(cpu);
 
-out_release:
+	return err;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+{
+	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
+	int err;
+
+	if (num_online_cpus() == 1)
+		return -EBUSY;
+
+	if (!cpu_online(cpu))
+		return -EINVAL;
+
+	cpu_hotplug_begin();
+	err = __cpu_down(cpu, tasks_frozen);
 	cpu_hotplug_done();
+
 	if (!err)
-		cpu_notify_nofail(CPU_POST_DEAD | mod, hcpu);
+		cpu_notify_nofail(CPU_POST_DEAD | mod, (void *)(long)cpu);
 	return err;
 }
 
@@ -447,30 +455,22 @@ int __ref cpu_down(unsigned int cpu)
 EXPORT_SYMBOL(cpu_down);
 #endif /*CONFIG_HOTPLUG_CPU*/
 
-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen)
+static int ___cpu_up(unsigned int cpu, int tasks_frozen)
 {
 	int ret, nr_calls = 0;
 	void *hcpu = (void *)(long)cpu;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct task_struct *idle;
 
-	cpu_hotplug_begin();
-
-	if (cpu_online(cpu) || !cpu_present(cpu)) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	idle = idle_thread_get(cpu);
 	if (IS_ERR(idle)) {
 		ret = PTR_ERR(idle);
-		goto out;
+		return ret;
 	}
 
 	ret = smpboot_create_threads(cpu);
 	if (ret)
-		goto out;
+		return ret;
 
 	ret = __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls);
 	if (ret) {
@@ -492,9 +492,26 @@ static int _cpu_up(unsigned int cpu, int
 	/* Now call notifier in preparation. */
 	cpu_notify(CPU_ONLINE | mod, hcpu);
 
+	return 0;
+
 out_notify:
-	if (ret != 0)
-		__cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
+	__cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
+	return ret;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int _cpu_up(unsigned int cpu, int tasks_frozen)
+{
+	int ret;
+
+	cpu_hotplug_begin();
+
+	if (cpu_online(cpu) || !cpu_present(cpu)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = ___cpu_up(cpu, tasks_frozen);
 out:
 	cpu_hotplug_done();
 
@@ -572,11 +587,13 @@ int disable_nonboot_cpus(void)
 	 */
 	cpumask_clear(frozen_cpus);
 
+	cpu_hotplug_begin();
+
 	printk("Disabling non-boot CPUs ...\n");
 	for_each_online_cpu(cpu) {
 		if (cpu == first_cpu)
 			continue;
-		error = _cpu_down(cpu, 1);
+		error = __cpu_down(cpu, 1);
 		if (!error)
 			cpumask_set_cpu(cpu, frozen_cpus);
 		else {
@@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
 		}
 	}
 
+	cpu_hotplug_done();
+
+	for_each_cpu(cpu, frozen_cpus)
+		cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
+
 	if (!error) {
 		BUG_ON(num_online_cpus() > 1);
 		/* Make sure the CPUs won't be enabled by someone else */
@@ -619,8 +641,10 @@ void __ref enable_nonboot_cpus(void)
 
 	arch_enable_nonboot_cpus_begin();
 
+	cpu_hotplug_begin();
+
 	for_each_cpu(cpu, frozen_cpus) {
-		error = _cpu_up(cpu, 1);
+		error = ___cpu_up(cpu, 1);
 		if (!error) {
 			printk(KERN_INFO "CPU%d is up\n", cpu);
 			continue;
@@ -628,6 +652,8 @@ void __ref enable_nonboot_cpus(void)
 		printk(KERN_WARNING "Error taking CPU%d up: %d\n", cpu, error);
 	}
 
+	cpu_hotplug_done();
+
 	arch_enable_nonboot_cpus_end();
 
 	cpumask_clear(frozen_cpus);

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-28 14:47                                             ` Peter Zijlstra
@ 2013-09-28 16:31                                               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-28 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Rafael J. Wysocki, Viresh Kumar

On 09/28, Peter Zijlstra wrote:
>
> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>
> > Please note that this wait_event() adds a problem... it doesn't allow
> > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > in this case. We can solve this, but this wait_event() complicates
> > the problem.
>
> That seems like a particularly easy fix; something like so?

Yes, but...

> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>
> +	cpu_hotplug_done();
> +
> +	for_each_cpu(cpu, frozen_cpus)
> +		cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);

This changes the protocol; I simply do not know if it is fine in general
to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
currently it is possible that CPU_DOWN_PREPARE takes some global lock
released by CPU_DOWN_FAILED or CPU_POST_DEAD.

Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
this notification if FROZEN. So yes, probably this is fine, but needs an
ack from cpufreq maintainers (cc'ed), for example to ensure that it is
fine to call __cpufreq_remove_dev_prepare() twice without _finish().

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-28 12:48                                           ` Oleg Nesterov
@ 2013-09-28 20:46                                             ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-28 20:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> On 09/27, Peter Zijlstra wrote:
> >
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> >
> > > > +static bool cpuhp_readers_active_check(void)
> > > >  {
> > > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > +	smp_mb(); /* B matches A */
> > > > +
> > > > +	/*
> > > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > +	 */
> > > >
> > > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > +		return false;
> > > >
> > > > +	smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(),
> 
> To avoid the confusion, I meant "those comments can't explain mb()s here,
> in cpuhp_readers_active_check()".
> 
> > > note that
> > > __srcu_read_lock() always succeeds unlike get_cpus_online().
> 
> And this is where cpu-hotplug and synchronize_srcu() differ, see below.
> 
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB;
> 
> Yes, Paul, please ;)

Sorry to be slow -- I will reply by end of Monday Pacific time at the
latest.  I need to allow myself enough time so that it seems new...

Also I might try some mechanical proofs of parts of it.

							Thanx, Paul

> > and if
> > not should we change both these and SRCU?
> 
> I guess that SRCU is more "complex" in this respect. IIUC,
> cpuhp_readers_active_check() needs "more" barriers because if
> synchronize_srcu() succeeds it needs to synchronize with the new readers
> which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
> srcu never blocks the readers, srcu_read_*() always succeeds.
> 
> 
> 
> Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
> ->c and ->seq. A plain this_cpu_inc() should be fine?
> 
> And since it disables preemption, why it can't use __this_cpu_inc() to inc
> ->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
> do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does srcu_read_lock()
> it should also do _unlock.
> 
> But this is minor/offtopic.
> 
> > > >  void cpu_hotplug_done(void)
> > > >  {
> ...
> > > > +	/*
> > > > +	 * Wait for any pending readers to be running. This ensures readers
> > > > +	 * after writer and avoids writers starving readers.
> > > > +	 */
> > > > +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > >  }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
> 
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
> 
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barriers.
> 
> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-27 18:15                                       ` Oleg Nesterov
@ 2013-09-29 13:56                                         ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-29 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/27, Oleg Nesterov wrote:
>
> I tried hard to find any hole in this version but failed, I believe it
> is correct.

And I still believe it is. But now I am starting to think that we
don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
>    inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
>    __get_online_cpus() and in cpu_hotplug_begin().
>
>    We do not care if the writer misses some inc(__cpuhp_refcount)
>    in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
>    state = readers_block (and inc(cpuhp_seq) can't help anyway).

Yes!

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
>    from __put_online_cpus() (note that the writer can miss the
>    corresponding inc() if it was done on another CPU, so this dec()
>    can lead to sum() == 0),

But this can't happen in this version? Somehow I forgot that
__get_online_cpus() does inc/dec under preempt_disable(), always on
the same CPU. And thanks to mb's the writer should not miss the
reader which has already passed the "state != BLOCK" check.

To simplify the discussion, let's ignore the "readers_fast" state; the
synchronize_sched() logic looks obviously correct. IOW, let's discuss
only the SLOW -> BLOCK transition.

	cpu_hotplug_begin()
	{
		state = BLOCK;

		mb();

		wait_event(cpuhp_writer,
				per_cpu_sum(__cpuhp_refcount) == 0);
	}

should work just fine? Ignoring all details, we have

	get_online_cpus()
	{
	again:
		preempt_disable();

		__this_cpu_inc(__cpuhp_refcount);

		mb();
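		/*
		 * pairs with the mb() in cpu_hotplug_begin(): either this
		 * reader sees state == BLOCK below, or the writer's
		 * per_cpu_sum(__cpuhp_refcount) sees the increment above
		 * (see the WRITER/READER table further down)
		 */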

		if (state == BLOCK) {

			mb();

			__this_cpu_dec(__cpuhp_refcount);
			wake_up_all(cpuhp_writer);

			preempt_enable();
			wait_event(state != BLOCK);
			goto again;
		}

		preempt_enable();
	}

It seems to me that these mb's guarantee all we need, no?
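
(For reference, per_cpu_sum() above is presumably something like the
following helper -- a sketch, not a quote of Peter's patch:)

	#define per_cpu_sum(var)					\
	({								\
		typeof(var) __sum = 0;					\
		int cpu;						\
		for_each_possible_cpu(cpu)				\
			__sum += per_cpu(var, cpu);			\
		__sum;							\
	})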

It looks really simple. The reader can only succeed if it doesn't see
BLOCK; in this case per_cpu_sum() should see the change.

We have

	WRITER					READER on CPU X

	state = BLOCK;				__cpuhp_refcount[X]++;

	mb();					mb();

	...
	count += __cpuhp_refcount[X];		if (state != BLOCK)
	...						return;

						mb();
						__cpuhp_refcount[X]--;

Either reader or writer should notice the STORE we care about.

If a reader can decrement __cpuhp_refcount, we have 2 cases:

	1. It is the reader holding this lock. In this case we
	   can't miss the corresponding inc() done by this reader,
	   because this reader didn't see BLOCK in the past.

	   It is just the

			A == B == 0
	   	CPU_0			CPU_1
	   	-----			-----
	   	A = 1;			B = 1;
	   	mb();			mb();
	   	b = B;			a = A;

	   pattern: at least one CPU should see 1 in its a/b (a stand-alone
	   C11 rendering of this follows after this list).

	2. It is the reader which tries to take this lock and
	   noticed state == BLOCK. We could miss the result of
	   its inc(), but we do not care, this reader is going
	   to block.

	   _If_ the reader could migrate between inc/dec, then
	   yes, we have a problem. Because that dec() could make
	   the result of per_cpu_sum() = 0. IOW, we could miss
	   inc() but notice dec(). But given that it does this
	   on the same CPU this is not possible.
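
(For intuition only, here is case 1's A/B pattern as a stand-alone
userspace C11 program; smp_mb() is modeled by seq_cst fences, which is an
approximation and not a claim about the kernel memory model:)

	#include <stdatomic.h>
	#include <pthread.h>
	#include <assert.h>

	static atomic_int A, B;
	static int a, b;

	static void *cpu_0(void *arg)
	{
		atomic_store_explicit(&A, 1, memory_order_relaxed);	/* A = 1 */
		atomic_thread_fence(memory_order_seq_cst);		/* mb()  */
		b = atomic_load_explicit(&B, memory_order_relaxed);	/* b = B */
		return NULL;
	}

	static void *cpu_1(void *arg)
	{
		atomic_store_explicit(&B, 1, memory_order_relaxed);	/* B = 1 */
		atomic_thread_fence(memory_order_seq_cst);		/* mb()  */
		a = atomic_load_explicit(&A, memory_order_relaxed);	/* a = A */
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t0, NULL, cpu_0, NULL);
		pthread_create(&t1, NULL, cpu_1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* a == 0 && b == 0 is forbidden: at least one side sees the other's store */
		assert(a == 1 || b == 1);
		return 0;
	}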

So why do we need cpuhp_seq?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-17 14:30     ` Peter Zijlstra
@ 2013-09-29 18:36       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-29 18:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

Hello.

Paul, Peter, et al, could you review the code below?

I am not sending the patch, I think it is simpler to read the code
inline (just in case, I didn't try to compile it yet).

It is functionally equivalent to

	struct xxx_struct {
		atomic_t counter;
	};

	static inline bool xxx_is_idle(struct xxx_struct *xxx)
	{
		return atomic_read(&xxx->counter) == 0;
	}

	static inline void xxx_enter(struct xxx_struct *xxx)
	{
		atomic_inc(&xxx->counter);
		synchronize_sched();
	}

	static inline void xxx_exit(struct xxx_struct *xxx)
	{
		synchronize_sched();
		atomic_dec(&xxx->counter);
	}

except: it records the state and synchronize_sched() is only called by
xxx_enter() and only if necessary.

Why? Say, percpu_rw_semaphore, or the upcoming changes in get_online_cpus()
(Peter, I think they should be unified anyway, but let's ignore this for
now), or freeze_super() (which currently looks buggy), perhaps something
else. This pattern

	writer:
		state = SLOW_MODE;
		synchronize_rcu/sched();

	reader:
		preempt_disable();	// or rcu_read_lock();
		if (state != SLOW_MODE)
			...

is quite common.

Note:
	- This implementation allows multiple writers, and sometimes
	  this makes sense.

	- But it's trivial to add "bool xxx->exclusive" set by xxx_init().
	  If it is true only one xxx_enter() is possible, other callers
	  should block until xxx_exit(). This is what percpu_down_write()
	  actually needs.

	- Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Do you think it is correct? Makes sense? (BUG_ON's are just comments).

Oleg.

// .h	-----------------------------------------------------------------------

struct xxx_struct {
	int			gp_state;

	int			gp_count;
	wait_queue_head_t	gp_waitq;

	int			cb_state;
	struct rcu_head		cb_head;
};

static inline bool xxx_is_idle(struct xxx_struct *xxx)
{
	return !xxx->gp_state; /* GP_IDLE */
}

extern void xxx_enter(struct xxx_struct *xxx);
extern void xxx_exit(struct xxx_struct *xxx);

// .c	-----------------------------------------------------------------------

enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };

enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };

#define xxx_lock	gp_waitq.lock

void xxx_enter(struct xxx_struct *xxx)
{
	bool need_wait, need_sync;

	spin_lock_irq(&xxx->xxx_lock);
	need_wait = xxx->gp_count++;
	need_sync = xxx->gp_state == GP_IDLE;
	if (need_sync)
		xxx->gp_state = GP_PENDING;
	spin_unlock_irq(&xxx->xxx_lock);

	BUG_ON(need_wait && need_sync);

	if (need_sync) {
		synchronize_sched();
		xxx->gp_state = GP_PASSED;
		wake_up_all(&xxx->gp_waitq);
	} else if (need_wait) {
		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
	} else {
		BUG_ON(xxx->gp_state != GP_PASSED);
	}
}

static void cb_rcu_func(struct rcu_head *rcu)
{
	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
	unsigned long flags;

	BUG_ON(xxx->gp_state != GP_PASSED);
	BUG_ON(xxx->cb_state == CB_IDLE);

	spin_lock_irqsave(&xxx->xxx_lock, flags);
	if (xxx->gp_count) {
		xxx->cb_state = CB_IDLE;
	} else if (xxx->cb_state == CB_REPLAY) {
		xxx->cb_state = CB_PENDING;
		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
	} else {
		xxx->cb_state = CB_IDLE;
		xxx->gp_state = GP_IDLE;
	}
	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
}

void xxx_exit(struct xxx_struct *xxx)
{
	spin_lock_irq(&xxx->xxx_lock);
	if (!--xxx->gp_count) {
		if (xxx->cb_state == CB_IDLE) {
			xxx->cb_state = CB_PENDING;
			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
		} else if (xxx->cb_state == CB_PENDING) {
			xxx->cb_state = CB_REPLAY;
		}
	}
	spin_unlock_irq(&xxx->xxx_lock);
}
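
// usage sketch	--------------------------------------------------------------
//
// A purely hypothetical user of the primitive above; all foo_* names are
// invented, foo_xxx's gp_waitq is assumed to be initialised elsewhere, and
// (as clarified later in the thread) the reader-side xxx_is_idle() check
// must run under preempt_disable() or rcu_read_lock_sched().

static struct xxx_struct foo_xxx;
static DEFINE_MUTEX(foo_slow_lock);

void foo_read(void)
{
	preempt_disable();
	if (likely(xxx_is_idle(&foo_xxx))) {
		/*
		 * Fast path: no writer.  A concurrent xxx_enter() will do
		 * synchronize_sched() and thus wait for this region.
		 */
		foo_do_lockless_read();			/* invented */
		preempt_enable();
		return;
	}
	preempt_enable();

	/* slow path: a writer is (or recently was) active */
	mutex_lock(&foo_slow_lock);
	foo_do_locked_read();				/* invented */
	mutex_unlock(&foo_slow_lock);
}

void foo_write(void)
{
	xxx_enter(&foo_xxx);	/* push new readers onto the slow path */
	mutex_lock(&foo_slow_lock);
	foo_do_update();				/* invented */
	mutex_unlock(&foo_slow_lock);
	xxx_exit(&foo_xxx);	/* readers return to the fast path after a GP */
}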


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 18:36       ` Oleg Nesterov
@ 2013-09-29 20:01         ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-09-29 20:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Hello.
> 
> Paul, Peter, et al, could you review the code below?
> 
> I am not sending the patch, I think it is simpler to read the code
> inline (just in case, I didn't try to compile it yet).
> 
> It is functionally equivalent to
> 
> 	struct xxx_struct {
> 		atomic_t counter;
> 	};
> 
> 	static inline bool xxx_is_idle(struct xxx_struct *xxx)
> 	{
> 		return atomic_read(&xxx->counter) == 0;
> 	}
> 
> 	static inline void xxx_enter(struct xxx_struct *xxx)
> 	{
> 		atomic_inc(&xxx->counter);
> 		synchronize_sched();
> 	}
> 
> 	static inline void xxx_enter(struct xxx_struct *xxx)
> 	{
> 		synchronize_sched();
> 		atomic_dec(&xxx->counter);
> 	}

But there is nothing for synchronize_sched() to wait for in the above.
Presumably the caller of xxx_is_idle() is required to disable preemption
or be under rcu_read_lock_sched()?

> except: it records the state and synchronize_sched() is only called by
> xxx_enter() and only if necessary.
> 
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 
> 	writer:
> 		state = SLOW_MODE;
> 		synchronize_rcu/sched();
> 
> 	reader:
> 		preempt_disable();	// or rcu_read_lock();
> 		if (state != SLOW_MODE)
> 			...
> 
> is quite common.

And this does guarantee that by the time the writer's synchronize_whatever()
exits, all readers will know that state==SLOW_MODE.

> Note:
> 	- This implementation allows multiple writers, and sometimes
> 	  this makes sense.

If each writer atomically incremented SLOW_MODE, did its update, then
atomically decremented it, sure.  You could be more clever and avoid
unneeded synchronize_whatever() calls, but I would have to see a good
reason for doing so before recommending this.

OK, but you appear to be doing this below anyway.  ;-)

> 	- But it's trivial to add "bool xxx->exclusive" set by xxx_init().
> 	  If it is true only one xxx_enter() is possible, other callers
> 	  should block until xxx_exit(). This is what percpu_down_write()
> 	  actually needs.

Agreed.

> 	- Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Or just have pointers to the RCU functions in the xxx structure...

So you are trying to make something that abstracts the RCU-protected
state-change pattern?  Or perhaps more accurately, the RCU-protected
state-change-and-back pattern?

> Do you think it is correct? Makes sense? (BUG_ON's are just comments).

... Maybe ...   Please see below for commentary and a question.

							Thanx, Paul

> Oleg.
> 
> // .h	-----------------------------------------------------------------------
> 
> struct xxx_struct {
> 	int			gp_state;
> 
> 	int			gp_count;
> 	wait_queue_head_t	gp_waitq;
> 
> 	int			cb_state;
> 	struct rcu_head		cb_head;

	spinlock_t		xxx_lock;  /* ? */

This spinlock might not make the big-system guys happy, but it appears to
be needed below.

> };
> 
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> 	return !xxx->gp_state; /* GP_IDLE */
> }
> 
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
> 
> // .c	-----------------------------------------------------------------------
> 
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> 
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> #define xxx_lock	gp_waitq.lock
> 
> void xxx_enter(struct xxx_struct *xxx)
> {
> 	bool need_wait, need_sync;
> 
> 	spin_lock_irq(&xxx->xxx_lock);
> 	need_wait = xxx->gp_count++;
> 	need_sync = xxx->gp_state == GP_IDLE;

Suppose ->gp_state is GP_PASSED.  It could transition to GP_IDLE at any
time, right?

> 	if (need_sync)
> 		xxx->gp_state = GP_PENDING;
> 	spin_unlock_irq(&xxx->xxx_lock);
> 
> 	BUG_ON(need_wait && need_sync);
> 
> 	} if (need_sync) {
> 		synchronize_sched();
> 		xxx->gp_state = GP_PASSED;
> 		wake_up_all(&xxx->gp_waitq);
> 	} else if (need_wait) {
> 		wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);

Suppose the wakeup is delayed until after the state has been updated
back to GP_IDLE?  Ah, presumably the non-zero ->gp_count prevents this.
Never mind!

> 	} else {
> 		BUG_ON(xxx->gp_state != GP_PASSED);
> 	}
> }
> 
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> 	long flags;
> 
> 	BUG_ON(xxx->gp_state != GP_PASSED);
> 	BUG_ON(xxx->cb_state == CB_IDLE);
> 
> 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> 	if (xxx->gp_count) {
> 		xxx->cb_state = CB_IDLE;
> 	} else if (xxx->cb_state == CB_REPLAY) {
> 		xxx->cb_state = CB_PENDING;
> 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 	} else {
> 		xxx->cb_state = CB_IDLE;
> 		xxx->gp_state = GP_IDLE;
> 	}

It took me a bit to work out the above.  It looks like the intent is
to have the last xxx_exit() put the state back to GP_IDLE, which appears
to be the state in which readers can use a fastpath.

This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
there must be an xxx_exit() in our future.

> 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
> 
> void xxx_exit(struct xxx_struct *xxx)
> {
> 	spin_lock_irq(&xxx->xxx_lock);
> 	if (!--xxx->gp_count) {
> 		if (xxx->cb_state == CB_IDLE) {
> 			xxx->cb_state = CB_PENDING;
> 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 		} else if (xxx->cb_state == CB_PENDING) {
> 			xxx->cb_state = CB_REPLAY;
> 		}
> 	}
> 	spin_unlock_irq(&xxx->xxx_lock);
> }

Then we also have something like this?

bool xxx_readers_fastpath_ok(struct xxx_struct *xxx)
{
	BUG_ON(!rcu_read_lock_sched_held());
	return xxx->gp_state == GP_IDLE;
}


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 18:36       ` Oleg Nesterov
@ 2013-09-29 21:34         ` Steven Rostedt
  -1 siblings, 0 replies; 361+ messages in thread
From: Steven Rostedt @ 2013-09-29 21:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner, Linus Torvalds

On Sun, 29 Sep 2013 20:36:34 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

 
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 

Just so I'm clear on what you are trying to implement... This is to
handle the case (as Paul said) of seeing changes to state by RCU and back
again? That is, it isn't enough to see that the state changed to
something (like SLOW_MODE), but we also need a way to see it change
back?

With get_online_cpus(), we need to see the state where it changed to
"performing hotplug", where holders need to go into the slow path, and
then also see the state change to "no longer performing hotplug", where
the holders go back to the fast path. Is this the rationale for this email?

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7
  2013-09-14  2:57   ` Bob Liu
@ 2013-09-30 10:30     ` Mel Gorman
  -1 siblings, 0 replies; 361+ messages in thread
From: Mel Gorman @ 2013-09-30 10:30 UTC (permalink / raw)
  To: Bob Liu
  Cc: Peter Zijlstra, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML

On Sat, Sep 14, 2013 at 10:57:35AM +0800, Bob Liu wrote:
> Hi Mel,
> 
> On 09/10/2013 05:31 PM, Mel Gorman wrote:
> > It has been a long time since V6 of this series and time for an update. Much
> > of this is now stabilised with the most important addition being the inclusion
> > of Peter and Rik's work on grouping tasks that share pages together.
> > 
> > This series has a number of goals. It reduces overhead of automatic balancing
> > through scan rate reduction and the avoidance of TLB flushes. It selects a
> > preferred node and moves tasks towards their memory as well as moving memory
> > toward their task. It handles shared pages and groups related tasks together.
> > 
> 
> I found that sometimes numa balancing will be broken after khugepaged
> has started, because khugepaged always allocates the huge page from the
> node of the first scanned normal page during collapsing.
> 

This is a real, but separate problem.

> I think this may be related to this topic; I don't know whether this
> series can also fix the issue I mentioned.
> 

This series does not aim to fix that particular problem. There will be
some interactions between the problems as automatic NUMA balancing deals
with THP migration but they are only indirectly related. If khugepaged
does not collapse to huge pages inappropriately then automatic NUMA
balancing will never encounter them.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 20:01         ` Paul E. McKenney
@ 2013-09-30 12:42           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 12:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On 09/29, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> >
> > 	struct xxx_struct {
> > 		atomic_t counter;
> > 	};
> >
> > 	static inline bool xxx_is_idle(struct xxx_struct *xxx)
> > 	{
> > 		return atomic_read(&xxx->counter) == 0;
> > 	}
> >
> > 	static inline void xxx_enter(struct xxx_struct *xxx)
> > 	{
> > 		atomic_inc(&xxx->counter);
> > 		synchronize_sched();
> > 	}
> >
> > 	static inline void xxx_exit(struct xxx_struct *xxx)
> > 	{
> > 		synchronize_sched();
> > 		atomic_dec(&xxx->counter);
> > 	}
>
> But there is nothing for synchronize_sched() to wait for in the above.
> Presumably the caller of xxx_is_idle() is required to disable preemption
> or be under rcu_read_lock_sched()?

Yes, yes, sure, xxx_is_idle() should be called under preempt_disable().
(or rcu_read_lock() if xxx_enter() uses synchronize_rcu()).

> So you are trying to make something that abstracts the RCU-protected
> state-change pattern?  Or perhaps more accurately, the RCU-protected
> state-change-and-back pattern?

Yes, exactly.

> > struct xxx_struct {
> > 	int			gp_state;
> >
> > 	int			gp_count;
> > 	wait_queue_head_t	gp_waitq;
> >
> > 	int			cb_state;
> > 	struct rcu_head		cb_head;
>
> 	spinlock_t		xxx_lock;  /* ? */

See

	#define xxx_lock	gp_waitq.lock
	
in .c below, but we can add another spinlock.

> This spinlock might not make the big-system guys happy, but it appears to
> be needed below.

Only the writers use this spinlock, and they should synchronize with each
other anyway. I don't think this can really penalize, say, percpu_down_write
or cpu_hotplug_begin.

> > // .c	-----------------------------------------------------------------------
> >
> > enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> >
> > enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >
> > #define xxx_lock	gp_waitq.lock
> >
> > void xxx_enter(struct xxx_struct *xxx)
> > {
> > 	bool need_wait, need_sync;
> >
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	need_wait = xxx->gp_count++;
> > 	need_sync = xxx->gp_state == GP_IDLE;
>
> Suppose ->gp_state is GP_PASSED.  It could transition to GP_IDLE at any
> time, right?

As you already pointed out below - no.

Once we have incremented ->gp_count, nobody can set GP_IDLE. And if the
caller is the "first" writer (need_sync == T) nobody else can change
->gp_state, so xxx_enter() sets GP_PASSED locklessly.

> > 	if (need_sync)
> > 		xxx->gp_state = GP_PENDING;
> > 	spin_unlock_irq(&xxx->xxx_lock);
> >
> > 	BUG_ON(need_wait && need_sync);
> >
> > 	if (need_sync) {
> > 		synchronize_sched();
> > 		xxx->gp_state = GP_PASSED;
> > 		wake_up_all(&xxx->gp_waitq);
> > 	} else if (need_wait) {
> > 		wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>
> Suppose the wakeup is delayed until after the state has been updated
> back to GP_IDLE?  Ah, presumably the non-zero ->gp_count prevents this.

Yes, exactly.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
> > 	}
>
> It took me a bit to work out the above.  It looks like the intent is
> to have the last xxx_exit() put the state back to GP_IDLE, which appears
> to be the state in which readers can use a fastpath.

Yes, and we offload this work to the rcu callback so xxx_exit() doesn't
block.

The only complication is the next writer which does xxx_enter() after
xxx_exit(). If there are no other writers, the next xxx_exit() should do

	rcu_cancel(&xxx->cb_head);
	call_rcu_sched(&xxx->cb_head, cb_rcu_func);

to "extend" the gp, but since we do not have rcu_cancel() it simply sets
CB_REPLAY to instruct cb_rcu_func() to reschedule itself.
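
To summarize (a sketch of the transitions, using the names from the RFC):

	/*
	 * xxx_exit() by the last writer (gp_count drops to 0):
	 *	CB_IDLE    -> CB_PENDING	queue cb_rcu_func()
	 *	CB_PENDING -> CB_REPLAY		a callback is already in flight;
	 *					ask it to requeue itself
	 *
	 * cb_rcu_func():
	 *	gp_count != 0		-> CB_IDLE	a new writer owns the gp
	 *	cb_state == CB_REPLAY	-> CB_PENDING	requeue cb_rcu_func()
	 *	otherwise		-> CB_IDLE, GP_IDLE  readers go fast again
	 */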

> This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
> there must be an xxx_exit() in our future.

Yes, but ->cb_state doesn't really matter if ->gp_count != 0 in xxx_exit()
or cb_rcu_func() (except it can't be CB_IDLE in cb_rcu_func).

> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> Then we also have something like this?
>
> bool xxx_readers_fastpath_ok(struct xxx_struct *xxx)
> {
> 	BUG_ON(!rcu_read_lock_sched_held());
> 	return xxx->gp_state == GP_IDLE;
> }

Yes, this is what xxx_is_idle() does (ignoring the BUG_ON). It actually
checks xxx->gp_state == 0; this is just to avoid the unnecessary export
of the GP_* enum.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 18:36       ` Oleg Nesterov
@ 2013-09-30 12:59         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 12:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). 

If you think the percpu_rwsem users can benefit, sure. So far it's good I
didn't go the percpu_rwsem route, for it looks like we got something
better at the end of it ;-)

> Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 
> 	writer:
> 		state = SLOW_MODE;
> 		synchronize_rcu/sched();
> 
> 	reader:
> 		preempt_disable();	// or rcu_read_lock();
> 		if (state != SLOW_MODE)
> 			...
> 
> is quite common.

Well, if we make percpu_rwsem the de facto container of the pattern and
use that throughout, we'd have only a single implementation and wouldn't
need the abstraction.

That said; we could still use the idea proposed; so let me take a look.

> // .h	-----------------------------------------------------------------------
> 
> struct xxx_struct {
> 	int			gp_state;
> 
> 	int			gp_count;
> 	wait_queue_head_t	gp_waitq;
> 
> 	int			cb_state;
> 	struct rcu_head		cb_head;
> };
> 
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> 	return !xxx->gp_state; /* GP_IDLE */
> }
> 
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
> 
> // .c	-----------------------------------------------------------------------
> 
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> 
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> #define xxx_lock	gp_waitq.lock
> 
> void xxx_enter(struct xxx_struct *xxx)
> {
> 	bool need_wait, need_sync;
> 
> 	spin_lock_irq(&xxx->xxx_lock);
> 	need_wait = xxx->gp_count++;
> 	need_sync = xxx->gp_state == GP_IDLE;
> 	if (need_sync)
> 		xxx->gp_state = GP_PENDING;
> 	spin_unlock_irq(&xxx->xxx_lock);
> 
> 	BUG_ON(need_wait && need_sync);
> 
> 	if (need_sync) {
> 		synchronize_sched();
> 		xxx->gp_state = GP_PASSED;
> 		wake_up_all(&xxx->gp_waitq);
> 	} else if (need_wait) {
> 		wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
> 	} else {
> 		BUG_ON(xxx->gp_state != GP_PASSED);
> 	}
> }
> 
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> 	long flags;
> 
> 	BUG_ON(xxx->gp_state != GP_PASSED);
> 	BUG_ON(xxx->cb_state == CB_IDLE);
> 
> 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> 	if (xxx->gp_count) {
> 		xxx->cb_state = CB_IDLE;

This seems to be when a new xxx_enter() has happened after our last
xxx_exit() and the sync_sched() from xxx_enter() merges with the
xxx_exit() one, and we're done.

> 	} else if (xxx->cb_state == CB_REPLAY) {
> 		xxx->cb_state = CB_PENDING;
> 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);

A later xxx_exit() has happened, and we need to requeue to catch a later
GP.

> 	} else {
> 		xxx->cb_state = CB_IDLE;
> 		xxx->gp_state = GP_IDLE;

Nothing fancy happened and we're done.

> 	}
> 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
> 
> void xxx_exit(struct xxx_struct *xxx)
> {
> 	spin_lock_irq(&xxx->xxx_lock);
> 	if (!--xxx->gp_count) {
> 		if (xxx->cb_state == CB_IDLE) {
> 			xxx->cb_state = CB_PENDING;
> 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 		} else if (xxx->cb_state == CB_PENDING) {
> 			xxx->cb_state = CB_REPLAY;
> 		}
> 	}
> 	spin_unlock_irq(&xxx->xxx_lock);
> }

So I don't immediately see the point of the concurrent write side;
percpu_rwsem wouldn't allow this and afaict neither would
freeze_super().

Other than that; yes, this makes sense if you care about write-side
performance, and I think it's solid.
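
For completeness, the "exclusive" flavour Oleg mentioned (which is what
percpu_down_write() or freeze_super() would want) could be a thin wrapper;
this is only a hypothetical sketch, and the writer_mutex member is not part
of the RFC structure:

	/* hypothetical: struct xxx_struct grows a 'struct mutex writer_mutex' */

	void xxx_enter_exclusive(struct xxx_struct *xxx)
	{
		mutex_lock(&xxx->writer_mutex);	/* serialize writers */
		xxx_enter(xxx);			/* push readers to the slow path */
	}

	void xxx_exit_exclusive(struct xxx_struct *xxx)
	{
		xxx_exit(xxx);
		mutex_unlock(&xxx->writer_mutex);
	}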

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 12:59         ` Peter Zijlstra
  0 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 12:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). 

If you think the percpu_rwsem users can benefit sure.. So far its good I
didn't go the percpu_rwsem route for it looks like we got something
better at the end of it ;-)

> Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
> 
> 	writer:
> 		state = SLOW_MODE;
> 		synchronize_rcu/sched();
> 
> 	reader:
> 		preempt_disable();	// or rcu_read_lock();
> 		if (state != SLOW_MODE)
> 			...
> 
> is quite common.

Well, if we make percpu_rwsem the defacto container of the pattern and
use that throughout, we'd have only a single implementation and don't
need the abstraction.

That said; we could still use the idea proposed; so let me take a look.

> // .h	-----------------------------------------------------------------------
> 
> struct xxx_struct {
> 	int			gp_state;
> 
> 	int			gp_count;
> 	wait_queue_head_t	gp_waitq;
> 
> 	int			cb_state;
> 	struct rcu_head		cb_head;
> };
> 
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> 	return !xxx->gp_state; /* GP_IDLE */
> }
> 
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
> 
> // .c	-----------------------------------------------------------------------
> 
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> 
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> 
> #define xxx_lock	gp_waitq.lock
> 
> void xxx_enter(struct xxx_struct *xxx)
> {
> 	bool need_wait, need_sync;
> 
> 	spin_lock_irq(&xxx->xxx_lock);
> 	need_wait = xxx->gp_count++;
> 	need_sync = xxx->gp_state == GP_IDLE;
> 	if (need_sync)
> 		xxx->gp_state = GP_PENDING;
> 	spin_unlock_irq(&xxx->xxx_lock);
> 
> 	BUG_ON(need_wait && need_sync);
> 
> 	if (need_sync) {
> 		synchronize_sched();
> 		xxx->gp_state = GP_PASSED;
> 		wake_up_all(&xxx->gp_waitq);
> 	} else if (need_wait) {
> 	wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
> 	} else {
> 		BUG_ON(xxx->gp_state != GP_PASSED);
> 	}
> }
> 
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> 	long flags;
> 
> 	BUG_ON(xxx->gp_state != GP_PASSED);
> 	BUG_ON(xxx->cb_state == CB_IDLE);
> 
> 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> 	if (xxx->gp_count) {
> 		xxx->cb_state = CB_IDLE;

This seems to be when a new xxx_begin() has happened after our last
xxx_end() and the sync_sched() from xxx_begin() merges with the
xxx_end() one and we're done.

> 	} else if (xxx->cb_state == CB_REPLAY) {
> 		xxx->cb_state = CB_PENDING;
> 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);

A later xxx_exit() has happened, and we need to requeue to catch a later
GP.

> 	} else {
> 		xxx->cb_state = CB_IDLE;
> 		xxx->gp_state = GP_IDLE;

Nothing fancy happened and we're done.

> 	}
> 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
> 
> void xxx_exit(struct xxx_struct *xxx)
> {
> 	spin_lock_irq(&xxx->xxx_lock);
> 	if (!--xxx->gp_count) {
> 		if (xxx->cb_state == CB_IDLE) {
> 			xxx->cb_state = CB_PENDING;
> 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 		} else if (xxx->cb_state == CB_PENDING) {
> 			xxx->cb_state = CB_REPLAY;
> 		}
> 	}
> 	spin_unlock_irq(&xxx->xxx_lock);
> }

So I don't immediately see the point of the concurrent write side;
percpu_rwsem wouldn't allow this and afaict neither would
freeze_super().

Other than that; yes this makes sense if you care about write side
performance and I think it's solid.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 21:34         ` Steven Rostedt
@ 2013-09-30 13:03           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 13:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner, Linus Torvalds

On 09/29, Steven Rostedt wrote:
>
> On Sun, 29 Sep 2013 20:36:34 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
>
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now). Or freeze_super() (which currently looks buggy), perhaps something
> > else. This pattern
> >
>
> Just so I'm clear to what you are trying to implement... This is to
> handle the case (as Paul said) to see changes to state by RCU and back
> again? That is, it isn't enough to see that the state changed to
> something (like SLOW MODE), but we also need a way to see it change
> back?

Suppose this code was applied as is. Now we can change percpu_rwsem,
see the "patch" below. (please ignore _expedited in the current code).

This immediately makes percpu_up_write() much faster, it no longer
blocks. And the contending writers (or even the same writer which
takes it again) can avoid synchronize_sched() in percpu_down_write().

And to remind, we can add xxx_struct->exclusive (or add the argument
to xxx_enter/exit), and then (with some other changes) we can kill
percpu_rw_semaphore->rw_sem.

> With get_online_cpus(), we need to see the state where it changed to
> "performing hotplug" where holders need to go into the slow path, and
> then also see the state change to "no longer performing hotplug" and the
> holders now go back to the fast path. Is this the rationale for this email?

The same. cpu_hotplug_begin/end (I mean the code written by Peter) can
be changed to use xxx_enter/exit.

Oleg.

--- x/include/linux/percpu-rwsem.h
+++ x/include/linux/percpu-rwsem.h
@@ -8,8 +8,8 @@
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
+	xxx_struct		xxx;
 	unsigned int __percpu	*fast_read_ctr;
-	atomic_t		write_ctr;
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
--- x/lib/percpu-rwsem.c
+++ x/lib/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw
 
 	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	__init_rwsem(&brw->rw_sem, name, rwsem_key);
-	atomic_set(&brw->write_ctr, 0);
+	xxx_init(&brw->xxx, ...);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
 	return 0;
@@ -25,6 +25,14 @@ int __percpu_init_rwsem(struct percpu_rw
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+	might_sleep();
+
+	// pseudo code which needs another simple xxx_ helper
+	if (xxx->gp_state == GP_REPLAY)
+		xxx->gp_state = GP_PENDING;
+	if (xxx->gp_state)
+		synchronize_sched();
+
 	free_percpu(brw->fast_read_ctr);
 	brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -57,7 +65,7 @@ static bool update_fast_ctr(struct percp
 	bool success = false;
 
 	preempt_disable();
-	if (likely(!atomic_read(&brw->write_ctr))) {
+	if (likely(xxx_is_idle(&brw->xxx))) {
 		__this_cpu_add(*brw->fast_read_ctr, val);
 		success = true;
 	}
@@ -126,20 +134,7 @@ static int clear_fast_ctr(struct percpu_
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-	/* tell update_fast_ctr() there is a pending writer */
-	atomic_inc(&brw->write_ctr);
-	/*
-	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-	 *    so that update_fast_ctr() can't succeed.
-	 *
-	 * 2. Ensures we see the result of every previous this_cpu_add() in
-	 *    update_fast_ctr().
-	 *
-	 * 3. Ensures that if any reader has exited its critical section via
-	 *    fast-path, it executes a full memory barrier before we return.
-	 *    See R_W case in the comment above update_fast_ctr().
-	 */
-	synchronize_sched_expedited();
+	xxx_enter(&brw->xxx);
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -159,7 +154,5 @@ void percpu_up_write(struct percpu_rw_se
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
-	/* the last writer unblocks update_fast_ctr() */
-	atomic_dec(&brw->write_ctr);
+	xxx_exit(&brw->xxx);
 }


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 13:03           ` Oleg Nesterov
  0 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 13:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Paul McKenney, Thomas Gleixner, Linus Torvalds

On 09/29, Steven Rostedt wrote:
>
> On Sun, 29 Sep 2013 20:36:34 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
>
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now). Or freeze_super() (which currently looks buggy), perhaps something
> > else. This pattern
> >
>
> Just so I'm clear to what you are trying to implement... This is to
> handle the case (as Paul said) to see changes to state by RCU and back
> again? That is, it isn't enough to see that the state changed to
> something (like SLOW MODE), but we also need a way to see it change
> back?

Suppose this code was applied as is. Now we can change percpu_rwsem,
see the "patch" below. (please ignore _expedited in the current code).

This immediately makes percpu_up_write() much faster, it no longer
blocks. And the contending writers (or even the same writer which
takes it again) can avoid synchronize_sched() in percpu_down_write().

And to remind, we can add xxx_struct->exclusive (or add the argument
to xxx_enter/exit), and then (with some other changes) we can kill
percpu_rw_semaphore->rw_sem.

> With get_online_cpus(), we need to see the state where it changed to
> "performing hotplug" where holders need to go into the slow path, and
> then also see the state change to "no longer performing hotplug" and the
> holders now go back to the fast path. Is this the rationale for this email?

The same. cpu_hotplug_begin/end (I mean the code written by Peter) can
be changed to use xxx_enter/exit.

Oleg.

--- x/include/linux/percpu-rwsem.h
+++ x/include/linux/percpu-rwsem.h
@@ -8,8 +8,8 @@
 #include <linux/lockdep.h>
 
 struct percpu_rw_semaphore {
+	xxx_struct		xxx;
 	unsigned int __percpu	*fast_read_ctr;
-	atomic_t		write_ctr;
 	struct rw_semaphore	rw_sem;
 	atomic_t		slow_read_ctr;
 	wait_queue_head_t	write_waitq;
--- x/lib/percpu-rwsem.c
+++ x/lib/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw
 
 	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
 	__init_rwsem(&brw->rw_sem, name, rwsem_key);
-	atomic_set(&brw->write_ctr, 0);
+	xxx_init(&brw->xxx, ...);
 	atomic_set(&brw->slow_read_ctr, 0);
 	init_waitqueue_head(&brw->write_waitq);
 	return 0;
@@ -25,6 +25,14 @@ int __percpu_init_rwsem(struct percpu_rw
 
 void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 {
+	might_sleep();
+
+	// pseudo code which needs another simple xxx_ helper
+	if (xxx->gp_state == GP_REPLAY)
+		xxx->gp_state = GP_PENDING;
+	if (xxx->gp_state)
+		synchronize_sched();
+
 	free_percpu(brw->fast_read_ctr);
 	brw->fast_read_ctr = NULL; /* catch use after free bugs */
 }
@@ -57,7 +65,7 @@ static bool update_fast_ctr(struct percp
 	bool success = false;
 
 	preempt_disable();
-	if (likely(!atomic_read(&brw->write_ctr))) {
+	if (likely(xxx_is_idle(&brw->xxx))) {
 		__this_cpu_add(*brw->fast_read_ctr, val);
 		success = true;
 	}
@@ -126,20 +134,7 @@ static int clear_fast_ctr(struct percpu_
  */
 void percpu_down_write(struct percpu_rw_semaphore *brw)
 {
-	/* tell update_fast_ctr() there is a pending writer */
-	atomic_inc(&brw->write_ctr);
-	/*
-	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
-	 *    so that update_fast_ctr() can't succeed.
-	 *
-	 * 2. Ensures we see the result of every previous this_cpu_add() in
-	 *    update_fast_ctr().
-	 *
-	 * 3. Ensures that if any reader has exited its critical section via
-	 *    fast-path, it executes a full memory barrier before we return.
-	 *    See R_W case in the comment above update_fast_ctr().
-	 */
-	synchronize_sched_expedited();
+	xxx_enter(&brw->xxx);
 
 	/* exclude other writers, and block the new readers completely */
 	down_write(&brw->rw_sem);
@@ -159,7 +154,5 @@ void percpu_up_write(struct percpu_rw_se
 	 * Insert the barrier before the next fast-path in down_read,
 	 * see W_R case in the comment above update_fast_ctr().
 	 */
-	synchronize_sched_expedited();
-	/* the last writer unblocks update_fast_ctr() */
-	atomic_dec(&brw->write_ctr);
+	xxx_exit(&brw->xxx);
 }


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-30 12:59         ` Peter Zijlstra
@ 2013-09-30 14:24           ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 14:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:

> > 
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> > 
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> > 
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
> 
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
> 
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.
> 
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
> 
> Nothing fancy happened and we're done.
> 
> > 	}
> > 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> > 
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
> 
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
> 
> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Hmm, wait. I don't see how this is equivalent to:

xxx_end()
{
	synchronize_sched();
	atomic_dec(&xxx->counter);
}

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we?

Without that there's no guarantee the fast path readers will have a MB
to observe the write critical section, unless I'm completely missing
something obvious here.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 14:24           ` Peter Zijlstra
  0 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 14:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:

> > 
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> > 
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> > 
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
> 
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
> 
> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> 
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.
> 
> > 	} else {
> > 		xxx->cb_state = CB_IDLE;
> > 		xxx->gp_state = GP_IDLE;
> 
> Nothing fancy happened and we're done.
> 
> > 	}
> > 	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> > 
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > 	spin_lock_irq(&xxx->xxx_lock);
> > 	if (!--xxx->gp_count) {
> > 		if (xxx->cb_state == CB_IDLE) {
> > 			xxx->cb_state = CB_PENDING;
> > 			call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > 		} else if (xxx->cb_state == CB_PENDING) {
> > 			xxx->cb_state = CB_REPLAY;
> > 		}
> > 	}
> > 	spin_unlock_irq(&xxx->xxx_lock);
> > }
> 
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
> 
> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Hmm, wait. I don't see how this is equivalent to:

xxx_end()
{
	synchronize_sched();
	atomic_dec(&xxx->counter);
}

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we?

Without that there's no guarantee the fast path readers will have a MB
to observe the write critical section, unless I'm completely missing
something obvious here.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-30 14:24           ` Peter Zijlstra
@ 2013-09-30 15:06             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we?
> 
> Without that there's no guarantee the fast path readers will have a MB
> to observe the write critical section, unless I'm completely missing
> something obvious here.

Duh.. we should be looking at gp_state like Paul said.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 15:06             ` Peter Zijlstra
  0 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-09-30 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we?
> 
> Without that there's no guarantee the fast path readers will have a MB
> to observe the write critical section, unless I'm completely missing
> something obvious here.

Duh.. we should be looking at gp_state like Paul said.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-30 12:59         ` Peter Zijlstra
@ 2013-09-30 16:38           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit sure.. So far it's good I
> didn't go the percpu_rwsem route for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.
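
(If that conversion happens, the end state would look roughly like the sketch
below. The semaphore name is assumed for illustration; none of this comes from
a posted patch.)

	static struct percpu_rw_semaphore cpu_hotplug_rwsem;

	void get_online_cpus(void)		/* readers: per-CPU fast path */
	{
		percpu_down_read(&cpu_hotplug_rwsem);
	}

	void put_online_cpus(void)
	{
		percpu_up_read(&cpu_hotplug_rwsem);
	}

	void cpu_hotplug_begin(void)		/* writer: pushes readers to the slow path */
	{
		percpu_down_write(&cpu_hotplug_rwsem);
	}

	void cpu_hotplug_done(void)
	{
		percpu_up_write(&cpu_hotplug_rwsem);
	}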

> Well, if we make percpu_rwsem the defacto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at "struct sb_writers". Yes, I believe it makes sense to use percpu_rwsem
here, but note that it is actually an array of semaphores. I do not think
each element needs its own xxx_struct.

> and don't
> need the abstraction.

And even if struct percpu_rw_semaphore will be the only container of
xxx_struct, I think the code looks better and more understandable this
way, exactly because it adds the new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh I disagree. Even ignoring the fact I believe xxx_struct itself
can have more users (I can be wrong of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive"
should be the argument of percpu_init_rwsem). And in fact the
initial implementation I sent didn't even have the "exclusive" mode.

Please look at uprobes (currently the only user). We do not really
need the global write-lock, we can do the per-uprobe locking. However,
every caller needs to block the percpu_down_read() callers (dup_mmap).
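
(To make the non-exclusive idea concrete, a hypothetical sketch, not code from
uprobes or from any posted patch: each writer blocks the per-CPU read fast
path, but serialises only on its own object. The names dup_mmap_block and
my_uprobe are made up.)

	static struct xxx_struct dup_mmap_block;	/* shared with the readers */

	void update_one_uprobe(struct my_uprobe *u)
	{
		xxx_enter(&dup_mmap_block);	/* dup_mmap()-style readers -> slow path */
		mutex_lock(&u->lock);		/* exclusion only against this uprobe */
		/* ... install or remove breakpoints for u ... */
		mutex_unlock(&u->lock);
		xxx_exit(&dup_mmap_block);	/* other writers may overlap freely */
	}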

> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Great ;)

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 16:38           ` Oleg Nesterov
  0 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit sure.. So far it's good I
> didn't go the percpu_rwsem route for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.

> Well, if we make percpu_rwsem the defacto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at "struct sb_writers". Yes, I believe it makes sense to use percpu_rwsem
here, but note that it is actually an array of semaphores. I do not think
each element needs its own xxx_struct.

> and don't
> need the abstraction.

And even if struct percpu_rw_semaphore will be the only container of
xxx_struct, I think the code looks better and more understandable this
way, exactly because it adds the new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > 	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > 	long flags;
> >
> > 	BUG_ON(xxx->gp_state != GP_PASSED);
> > 	BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > 	spin_lock_irqsave(&xxx->xxx_lock, flags);
> > 	if (xxx->gp_count) {
> > 		xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > 	} else if (xxx->cb_state == CB_REPLAY) {
> > 		xxx->cb_state = CB_PENDING;
> > 		call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh I disagree. Even ignoring the fact I believe xxx_struct itself
can have more users (I can be wrong of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive"
should be the argument of percpu_init_rwsem). And in fact the
initial implementation I sent didn't even have the "exclusive" mode.

Please look at uprobes (currently the only user). We do not really
need the global write-lock, we can do the per-uprobe locking. However,
every caller needs to block the percpu_down_read() callers (dup_mmap).

> Other than that; yes this makes sense if you care about write side
> performance and I think it's solid.

Great ;)

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-30 15:06             ` Peter Zijlstra
@ 2013-09-30 16:58               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we?
> >
> > Without that there's no guarantee the fast path readers will have a MB
> > to observe the write critical section, unless I'm completely missing
> > something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing
even ignoring "xxx".

OK, I'll try to invent the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and "rcu_domain/ops"
later, currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
@ 2013-09-30 16:58               ` Oleg Nesterov
  0 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-09-30 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we?
> >
> > Without that there's no guarantee the fast path readers will have a MB
> > to observe the write critical section, unless I'm completely missing
> > something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing
even ignoring "xxx".

OK, I'll try to invent the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and "rcu_domain/ops"
later, currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-28 16:31                                               ` Oleg Nesterov
@ 2013-09-30 20:11                                                 ` Rafael J. Wysocki
  -1 siblings, 0 replies; 361+ messages in thread
From: Rafael J. Wysocki @ 2013-09-30 20:11 UTC (permalink / raw)
  To: Oleg Nesterov, Srivatsa S. Bhat
  Cc: Peter Zijlstra, Paul E. McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
> On 09/28, Peter Zijlstra wrote:
> >
> > On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> >
> > > Please note that this wait_event() adds a problem... it doesn't allow
> > > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > > in this case. We can solve this, but this wait_event() complicates
> > > the problem.
> >
> > That seems like a particularly easy fix; something like so?
> 
> Yes, but...
> 
> > @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
> >
> > +	cpu_hotplug_done();
> > +
> > +	for_each_cpu(cpu, frozen_cpus)
> > +		cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
> 
> This changes the protocol, I simply do not know if it is fine in general
> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
> currently it is possible that CPU_DOWN_PREPARE takes some global lock
> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
> 
> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
> this notification if FROZEN. So yes, probably this is fine, but needs an
> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
> fine to call __cpufreq_remove_dev_prepare() twice without _finish().

To my eyes it will return -EBUSY when it tries to stop an already stopped
governor, which will cause the entire chain to fail I guess.

Srivatsa has touched that code most recently, so he should know better, though.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-09-30 20:11                                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 361+ messages in thread
From: Rafael J. Wysocki @ 2013-09-30 20:11 UTC (permalink / raw)
  To: Oleg Nesterov, Srivatsa S. Bhat
  Cc: Peter Zijlstra, Paul E. McKenney, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
> On 09/28, Peter Zijlstra wrote:
> >
> > On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> >
> > > Please note that this wait_event() adds a problem... it doesn't allow
> > > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > > in this case. We can solve this, but this wait_event() complicates
> > > the problem.
> >
> > That seems like a particularly easy fix; something like so?
> 
> Yes, but...
> 
> > @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
> >
> > +	cpu_hotplug_done();
> > +
> > +	for_each_cpu(cpu, frozen_cpus)
> > +		cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
> 
> This changes the protocol, I simply do not know if it is fine in general
> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
> currently it is possible that CPU_DOWN_PREPARE takes some global lock
> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
> 
> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
> this notification if FROZEN. So yes, probably this is fine, but needs an
> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
> fine to call __cpufreq_remove_dev_prepare() twice without _finish().

To my eyes it will return -EBUSY when it tries to stop an already stopped
governor, which will cause the entire chain to fail I guess.

Srivatsa has touched that code most recently, so he should know better, though.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-27 20:41                                         ` Peter Zijlstra
@ 2013-10-01  3:56                                           ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01  3:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:

[ . . . ]

> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > +	smp_mb(); /* B matches A */
> > > +
> > > +	/*
> > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +	 */
> > >  
> > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > +		return false;
> > >  
> > > +	smp_mb(); /* D matches C */
> > 
> > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(), note that
> > __srcu_read_lock() always succeeds unlike get_online_cpus().
> 
> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
> 
> Paul; is there any argument for the MB here as opposed to RMB; and if
> not should we change both these and SRCU?

Given that these memory barriers execute only on the semi-slow path,
why add the complexity of moving from smp_mb() to either smp_rmb()
or smp_wmb()?  Straight smp_mb() is easier to reason about and more
robust against future changes.

							Thanx, Paul
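
(Aside: the quoted hunk assumes a per_cpu_sum() helper. A minimal
reconstruction of what such a helper looks like; the actual patch may spell it
differently.)

	#define per_cpu_sum(var)					\
	({								\
		typeof(var) __sum = 0;					\
		int __cpu;						\
		for_each_possible_cpu(__cpu)				\
			__sum += per_cpu(var, __cpu);			\
		__sum;							\
	})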


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01  3:56                                           ` Paul E. McKenney
  0 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01  3:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:

[ . . . ]

> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > +	smp_mb(); /* B matches A */
> > > +
> > > +	/*
> > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +	 */
> > >  
> > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > +		return false;
> > >  
> > > +	smp_mb(); /* D matches C */
> > 
> > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(), note that
> > __srcu_read_lock() always succeeds unlike get_online_cpus().
> 
> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
> 
> Paul; is there any argument for the MB here as opposed to RMB; and if
> not should we change both these and SRCU?

Given that these memory barriers execute only on the semi-slow path,
why add the complexity of moving from smp_mb() to either smp_rmb()
or smp_wmb()?  Straight smp_mb() is easier to reason about and more
robust against future changes.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01  3:56                                           ` Paul E. McKenney
@ 2013-10-01 14:14                                             ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 14:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/30, Paul E. McKenney wrote:
>
> On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
>
> [ . . . ]
>
> > > > +static bool cpuhp_readers_active_check(void)
> > > >  {
> > > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > +	smp_mb(); /* B matches A */
> > > > +
> > > > +	/*
> > > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > +	 */
> > > >
> > > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > +		return false;
> > > >
> > > > +	smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> >
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB; and if
> > not should we change both these and SRCU?
>
> Given that these memory barriers execute only on the semi-slow path,
> why add the complexity of moving from smp_mb() to either smp_rmb()
> or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> robust against future changes.

But otoh this looks misleading, and the comments add more confusion.

But please note another email, it seems to me we can simply kill
cpuhp_seq and all the barriers in cpuhp_readers_active_check().
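
(Sketch of the simplification being argued for here; this is not posted code,
and whether the barriers can really go is exactly what the following messages
debate.)

	static bool cpuhp_readers_active_check(void)
	{
		/*
		 * The writer has already set readers_block and issued its
		 * memory barrier; any reader that missed that will block,
		 * so a plain sum of the refcounts is claimed to be enough.
		 */
		return per_cpu_sum(__cpuhp_refcount) == 0;
	}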

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01 14:14                                             ` Oleg Nesterov
  0 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 14:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 09/30, Paul E. McKenney wrote:
>
> On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
>
> [ . . . ]
>
> > > > +static bool cpuhp_readers_active_check(void)
> > > >  {
> > > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > +	smp_mb(); /* B matches A */
> > > > +
> > > > +	/*
> > > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > +	 */
> > > >
> > > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > +		return false;
> > > >
> > > > +	smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> >
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB; and if
> > not should we change both these and SRCU?
>
> Given that these memory barriers execute only on the semi-slow path,
> why add the complexity of moving from smp_mb() to either smp_rmb()
> or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> robust against future changes.

But otoh this looks misleading, and the comments add more confusion.

But please note another email, it seems to me we can simply kill
cpuhp_seq and all the barriers in cpuhp_readers_active_check().

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 14:14                                             ` Oleg Nesterov
@ 2013-10-01 14:45                                               ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 14:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> On 09/30, Paul E. McKenney wrote:
> >
> > On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > > On 09/26, Peter Zijlstra wrote:
> >
> > [ . . . ]
> >
> > > > > +static bool cpuhp_readers_active_check(void)
> > > > >  {
> > > > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > > +
> > > > > +	smp_mb(); /* B matches A */
> > > > > +
> > > > > +	/*
> > > > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > > +	 */
> > > > >
> > > > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > > +		return false;
> > > > >
> > > > > +	smp_mb(); /* D matches C */
> > > >
> > > > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> > >
> > > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > > there are no writes to order.
> > >
> > > Paul; is there any argument for the MB here as opposed to RMB; and if
> > > not should we change both these and SRCU?
> >
> > Given that these memory barriers execute only on the semi-slow path,
> > why add the complexity of moving from smp_mb() to either smp_rmb()
> > or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> > robust against future changes.
> 
> But otoh this looks misleading, and the comments add more confusion.
> 
> But please note another email, it seems to me we can simply kill
> cpuhp_seq and all the barriers in cpuhp_readers_active_check().

If you don't have cpuhp_seq, you need some other way to avoid
counter overflow.  Which might be provided by limited number of
tasks, or, on 64-bit systems, 64-bit counters.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01 14:45                                               ` Paul E. McKenney
  0 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 14:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> On 09/30, Paul E. McKenney wrote:
> >
> > On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > > On 09/26, Peter Zijlstra wrote:
> >
> > [ . . . ]
> >
> > > > > +static bool cpuhp_readers_active_check(void)
> > > > >  {
> > > > > +	unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > > +
> > > > > +	smp_mb(); /* B matches A */
> > > > > +
> > > > > +	/*
> > > > > +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > > +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > > +	 */
> > > > >
> > > > > +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > > +		return false;
> > > > >
> > > > > +	smp_mb(); /* D matches C */
> > > >
> > > > It seems that both barriers could be smp_rmb() ? I am not sure the comments
> > > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > > __srcu_read_lock() always succeeds unlike get_online_cpus().
> > >
> > > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > > there are no writes to order.
> > >
> > > Paul; is there any argument for the MB here as opposed to RMB; and if
> > > not should we change both these and SRCU?
> >
> > Given that these memory barriers execute only on the semi-slow path,
> > why add the complexity of moving from smp_mb() to either smp_rmb()
> > or smp_wmb()?  Straight smp_mb() is easier to reason about and more
> > robust against future changes.
> 
> But otoh this looks misleading, and the comments add more confusion.
> 
> But please note another email, it seems to me we can simply kill
> cpuhp_seq and all the barriers in cpuhp_readers_active_check().

If you don't have cpuhp_seq, you need some other way to avoid
counter overflow.  Which might be provided by limited number of
tasks, or, on 64-bit systems, 64-bit counters.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 14:45                                               ` Paul E. McKenney
@ 2013-10-01 14:48                                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-01 14:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.  Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

How so? PID space is basically limited to 30 bits, so how could we
overflow a 32bit reference counter?

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01 14:48                                                 ` Peter Zijlstra
  0 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-01 14:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.  Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

How so? PID space is basically limited to 30 bits, so how could we
overflow a 32bit reference counter?


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 14:45                                               ` Paul E. McKenney
@ 2013-10-01 15:00                                                 ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 15:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> >
> > But please note another email, it seems to me we can simply kill
> > cpuhp_seq and all the barriers in cpuhp_readers_active_check().
>
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.

I don't think so. Overflows (especially "unsigned") should be fine and
in fact we can't avoid them.

Say, a task does get() on CPU_0 and put() on CPU_1, after that we have

	CTR[0] == 1, CTR[1] = (unsigned)-1

iow, the counter was already overflowed (underflowed). But this is fine,
all we care about is  CTR[0] + CTR[1] == 0, and this is only true because
of another overflow.
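
(The same point as a tiny self-contained illustration:)

	static unsigned int ctr[2];	/* stand-ins for the two per-CPU counters */

	static void demo(void)
	{
		ctr[0] += 1;	/* get() on CPU_0: ctr[0] == 1          */
		ctr[1] -= 1;	/* put() on CPU_1: ctr[1] == 0xffffffff */

		/*
		 * Each per-CPU value has wrapped, but only the sum is ever
		 * tested, and unsigned arithmetic makes it come out right:
		 * (1 + 0xffffffff) mod 2^32 == 0.
		 */
		BUG_ON(ctr[0] + ctr[1] != 0);
	}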

But probably you meant another thing,

> Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

perhaps you meant that max_threads * max_depth can overflow the counter?
I don't think so... but OK, perhaps this counter should be u_long.

But how cpuhp_seq can help?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01 15:00                                                 ` Oleg Nesterov
  0 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 15:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> >
> > But please note another email, it seems to me we can simply kill
> > cpuhp_seq and all the barriers in cpuhp_readers_active_check().
>
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.

I don't think so. Overflows (especially "unsigned") should be fine and
in fact we can't avoid them.

Say, a task does get() on CPU_0 and put() on CPU_1, after that we have

	CTR[0] == 1, CTR[1] = (unsigned)-1

iow, the counter was already overflowed (underflowed). But this is fine,
all we care about is  CTR[0] + CTR[1] == 0, and this is only true because
of another overflow.

But probably you meant another thing,

> Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

perhaps you meant that max_threads * max_depth can overflow the counter?
I don't think so... but OK, perhaps this counter should be u_long.

But how cpuhp_seq can help?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 14:48                                                 ` Peter Zijlstra
@ 2013-10-01 15:24                                                   ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > If you don't have cpuhp_seq, you need some other way to avoid
> > counter overflow.  Which might be provided by limited number of
> > tasks, or, on 64-bit systems, 64-bit counters.
> 
> How so? PID space is basically limited to 30 bits, so how could we
> overflow a 32bit reference counter?

Nesting.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
@ 2013-10-01 15:24                                                   ` Paul E. McKenney
  0 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > If you don't have cpuhp_seq, you need some other way to avoid
> > counter overflow.  Which might be provided by limited number of
> > tasks, or, on 64-bit systems, 64-bit counters.
> 
> How so? PID space is basically limited to 30 bits, so how could we
> overflow a 32bit reference counter?

Nesting.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 15:24                                                   ` Paul E. McKenney
@ 2013-10-01 15:34                                                     ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 15:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > > If you don't have cpuhp_seq, you need some other way to avoid
> > > counter overflow.  Which might be provided by limited number of
> > > tasks, or, on 64-bit systems, 64-bit counters.
> >
> > How so? PID space is basically limited to 30 bits, so how could we
> > overflow a 32bit reference counter?
>
> Nesting.

Still it seems that UINT_MAX / PID_MAX_LIMIT has enough room.
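
(Back of the envelope, assuming PID_MAX_LIMIT = 4*1024*1024, the 64-bit
maximum:)

	/*
	 * UINT_MAX / PID_MAX_LIMIT = 4294967295 / 4194304 ~= 1024,
	 * i.e. even with a fully populated PID space every task would
	 * still need on the order of a thousand nested get()s before a
	 * 32-bit sum could wrap.
	 */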

But again, OK let's make it ulong. The question is, how cpuhp_seq can
help and why we can't kill it.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-29 13:56                                         ` Oleg Nesterov
@ 2013-10-01 15:38                                           ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 15:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> On 09/27, Oleg Nesterov wrote:
> >
> > I tried hard to find any hole in this version but failed, I believe it
> > is correct.
> 
> And I still believe it is. But now I am starting to think that we
> don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

Here is one scenario that I believe requires cpuhp_seq:

1.	Task 0 on CPU 0 increments its counter on entry.

2.	Task 1 on CPU 1 starts summing the counters and gets to
	CPU 4.  The sum thus far is 1 (Task 0).

3.	Task 2 on CPU 2 increments its counter on entry.
	Upon completing its entry code, it re-enables preemption.

4.	Task 2 is preempted, and starts running on CPU 5.

5.	Task 2 decrements its counter on exit.

6.	Task 1 continues summing.  Due to the fact that it saw Task 2's
	exit but not its entry, the sum is zero.

One of cpuhp_seq's jobs is to prevent this scenario.
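
Roughly, what I have in mind is the SRCU-style readers-active check; this
is only a sketch from memory, not necessarily the exact code in Peter's
patch:

	/*
	 * Readers bump cpuhp_seq after their __cpuhp_refcount increment,
	 * with a full barrier in between.  The writer snapshots the seq
	 * sum, checks that the refcount sum is zero, then re-reads the
	 * seq sum: if it saw a decrement whose increment it missed, the
	 * matching seq increment must be visible by the second read, so
	 * the two sums differ and the writer retries.
	 */
	static bool readers_active_check(void)
	{
		unsigned int seq = per_cpu_sum(cpuhp_seq);

		smp_mb();  /* pairs with the barrier in the reader entry path */

		if (per_cpu_sum(__cpuhp_refcount) != 0)
			return false;

		smp_mb();  /* order the refcount sum against the seq re-read */

		return per_cpu_sum(cpuhp_seq) == seq;
	}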

That said, bozo here still hasn't gotten to look at Peter's newest patch,
so perhaps it prevents this scenario some other way, perhaps by your
argument below.

> > We need to ensure 2 things:
> >
> > 1. The reader should notice state = BLOCK or the writer should see
> >    inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> >    __get_online_cpus() and in cpu_hotplug_begin().
> >
> >    We do not care if the writer misses some inc(__cpuhp_refcount)
> >    in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> >    state = readers_block (and inc(cpuhp_seq) can't help anyway).
> 
> Yes!

OK, I will look over the patch with this in mind.

> > 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> >    from __put_online_cpus() (note that the writer can miss the
> >    corresponding inc() if it was done on another CPU, so this dec()
> >    can lead to sum() == 0),
> 
> But this can't happen in this version? Somehow I forgot that
> __get_online_cpus() does inc/get under preempt_disable(), always on
> the same CPU. And thanks to mb's the writer should not miss the
> reader which has already passed the "state != BLOCK" check.
> 
> To simplify the discussion, let's ignore the "readers_fast" state;
> its synchronize_sched() logic looks obviously correct. IOW, let's discuss
> only the SLOW -> BLOCK transition.
> 
> 	cpu_hotplug_begin()
> 	{
> 		state = BLOCK;
> 
> 		mb();
> 
> 		wait_event(cpuhp_writer,
> 				per_cpu_sum(__cpuhp_refcount) == 0);
> 	}
> 
> should work just fine? Ignoring all details, we have
> 
> 	get_online_cpus()
> 	{
> 	again:
> 		preempt_disable();
> 
> 		__this_cpu_inc(__cpuhp_refcount);
> 
> 		mb();
> 
> 		if (state == BLOCK) {
> 
> 			mb();
> 
> 			__this_cpu_dec(__cpuhp_refcount);
> 			wake_up_all(cpuhp_writer);
> 
> 			preempt_enable();
> 			wait_event(state != BLOCK);
> 			goto again;
> 		}
> 
> 		preempt_enable();
> 	}
> 
> It seems to me that these mb's guarantee all we need, no?
> 
> It looks really simple. The reader can only succeed if it doesn't see
> BLOCK; in this case per_cpu_sum() should see the change.
> 
> We have
> 
> 	WRITER					READER on CPU X
> 
> 	state = BLOCK;				__cpuhp_refcount[X]++;
> 
> 	mb();					mb();
> 
> 	...
> 	count += __cpuhp_refcount[X];		if (state != BLOCK)
> 	...						return;
> 
> 						mb();
> 						__cpuhp_refcount[X]--;
> 
> Either reader or writer should notice the STORE we care about.
> 
> If a reader can decrement __cpuhp_refcount, we have 2 cases:
> 
> 	1. It is the reader holding this lock. In this case we
> 	   can't miss the corresponding inc() done by this reader,
> 	   because this reader didn't see BLOCK in the past.
> 
> 	   It is just the
> 
> 			A == B == 0
> 	   	CPU_0			CPU_1
> 	   	-----			-----
> 	   	A = 1;			B = 1;
> 	   	mb();			mb();
> 	   	b = B;			a = A;
> 
> 	   pattern, at least one CPU should see 1 in its a/b.
> 
> 	2. It is the reader which tries to take this lock and
> 	   noticed state == BLOCK. We could miss the result of
> 	   its inc(), but we do not care, this reader is going
> 	   to block.
> 
> 	   _If_ the reader could migrate between inc/dec, then
> 	   yes, we have a problem. Because that dec() could make
> 	   the result of per_cpu_sum() = 0. IOW, we could miss
> 	   inc() but notice dec(). But given that it does this
> 	   on the same CPU this is not possible.
> 
> So why do we need cpuhp_seq?

Good question, I will look again.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 15:38                                           ` Paul E. McKenney
@ 2013-10-01 15:40                                             ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 15:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On 10/01, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> > On 09/27, Oleg Nesterov wrote:
> > >
> > > I tried hard to find any hole in this version but failed, I believe it
> > > is correct.
> >
> > And I still believe it is. But now I am starting to think that we
> > don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).
>
> Here is one scenario that I believe requires cpuhp_seq:
>
> 1.	Task 0 on CPU 0 increments its counter on entry.
>
> 2.	Task 1 on CPU 1 starts summing the counters and gets to
> 	CPU 4.  The sum thus far is 1 (Task 0).
>
> 3.	Task 2 on CPU 2 increments its counter on entry.
> 	Upon completing its entry code, it re-enables preemption.

afaics at this stage it should notice state = BLOCK and decrement
the same counter on the same CPU before it does preempt_enable().

Because:

> > 	2. It is the reader which tries to take this lock and
> > 	   noticed state == BLOCK. We could miss the result of
> > 	   its inc(), but we do not care, this reader is going
> > 	   to block.
> >
> > 	   _If_ the reader could migrate between inc/dec, then
> > 	   yes, we have a problem. Because that dec() could make
> > 	   the result of per_cpu_sum() = 0. IOW, we could miss
> > 	   inc() but notice dec(). But given that it does this
> > 	   on the same CPU this is not possible.
> >
> > So why do we need cpuhp_seq?
>
> Good question, I will look again.

Thanks! Much appreciated.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-30 20:11                                                 ` Rafael J. Wysocki
@ 2013-10-01 17:11                                                   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 361+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-01 17:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Oleg Nesterov, Peter Zijlstra, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On 10/01/2013 01:41 AM, Rafael J. Wysocki wrote:
> On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
>> On 09/28, Peter Zijlstra wrote:
>>>
>>> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>>>
>>>> Please note that this wait_event() adds a problem... it doesn't allow
>>>> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
>>>> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
>>>> in this case. We can solve this, but this wait_event() complicates
>>>> the problem.
>>>
>>> That seems like a particularly easy fix; something like so?
>>
>> Yes, but...
>>
>>> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>>>
>>> +	cpu_hotplug_done();
>>> +
>>> +	for_each_cpu(cpu, frozen_cpus)
>>> +		cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
>>
>> This changes the protocol, I simply do not know if it is fine in general
>> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
>> currently it is possible that CPU_DOWN_PREPARE takes some global lock
>> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
>>
>> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
>> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
>> this notification if FROZEN. So yes, probably this is fine, but needs an
>> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
>> fine to call __cpufreq_remove_dev_prepare() twice without _finish().
> 
> To my eyes it will return -EBUSY when it tries to stop an already stopped
> governor, which will cause the entire chain to fail I guess.
>
> Srivatsa has touched that code most recently, so he should know better, though.
> 

Yes, it will return -EBUSY, but unfortunately it gets scarier from that
point onwards. When it gets an -EBUSY, __cpufreq_remove_dev_prepare() aborts
its work mid-way and returns, but doesn't bubble up the error to the CPU-hotplug
core. So the CPU hotplug code will continue to take that CPU down, with
further notifications such as CPU_DEAD, and chaos will ensue.

And we can't exactly "fix" this by simply returning the error code to CPU-hotplug
(since that would mean that suspend/resume would _always_ fail). Perhaps we can
teach cpufreq to ignore the error in this particular case (since the governor has
already been stopped and that's precisely what this function wanted to do as well),
but the problems don't seem to end there.

The other issue is that the CPUs in the policy->cpus mask are removed in the
_dev_finish() stage. So if that stage is postponed like this, then _dev_prepare()
will get thoroughly confused since it also depends on seeing an updated
policy->cpus mask to decide when to nominate a new policy->cpu etc. (And the
cpu nomination code itself might start ping-ponging between CPUs, since none of
the CPUs would have been removed from the policy->cpus mask).

So, to summarize, this change to the CPU hotplug code will break cpufreq (and
suspend/resume) as things stand today, but I don't think these problems are
insurmountable.
 
However, as Oleg said, its definitely worth considering whether this proposed
change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
proved to be very useful in certain challenging situations (commit 1aee40ac9c
explains one such example), so IMHO we should be very careful not to undermine
its utility.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:11                                                   ` Srivatsa S. Bhat
@ 2013-10-01 17:36                                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-01 17:36 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Rafael J. Wysocki, Oleg Nesterov, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> However, as Oleg said, its definitely worth considering whether this proposed
> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> proved to be very useful in certain challenging situations (commit 1aee40ac9c
> explains one such example), so IMHO we should be very careful not to undermine
> its utility.

Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
called at some time after the unplug' with no further guarantees. And my
patch preserves that.

Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
doesn't explain it.

What's wrong with leaving a cleanup handle in percpu storage and
effectively doing:

struct cpu_destroy {
	void (*destroy)(void *);
	void *arg;
};

DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);

	POST_DEAD:
	{
		struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
		if (x.destroy)
			x.destroy(x.arg);
	}

POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
will succeed; it has to.
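
For completeness, the registration side could then be as simple as this
(subsys_cleanup_cpu() and subsys_state() are made-up names, not cpufreq's
actual code):

	case CPU_DEAD:
		/* Queue the expensive teardown for the POST_DEAD stage. */
		per_cpu(cpu_destroy, cpu) = (struct cpu_destroy) {
			.destroy = subsys_cleanup_cpu,
			.arg     = subsys_state(cpu),
		};
		break;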

The cpufreq situation simply doesn't make any kind of sense to me.



^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:36                                                     ` Peter Zijlstra
@ 2013-10-01 17:45                                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > However, as Oleg said, its definitely worth considering whether this proposed
> > change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> > proved to be very useful in certain challenging situations (commit 1aee40ac9c
> > explains one such example), so IMHO we should be very careful not to undermine
> > its utility.
>
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.

I tend to agree with Srivatsa... Without a strong reason it would be better
to preserve the current logic: "some time after" should not be after the
next CPU_DOWN/UP*. But I won't argue too much.

But note that you do not strictly need this change. Just kill cpuhp_waitcount,
then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
another thread, this should likely "join" all synchronize_sched's.

Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
of for_each_online_cpu().
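
IOW, disable_nonboot_cpus() would look roughly like this (the two
cpu_hotplug_* helper names are invented, just to show the shape):

	cpu_hotplug_fast_to_slow();	/* one synchronize_sched() for everything */

	for_each_online_cpu(cpu) {
		if (cpu == first_cpu)
			continue;
		/* each down now only needs the cheap SLOW -> BLOCK step */
		error = _cpu_down(cpu, 1);
		if (!error)
			cpumask_set_cpu(cpu, frozen_cpus);
		else
			break;
	}

	cpu_hotplug_slow_to_fast();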

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:45                                                       ` Oleg Nesterov
@ 2013-10-01 17:56                                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-01 17:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > > However, as Oleg said, its definitely worth considering whether this proposed
> > > change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> > > proved to be very useful in certain challenging situations (commit 1aee40ac9c
> > > explains one such example), so IMHO we should be very careful not to undermine
> > > its utility.
> >
> > Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> > called at some time after the unplug' with no further guarantees. And my
> > patch preserves that.
> 
> I tend to agree with Srivatsa... Without a strong reason it would be better
> to preserve the current logic: "some time after" should not be after the
> next CPU_DOWN/UP*. But I won't argue too much.

Nah, I think breaking it is the right thing :-)

> But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> another thread, this should likely "join" all synchronize_sched's.

That would still be 4k * sync_sched() == terribly long.
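(Even assuming only a few milliseconds per synchronize_sched(), 2 * 4k of
them back to back adds tens of seconds to suspend.)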

> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> of for_each_online_cpu().

Right, that's more messy but would work if we cannot teach cpufreq (and
possibly others) to not rely on state you shouldn't rely on anyway.

I think the only guarantee POST_DEAD should have is that it should be
called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:56                                                         ` Peter Zijlstra
@ 2013-10-01 18:07                                                           ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-01 18:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> >
> > I tend to agree with Srivatsa... Without a strong reason it would be better
> > to preserve the current logic: "some time after" should not be after the
> > next CPU_DOWN/UP*. But I won't argue too much.
>
> Nah, I think breaking it is the right thing :-)

I don't really agree but I won't argue ;)

> > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > another thread, this should likely "join" all synchronize_sched's.
>
> That would still be 4k * sync_sched() == terribly long.

No? The next xxx_enter() avoids sync_sched() if the rcu callback is still
pending. Unless __cpufreq_remove_dev_finish() is "too slow", of course.

> > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> > of for_each_online_cpu().
>
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.

Yes,

> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.

See above... This makes POST_DEAD really "special" compared to other
CPU_* events.

And again. Something like a global lock taken by CPU_DOWN_PREPARE and
released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.
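
Say, a (hypothetical) notifier fragment like this, which is fine with the
current ordering but not if another DOWN_PREPARE can run before the
previous POST_DEAD (subsys_lock is invented for the example):

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_DOWN_PREPARE:
		mutex_lock(&subsys_lock);	/* held across the whole operation */
		break;
	case CPU_DOWN_FAILED:
	case CPU_POST_DEAD:
		mutex_unlock(&subsys_lock);	/* dropped only once it is over */
		break;
	}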

But I leave this to you and Srivatsa.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:36                                                     ` Peter Zijlstra
@ 2013-10-01 18:14                                                       ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 361+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-01 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>> However, as Oleg said, its definitely worth considering whether this proposed
>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>> explains one such example), so IMHO we should be very careful not to undermine
>> its utility.
> 
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.
> 
> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
> doesn't explain it.
>

Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
than that. I was just saying that the cpufreq code would need certain additional
changes/restructuring to accommodate the change in the semantics brought about
by this patch. IOW, it won't work as it is, but it can certainly be fixed.

My other point (unrelated to cpufreq) was this: POST_DEAD of course means
that it will be called after unplug, with hotplug lock dropped. But it also
provides the guarantee (in the existing code) that a *new* hotplug operation
won't start until the POST_DEAD stage is also completed. This patch doesn't seem
to honor that part. The concern I have is in cases like those mentioned by
Oleg - say you take a lock at DOWN_PREPARE and want to drop it at POST_DEAD;
or some other requirement that makes it important to finish a full hotplug cycle
before moving on to the next one. I don't really have such a requirement in mind
at present, but I was just trying to think what we would be losing with this
change...

But to reiterate, I believe cpufreq can be reworked so that it doesn't depend
on things such as the above. But I wonder if dropping that latter guarantee
is going to be OK, going forward.

Regards,
Srivatsa S. Bhat
 
> What's wrong with leaving a cleanup handle in percpu storage and
> effectively doing:
> 
> struct cpu_destroy {
> 	void (*destroy)(void *);
> 	void *arg;
> };
> 
> DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);
> 
> 	POST_DEAD:
> 	{
> 		struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
> 		if (x.destroy)
> 			x.destroy(x.arg);
> 	}
> 
> POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
> will succeed; it has to.
> 
> The cpufreq situation simply doesn't make any kind of sense to me.
> 
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 18:14                                                       ` Srivatsa S. Bhat
@ 2013-10-01 18:56                                                         ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 361+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-01 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
> 
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
> than that. I was just saying that the cpufreq code would need certain additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
> 

And an important reason why this change can be accommodated without much
trouble is that you are changing it only in the suspend/resume path, where
userspace has already been frozen, so all hotplug operations are initiated by
the suspend path and that path *alone* (and so we enjoy certain "simplifiers" that
we know beforehand, e.g. all of them are CPU offline operations, happening one at
a time, in sequence) and we don't expect any "interference" with this routine ;-).
As a result the number and variety of races that we need to take care of tend to
be far smaller. (For example, we don't have to worry about the deadlock caused by
the sysfs-writes that 1aee40ac9c was talking about).

On the other hand, if the proposal were to change the regular hotplug path along
the same lines as well, then I guess it would have been a little more difficult to
adjust to it. For example, in cpufreq, _dev_prepare() sends a STOP to the governor,
whereas a part of _dev_finish() sends a START to it; so we might have races there,
due to which we might proceed with CPU offline with a running governor, depending
on the exact timing of the events. Of course, this problem doesn't occur in the
suspend/resume case, and hence I didn't bring it up in my previous mail.

So this is another reason why I'm a little concerned about POST_DEAD: since this
is a change in semantics, it might be worth asking ourselves whether we'd still
want to go with that change, if we happened to be changing regular hotplug as
well, rather than just the more controlled environment of suspend/resume.
Yes, I know that's not what you proposed, but I feel it might be worth considering
its implications while deciding how to solve the POST_DEAD issue.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 17:56                                                         ` Peter Zijlstra
@ 2013-10-01 19:03                                                           ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 361+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-01 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Rafael J. Wysocki, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On 10/01/2013 11:26 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
>> On 10/01, Peter Zijlstra wrote:
>>>
>>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>>> However, as Oleg said, its definitely worth considering whether this proposed
>>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>>> explains one such example), so IMHO we should be very careful not to undermine
>>>> its utility.
>>>
>>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>>> called at some time after the unplug' with no further guarantees. And my
>>> patch preserves that.
>>
>> I tend to agree with Srivatsa... Without a strong reason it would be better
>> to preserve the current logic: "some time after" should not be after the
>> next CPU_DOWN/UP*. But I won't argue too much.
> 
> Nah, I think breaking it is the right thing :-)
> 
>> But note that you do not strictly need this change. Just kill cpuhp_waitcount,
>> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
>> another thread, this should likely "join" all synchronize_sched's.
> 
> That would still be 4k * sync_sched() == terribly long.
> 
>> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
>> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
>> of for_each_online_cpu().
> 
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.
> 
> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
> 

Conceptually, that hints at a totally per-cpu implementation of CPU hotplug,
in which what happens to one CPU doesn't affect the others in the hotplug
path.. and yeah, that sounds very tempting! ;-) but I guess that will
need to be preceded by a massive rework of many of the existing hotplug
callbacks ;-)

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 18:07                                                           ` Oleg Nesterov
@ 2013-10-01 19:05                                                             ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 19:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Srivatsa S. Bhat, Rafael J. Wysocki, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar, tony.luck, bp

On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > >
> > > I tend to agree with Srivatsa... Without a strong reason it would be better
> > > to preserve the current logic: "some time after" should not be after the
> > > next CPU_DOWN/UP*. But I won't argue too much.
> >
> > Nah, I think breaking it is the right thing :-)
> 
> I don't really agree but I won't argue ;)

The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
guys who would need to complain, given that they seem to have the only
use in 3.11.

							Thanx, Paul

> > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
> 
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
> 
> > > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> > > of for_each_online_cpu().
> >
> > Right, that's more messy but would work if we cannot teach cpufreq (and
> > possibly others) to not rely on state you shouldn't rely on anyway.
> 
> Yes,
> 
> > I think the only guarantee POST_DEAD should have is that it should be
> > called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
> 
> See above... This makes POST_DEAD really "special" compared to other
> CPU_* events.
> 
> And again. Something like a global lock taken by CPU_DOWN_PREPARE and
> released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.
> 
> But I leave this to you and Srivatsa.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-09-26 11:10                                 ` Peter Zijlstra
@ 2013-10-01 20:40                                   ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-01 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Ingo Molnar, Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt

On Thu, Sep 26, 2013 at 01:10:42PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> > A couple of nits and some commentary, but if there are races, they are
> > quite subtle.  ;-)
> 
> *whee*..
> 
> I made one little change in the logic; I moved the waitcount increment
> to before the __put_online_cpus() call, such that the writer will have
> to wait for us to wake up before trying again -- not for us to actually
> have acquired the read lock, for that we'd need to mess up
> __get_online_cpus() a bit more.
> 
> Complete patch below.

OK, looks like Oleg is correct, the cpuhp_seq can be dispensed with.

I still don't see anything wrong with it, so time for a serious stress
test on a large system.  ;-)

Additional commentary interspersed.

							Thanx, Paul

> ---
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Sep 17 16:17:11 CEST 2013
> 
> The current implementation of get_online_cpus() is global of nature
> and thus not suited for any kind of common usage.
> 
> Re-implement the current recursive r/w cpu hotplug lock such that the
> read side locks are as light as possible.
> 
> The current cpu hotplug lock is entirely reader biased; but since
> readers are expensive there aren't a lot of them about and writer
> starvation isn't a particular problem.
> 
> However by making the reader side more usable there is a fair chance
> it will get used more and thus the starvation issue becomes a real
> possibility.
> 
> Therefore this new implementation is fair, alternating readers and
> writers; this however requires per-task state to allow the reader
> recursion.
> 
> Many comments are contributed by Paul McKenney, and many previous
> attempts were shown to be inadequate by both Paul and Oleg; many
> thanks to them for persisting to poke holes in my attempts.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  include/linux/cpu.h   |   58 +++++++++++++
>  include/linux/sched.h |    3 
>  kernel/cpu.c          |  209 +++++++++++++++++++++++++++++++++++---------------
>  kernel/sched/core.c   |    2 
>  4 files changed, 208 insertions(+), 64 deletions(-)

I stripped the removed lines to keep my eyes from going buggy.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
>  #include <linux/node.h>
>  #include <linux/compiler.h>
>  #include <linux/cpumask.h>
> +#include <linux/percpu.h>
> 
>  struct device;
> 
> @@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Stop CPUs going up and down. */
> 
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
>  extern void cpu_hotplug_begin(void);
>  extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_state;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> +	might_sleep();
> +
> +	/* Support reader recursion */
> +	/* The value was >= 1 and remains so, reordering causes no harm. */
> +	if (current->cpuhp_ref++)
> +		return;
> +
> +	preempt_disable();
> +	if (likely(!__cpuhp_state)) {
> +		/* The barrier here is supplied by synchronize_sched(). */

I guess I shouldn't complain about the comment given where it came
from, but...

A more accurate comment would say that we are in an RCU-sched read-side
critical section, so the writer cannot both change __cpuhp_state from
readers_fast and start checking counters while we are here.  So if we see
!__cpuhp_state, we know that the writer won't be checking until we are past
the preempt_enable() and that once the synchronize_sched() is done,
the writer will see anything we did within this RCU-sched read-side
critical section.

(The writer -can- change __cpuhp_state from readers_slow to readers_block
while we are in this read-side critical section and then start summing
counters, but that corresponds to a different "if" statement.)

> +		__this_cpu_inc(__cpuhp_refcount);
> +	} else {
> +		__get_online_cpus(); /* Unconditional memory barrier. */
> +	}
> +	preempt_enable();
> +	/*
> +	 * The barrier() from preempt_enable() prevents the compiler from
> +	 * bleeding the critical section out.
> +	 */
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> +	/* The value was >= 1 and remains so, reordering causes no harm. */
> +	if (--current->cpuhp_ref)
> +		return;
> +
> +	/*
> +	 * The barrier() in preempt_disable() prevents the compiler from
> +	 * bleeding the critical section out.
> +	 */
> +	preempt_disable();
> +	if (likely(!__cpuhp_state)) {
> +		/* The barrier here is supplied by synchronize_sched().  */

Same here, both for the implied self-criticism and the more complete story.

Due to the basic RCU guarantee, the writer cannot both change __cpuhp_state
and start checking counters while we are in this RCU-sched read-side
critical section.  And again, if the synchronize_sched() had to wait on
us (or if we were early enough that no waiting was needed), then once
the synchronize_sched() completes, the writer will see anything that we
did within this RCU-sched read-side critical section.

> +		__this_cpu_dec(__cpuhp_refcount);
> +	} else {
> +		__put_online_cpus(); /* Unconditional memory barrier. */
> +	}
> +	preempt_enable();
> +}
> +
>  extern void cpu_hotplug_disable(void);
>  extern void cpu_hotplug_enable(void);
>  #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
> @@ -200,6 +252,8 @@ static inline void cpu_hotplug_driver_un
> 
>  #else		/* CONFIG_HOTPLUG_CPU */
> 
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
>  static inline void cpu_hotplug_begin(void) {}
>  static inline void cpu_hotplug_done(void) {}
>  #define get_online_cpus()	do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> +	int		cpuhp_ref;
> +#endif
>  };
> 
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,173 @@ static int cpu_hotplug_disabled;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> 
> +enum { readers_fast = 0, readers_slow, readers_block };
> +
> +int __cpuhp_state;
> +EXPORT_SYMBOL_GPL(__cpuhp_state);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
> +static atomic_t cpuhp_waitcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> +	p->cpuhp_ref = 0;
> +}
> +
> +void __get_online_cpus(void)
> +{
> +again:
> +	/* See __srcu_read_lock() */
> +	__this_cpu_inc(__cpuhp_refcount);
> +	smp_mb(); /* A matches B, E */
> +	// __this_cpu_inc(cpuhp_seq);

Deleting the above per Oleg's suggestion.  We still need the preceding
memory barrier.

> +
> +	if (unlikely(__cpuhp_state == readers_block)) {
> +		/*
> +		 * Make sure an outgoing writer sees the waitcount to ensure
> +		 * we make progress.
> +		 */
> +		atomic_inc(&cpuhp_waitcount);
> +		__put_online_cpus();

The decrement happens on the same CPU as the increment, avoiding the
increment-on-one-CPU-and-decrement-on-another problem.

And yes, if the reader misses the writer's assignment of readers_block
to __cpuhp_state, then the writer is guaranteed to see the reader's
increment.  Conversely, any readers that increment their __cpuhp_refcount
after the writer looks are guaranteed to see the readers_block value,
which in turn means that they are guaranteed to immediately decrement
their __cpuhp_refcount, so that it doesn't matter that the writer
missed them.

Unfortunately, this trick does not apply back to SRCU, at least not
without adding a second memory barrier to the srcu_read_lock() path
(one to separate reading the index from incrementing the counter and
another to separate incrementing the counter from the critical section).
Can't have everything, I guess!
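
For contrast, an SRCU-style read lock would then need something like the sketch
below; this is schematic only, with placeholder names rather than the real SRCU
data structures:

	/* Placeholders, not the real srcu_struct layout. */
	DEFINE_PER_CPU(unsigned long, srcu_like_count[2]);
	int srcu_like_completed;

	static inline int srcu_like_read_lock(void)
	{
		int idx = ACCESS_ONCE(srcu_like_completed) & 0x1;

		smp_mb(); /* separate reading the index from the counter increment */
		__this_cpu_inc(srcu_like_count[idx]);
		smp_mb(); /* separate the counter increment from the critical section */

		return idx;
	}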

> +
> +		/*
> +		 * We either call schedule() in the wait, or we'll fall through
> +		 * and reschedule on the preempt_enable() in get_online_cpus().
> +		 */
> +		preempt_enable_no_resched();
> +		__wait_event(cpuhp_readers, __cpuhp_state != readers_block);
> +		preempt_disable();
> +
> +		if (atomic_dec_and_test(&cpuhp_waitcount))
> +			wake_up_all(&cpuhp_writer);

I still don't see why this is a wake_up_all() given that there can be
only one writer.  Not that it makes much difference, but...

> +
> +		goto again;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> 
> +void __put_online_cpus(void)
>  {
> +	/* See __srcu_read_unlock() */
> +	smp_mb(); /* C matches D */
> +	/*
> +	 * In other words, if they see our decrement (presumably to aggregate
> +	 * zero, as that is the only time it matters) they will also see our
> +	 * critical section.
> +	 */
> +	this_cpu_dec(__cpuhp_refcount);
> 
> +	/* Prod writer to recheck readers_active */
> +	wake_up_all(&cpuhp_writer);
>  }
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> +
> +#define per_cpu_sum(var)						\
> +({ 									\
> + 	typeof(var) __sum = 0;						\
> + 	int cpu;							\
> + 	for_each_possible_cpu(cpu)					\
> + 		__sum += per_cpu(var, cpu);				\
> + 	__sum;								\
> +})
> 
> +/*
> + * See srcu_readers_active_idx_check() for a rather more detailed explanation.
> + */
> +static bool cpuhp_readers_active_check(void)
>  {
> +	// unsigned int seq = per_cpu_sum(cpuhp_seq);

Delete the above per Oleg's suggestion.

> +
> +	smp_mb(); /* B matches A */
> +
> +	/*
> +	 * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> +	 * we are guaranteed to also see its __cpuhp_refcount increment.
> +	 */
> 
> +	if (per_cpu_sum(__cpuhp_refcount) != 0)
> +		return false;
> 
> +	smp_mb(); /* D matches C */
> 
> +	/*
> +	 * On equality, we know that there could not be any "sneak path" pairs
> +	 * where we see a decrement but not the corresponding increment for a
> +	 * given reader. If we saw its decrement, the memory barriers guarantee
> +	 * that we now see its cpuhp_seq increment.
> +	 */
> +
> +	// return per_cpu_sum(cpuhp_seq) == seq;

Delete the above per Oleg's suggestion, but actually need to replace with
"return true;".  We should be able to get rid of the first memory barrier
(B matches A) because the smp_mb() in cpu_hotplug_begin() covers it, but we
cannot get rid of the second memory barrier (D matches C).

>  }
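
With those deletions applied and the explicit return added, the check would reduce
to something like this (a sketch; I've kept the first barrier, since dropping it
relies on the smp_mb() in cpu_hotplug_begin()):

	static bool cpuhp_readers_active_check(void)
	{
		smp_mb(); /* B matches A; possibly subsumed by cpu_hotplug_begin() */

		if (per_cpu_sum(__cpuhp_refcount) != 0)
			return false;

		smp_mb(); /* D matches C */

		return true;
	}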
> 
>  /*
> + * This will notify new readers to block and wait for all active readers to
> + * complete.
>   */
>  void cpu_hotplug_begin(void)
>  {
> +	/*
> +	 * Since cpu_hotplug_begin() is always called after invoking
> +	 * cpu_maps_update_begin(), we can be sure that only one writer is
> +	 * active.
> +	 */
> +	lockdep_assert_held(&cpu_add_remove_lock);
> 
> +	/* Allow reader-in-writer recursion. */
> +	current->cpuhp_ref++;
> +
> +	/* Notify readers to take the slow path. */
> +	__cpuhp_state = readers_slow;
> +
> +	/* See percpu_down_write(); guarantees all readers take the slow path */
> +	synchronize_sched();
> +
> +	/*
> +	 * Notify new readers to block; up until now, and thus throughout the
> +	 * longish synchronize_sched() above, new readers could still come in.
> +	 */
> +	__cpuhp_state = readers_block;
> +
> +	smp_mb(); /* E matches A */
> +
> +	/*
> +	 * If they don't see our write of readers_block to __cpuhp_state,
> +	 * then we are guaranteed to see their __cpuhp_refcount increment, and
> +	 * therefore will wait for them.
> +	 */
> +
> +	/* Wait for all now active readers to complete. */
> +	wait_event(cpuhp_writer, cpuhp_readers_active_check());
>  }
> 
>  void cpu_hotplug_done(void)
>  {
> +	/* Signal the writer is done, no fast path yet. */
> +	__cpuhp_state = readers_slow;
> +	wake_up_all(&cpuhp_readers);

And one reason that we cannot just immediately flip to readers_fast
is that new readers might fail to see the results of this writer's
critical section.

> +
> +	/*
> +	 * The wait_event()/wake_up_all() prevents the race where the readers
> +	 * are delayed between fetching __cpuhp_state and blocking.
> +	 */
> +
> +	/* See percpu_up_write(); readers will no longer attempt to block. */
> +	synchronize_sched();
> +
> +	/* Let 'em rip */
> +	__cpuhp_state = readers_fast;
> +	current->cpuhp_ref--;
> +
> +	/*
> +	 * Wait for any pending readers to be running. This ensures readers
> +	 * after writer and avoids writers starving readers.
> +	 */
> +	wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
>  }
> 
>  /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
>  	INIT_LIST_HEAD(&p->numa_entry);
>  	p->numa_group = NULL;
>  #endif /* CONFIG_NUMA_BALANCING */
> +
> +	cpu_hotplug_init_task(p);
>  }
> 
>  #ifdef CONFIG_NUMA_BALANCING
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 18:07                                                           ` Oleg Nesterov
@ 2013-10-02  9:08                                                             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-02  9:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
> 
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.

Hmm, not in the version you posted; there xxx_enter() would skip the
sync_sched only if there's a concurrent 'writer', in which case it will
wait for it.

You only avoid the sync_sched in xxx_exit() and potentially join in the
sync_sched() of a next xxx_begin().

So with that scheme:

  for (i= ; i<4096; i++) {
    xxx_begin();
    xxx_exit();
  }

Will get 4096 sync_sched() calls from the xxx_begin() and all but the
last xxx_exit() will 'drop' the rcu callback.

And given the construct, I'm not entirely sure you can do away with the
sync_sched() in between. While it's clear to me you can merge the two
into one, leaving it out entirely doesn't seem right.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 18:14                                                       ` Srivatsa S. Bhat
@ 2013-10-02 10:14                                                         ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 361+ messages in thread
From: Srivatsa S. Bhat @ 2013-10-02 10:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Paul E. McKenney, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
> 
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
> than that. I was just saying that the cpufreq code would need certain additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
> 


Ok, so I thought a bit more about the changes you are proposing, and I agree
that they would be beneficial in the long run, especially since they can
eventually lead to a more streamlined hotplug process where different CPUs
can be hotplugged independently without waiting on each other, like you
mentioned in your other mail. So I'm fine with the new POST_DEAD guarantees
you are proposing - that the notifications run after unplug and complete
before UP_PREPARE of the same CPU. And it's also very convenient that we need
to fix only cpufreq to accommodate this change.

So below is a quick untested patch that modifies the cpufreq hotplug
callbacks appropriately. With this, cpufreq should be able to handle the
POST_DEAD changes, irrespective of whether we make them in the regular path
or in the suspend/resume path. (That's because I've restructured the code so
that the races I mentioned earlier are avoided entirely: the POST_DEAD
handler now performs only the bare minimum final cleanup, which doesn't race
with or depend on anything else.)



diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 04548f7..0a33c1a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1165,7 +1165,7 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 					bool frozen)
 {
 	unsigned int cpu = dev->id, cpus;
-	int new_cpu, ret;
+	int new_cpu, ret = 0;
 	unsigned long flags;
 	struct cpufreq_policy *policy;
 
@@ -1200,9 +1200,10 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 			policy->governor->name, CPUFREQ_NAME_LEN);
 #endif
 
-	lock_policy_rwsem_read(cpu);
+	lock_policy_rwsem_write(cpu);
 	cpus = cpumask_weight(policy->cpus);
-	unlock_policy_rwsem_read(cpu);
+	cpumask_clear_cpu(cpu, policy->cpus);
+	unlock_policy_rwsem_write(cpu);
 
 	if (cpu != policy->cpu) {
 		if (!frozen)
@@ -1220,7 +1221,23 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 		}
 	}
 
-	return 0;
+	/* If no target, nothing more to do */
+	if (!cpufreq_driver->target)
+		return 0;
+
+	/* If cpu is last user of policy, cleanup the policy governor */
+	if (cpus == 1) {
+		ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+		if (ret)
+			pr_err("%s: Failed to exit governor\n",	__func__);
+	} else {
+		if ((ret = __cpufreq_governor(policy, CPUFREQ_GOV_START)) ||
+				(ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS))) {
+			pr_err("%s: Failed to start governor\n", __func__);
+		}
+	}
+
+	return ret;
 }
 
 static int __cpufreq_remove_dev_finish(struct device *dev,
@@ -1243,25 +1260,12 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
 		return -EINVAL;
 	}
 
-	WARN_ON(lock_policy_rwsem_write(cpu));
+	WARN_ON(lock_policy_rwsem_read(cpu));
 	cpus = cpumask_weight(policy->cpus);
-
-	if (cpus > 1)
-		cpumask_clear_cpu(cpu, policy->cpus);
-	unlock_policy_rwsem_write(cpu);
+	unlock_policy_rwsem_read(cpu);
 
 	/* If cpu is last user of policy, free policy */
-	if (cpus == 1) {
-		if (cpufreq_driver->target) {
-			ret = __cpufreq_governor(policy,
-					CPUFREQ_GOV_POLICY_EXIT);
-			if (ret) {
-				pr_err("%s: Failed to exit governor\n",
-						__func__);
-				return ret;
-			}
-		}
-
+	if (cpus == 0) {
 		if (!frozen) {
 			lock_policy_rwsem_read(cpu);
 			kobj = &policy->kobj;
@@ -1294,15 +1298,6 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
 
 		if (!frozen)
 			cpufreq_policy_free(policy);
-	} else {
-		if (cpufreq_driver->target) {
-			if ((ret = __cpufreq_governor(policy, CPUFREQ_GOV_START)) ||
-					(ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS))) {
-				pr_err("%s: Failed to start governor\n",
-						__func__);
-				return ret;
-			}
-		}
 	}
 
 	per_cpu(cpufreq_cpu_data, cpu) = NULL;



Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02  9:08                                                             ` Peter Zijlstra
@ 2013-10-02 12:13                                                               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-02 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On 10/02, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > > another thread, this should likely "join" all synchronize_sched's.
> > >
> > > That would still be 4k * sync_sched() == terribly long.
> >
> > No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> > pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
>
> Hmm, not in the version you posted; there xxx_enter() would skip the
> sync_sched only if there's a concurrent 'writer', in which case it will
> wait for it.

No, please see below.

> You only avoid the sync_sched in xxx_exit() and potentially join in the
> sync_sched() of a next xxx_begin().
>
> So with that scheme:
>
>   for (i= ; i<4096; i++) {
>     xxx_begin();
>     xxx_exit();
>   }
>
> Will get 4096 sync_sched() calls from the xxx_begin() and all but the
> last xxx_exit() will 'drop' the rcu callback.

No, the code above should call sync_sched() only once, no matter what
this code does between _enter and _exit. This was one of the points.

To clarify, of course I mean the "likely" case. Say, a long preemption
after _exit can lead to another sync_sched().

	void xxx_enter(struct xxx_struct *xxx)
	{
		bool need_wait, need_sync;

		spin_lock_irq(&xxx->xxx_lock);
		need_wait = xxx->gp_count++;
		need_sync = xxx->gp_state == GP_IDLE;
		if (need_sync)
			xxx->gp_state = GP_PENDING;
		spin_unlock_irq(&xxx->xxx_lock);

		BUG_ON(need_wait && need_sync);

		if (need_sync) {
			synchronize_sched();
			xxx->gp_state = GP_PASSED;
			wake_up_all(&xxx->gp_waitq);
		} else if (need_wait) {
			wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
		} else {
			BUG_ON(xxx->gp_state != GP_PASSED);
		}
	}

The 1st iteration:

	xxx_enter() does synchronize_sched() and sets gp_state = GP_PASSED.

	xxx_exit() starts the rcu callback, but gp_state is still PASSED.

all other iterations in the "likely" case:

	xxx_enter() should likely come before the pending callback fires
	and clears gp_state. In this case we only increment ->gp_count
	(this "disables" the rcu callback) and do nothing more, gp_state
	is still GP_PASSED.

	xxx_exit() does another call_rcu_sched(), or does the
	CB_PENDING -> CB_REPLAY change. The latter is the same as "start
	another callback".

In short: unless a gp elapses between _exit() and _enter(), the next
_enter() does nothing and avoids synchronize_sched().
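
To spell out the other half (just the idea, not necessarily the exact code from
the other thread; the CB_* states and the callback body are sketched from the
description above):

	static void xxx_rcu_cb(struct rcu_head *rcu)
	{
		struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
		unsigned long flags;

		spin_lock_irqsave(&xxx->xxx_lock, flags);
		if (xxx->gp_count) {
			/* A new _enter() holds the gp, keep gp_state == GP_PASSED. */
			xxx->cb_state = CB_IDLE;
		} else if (xxx->cb_state == CB_REPLAY) {
			/* An _exit() happened after we were queued, run another gp. */
			xxx->cb_state = CB_PENDING;
			call_rcu_sched(&xxx->cb_head, xxx_rcu_cb);
		} else {
			xxx->cb_state = CB_IDLE;
			xxx->gp_state = GP_IDLE;
		}
		spin_unlock_irqrestore(&xxx->xxx_lock, flags);
	}

	void xxx_exit(struct xxx_struct *xxx)
	{
		spin_lock_irq(&xxx->xxx_lock);
		if (!--xxx->gp_count) {
			if (xxx->cb_state == CB_IDLE) {
				xxx->cb_state = CB_PENDING;
				call_rcu_sched(&xxx->cb_head, xxx_rcu_cb);
			} else if (xxx->cb_state == CB_PENDING) {
				xxx->cb_state = CB_REPLAY;
			}
		}
		spin_unlock_irq(&xxx->xxx_lock);
	}

Everything above is serialized by xxx_lock, so the only full synchronize_sched()
is the one in xxx_enter() when gp_state is GP_IDLE.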

> And given the construct; I'm not entirely sure you can do away with the
> sync_sched() in between. While its clear to me you can merge the two
> into one; leaving it out entirely doesn't seem right.

Could you explain?

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-01 19:05                                                             ` Paul E. McKenney
@ 2013-10-02 12:16                                                               ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-02 12:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Srivatsa S. Bhat, Rafael J. Wysocki, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar, tony.luck, bp

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > On 10/01, Peter Zijlstra wrote:
> > >
> > > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I tend to agree with Srivatsa... Without a strong reason it would be better
> > > > to preserve the current logic: "some time after" should not be after the
> > > > next CPU_DOWN/UP*. But I won't argue too much.
> > >
> > > Nah, I think breaking it is the right thing :-)
> >
> > I don't really agree but I won't argue ;)
>
> The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
> guys who would need to complain, given that they seem to have the only
> use in 3.11.

mce_cpu_callback() is fine, it ignores POST_DEAD if CPU_TASKS_FROZEN.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 12:13                                                               ` Oleg Nesterov
@ 2013-10-02 12:25                                                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-02 12:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> > And given the construct; I'm not entirely sure you can do away with the
> > sync_sched() in between. While it's clear to me you can merge the two
> > into one; leaving it out entirely doesn't seem right.
> 
> Could you explain?

Somehow I thought the fastpath got enabled; it doesn't since we never
hit GP_IDLE, so we don't actually need that.

You're right.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 12:13                                                               ` Oleg Nesterov
@ 2013-10-02 13:31                                                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-02 13:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> In short: unless a gp elapses between _exit() and _enter(), the next
> _enter() does nothing and avoids synchronize_sched().

That does however make the entire scheme entirely writer biased;
increasing the need for the waitcount thing I have. Otherwise we'll
starve pending readers.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 13:31                                                                 ` Peter Zijlstra
@ 2013-10-02 14:00                                                                   ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-02 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > In short: unless a gp elapses between _exit() and _enter(), the next
> > _enter() does nothing and avoids synchronize_sched().
>
> That does however make the entire scheme entirely writer biased;

Well, this makes the scheme "a bit more" writer biased, but this is
exactly what we want in this case.

We do not block the readers after xxx_exit() entirely, but we do want
to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
transitions.

Let's even forget about disable_nonboot_cpus(), let's consider
percpu_rwsem-like logic "in general".

Yes, it is heavily optimized for readers. But if the writers come in
a batch, or the same writer does down_write + up_write twice or more,
I think state == FAST is pointless in between (if we can avoid it).
This is the rare case (the writers should be rare), but if it happens
it makes sense to optimize the writers too. And again, even

	for (;;) {
		percpu_down_write();
		percpu_up_write();
	}

should not completely block the readers.

IOW. "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
a win. If the next down_write/xxx_enter "knows" that the readers are
still in SLOW mode because gp was not completed yet, why should we
add the artificial delay?
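
A rough sketch (assumed names, not the in-tree percpu_rwsem code) of how the
write side being discussed could sit on top of xxx_enter/xxx_exit:

	struct sketch_rwsem {
		struct xxx_struct	sync;	/* gp/cb state machine from above */
		struct rw_semaphore	rw_sem;	/* writer exclusion + slow-path readers */
	};

	void sketch_down_write(struct sketch_rwsem *sem)
	{
		/*
		 * Push readers onto the SLOW path. In the writer-batch case
		 * this returns without synchronize_sched(): the previous
		 * up_write()'s callback has not fired yet, so gp_state is
		 * still GP_PASSED.
		 */
		xxx_enter(&sem->sync);

		down_write(&sem->rw_sem);
		/* ... wait for the fast-path readers that got in before the
		 * switch to drain (per-cpu counters in the real thing) ... */
	}

	void sketch_up_write(struct sketch_rwsem *sem)
	{
		up_write(&sem->rw_sem);
		/*
		 * "turn sync_sched() into call_rcu_sched() in up_write()":
		 * the FAST path comes back only via the rcu callback, i.e.
		 * only after a writer-free grace period.
		 */
		xxx_exit(&sem->sync);
	}

With this shape the down_write/up_write loop above pays for a single grace
period in the likely case, while readers can still slip in through the slow
path whenever rw_sem is not write-held.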

As for disable_nonboot_cpus(). You are going to move cpu_hotplug_begin()
outside of the loop, this is the same thing.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-09-29 18:36       ` Oleg Nesterov
@ 2013-10-02 14:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-02 14:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds



^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 14:00                                                                   ` Oleg Nesterov
@ 2013-10-02 15:17                                                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-02 15:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> And again, even
> 
> 	for (;;) {
> 		percpu_down_write();
> 		percpu_up_write();
> 	}
> 
> should not completely block the readers.

Sure there's a tiny window, but don't forget that a reader will have to
wait for the gp_state cacheline to transfer to shared state and the
per-cpu refcount cachelines to be brought back into exclusive mode and
the above can be aggressive enough that by that time we'll observe
state == blocked again.

So I don't think that in practice a reader will get in.

Also, since the write side is exposed to userspace, you've got an
effective DoS.

So I'll stick to waitcount -- as you can see in the patches I've just
posted.

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 15:17                                                                     ` Peter Zijlstra
@ 2013-10-02 16:31                                                                       ` Oleg Nesterov
  -1 siblings, 0 replies; 361+ messages in thread
From: Oleg Nesterov @ 2013-10-02 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Paul E. McKenney,
	Mel Gorman, Rik van Riel, Srikar Dronamraju, Ingo Molnar,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Thomas Gleixner, Steven Rostedt, Viresh Kumar

On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> > And again, even
> >
> > 	for (;;) {
> > 		percpu_down_write();
> > 		percpu_up_write();
> > 	}
> >
> > should not completely block the readers.
>
> Sure there's a tiny window, but don't forget that a reader will have to
> wait for the gp_state cacheline to transfer to shared state and the
> per-cpu refcount cachelines to be brought back into exclusive mode and
> the above can be aggressive enough that by that time we'll observe
> state == blocked again.

Sure, but don't forget that other callers of cpu_down() do a lot more
work before/after they actually call cpu_hotplug_begin/end().

> So I'll stick to waitcount -- as you can see in the patches I've just
> posted.

I still do not believe we need this waitcount "in practice" ;)

But even if I am right this is minor and we can reconsider this later,
so please forget.

Oleg.


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()
  2013-10-02 14:00                                                                   ` Oleg Nesterov
@ 2013-10-02 17:52                                                                     ` Paul E. McKenney
  -1 siblings, 0 replies; 361+ messages in thread
From: Paul E. McKenney @ 2013-10-02 17:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Srivatsa S. Bhat, Rafael J. Wysocki, Mel Gorman,
	Rik van Riel, Srikar Dronamraju, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Thomas Gleixner, Steven Rostedt,
	Viresh Kumar

On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> >
> > On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > > In short: unless a gp elapses between _exit() and _enter(), the next
> > > _enter() does nothing and avoids synchronize_sched().
> >
> > That does however make the entire scheme entirely writer biased;
> 
> Well, this makes the scheme "a bit more" writer biased, but this is
> exactly what we want in this case.
> 
> We do not block the readers after xxx_exit() entirely, but we do want
> to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
> transitions.

Yes -- should help -a- -lot- for bulk write-side operations, such as
onlining all CPUs at boot time.  ;-)

							Thanx, Paul

> Let's even forget about disable_nonboot_cpus(), let's consider
> percpu_rwsem-like logic "in general".
> 
> Yes, it is heavily optimized for readers. But if the writers come in
> a batch, or the same writer does down_write + up_write twice or more,
> I think state == FAST is pointless in between (if we can avoid it).
> This is the rare case (the writers should be rare), but if it happens
> it makes sense to optimize the writers too. And again, even
> 
> 	for (;;) {
> 		percpu_down_write();
> 		percpu_up_write();
> 	}
> 
> should not completely block the readers.
> 
> IOW. "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
> a win. If the next down_write/xxx_enter "knows" that the readers are
> still in SLOW mode because gp was not completed yet, why should we
> add the artificial delay?
> 
> As for disable_nonboot_cpus(). You are going to move cpu_hotplug_begin()
> outside of the loop, this is the same thing.
> 
> Oleg.
> 


^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-10-02 14:41         ` Peter Zijlstra
@ 2013-10-03  7:04           ` Ingo Molnar
  -1 siblings, 0 replies; 361+ messages in thread
From: Ingo Molnar @ 2013-10-03  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> 

Fully agreed! :-)

	Ingo

^ permalink raw reply	[flat|nested] 361+ messages in thread

* Re: [RFC] introduce synchronize_sched_{enter,exit}()
  2013-10-03  7:04           ` Ingo Molnar
@ 2013-10-03  7:43             ` Peter Zijlstra
  -1 siblings, 0 replies; 361+ messages in thread
From: Peter Zijlstra @ 2013-10-03  7:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Nesterov, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML, Paul McKenney,
	Thomas Gleixner, Steven Rostedt, Linus Torvalds

On Thu, Oct 03, 2013 at 09:04:59AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > 
> 
> Fully agreed! :-)

haha.. never realized I sent that email completely empty. It was
supposed to contain the patch I later sent as 2/3.

^ permalink raw reply	[flat|nested] 361+ messages in thread

end of thread, other threads:[~2013-10-03  7:43 UTC | newest]

Thread overview: 361+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-10  9:31 [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7 Mel Gorman
2013-09-10  9:31 ` Mel Gorman
2013-09-10  9:31 ` [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-11  0:58   ` Joonsoo Kim
2013-09-11  0:58     ` Joonsoo Kim
2013-09-11  3:11   ` Hillf Danton
2013-09-11  3:11     ` Hillf Danton
2013-09-13  8:11     ` Mel Gorman
2013-09-13  8:11       ` Mel Gorman
2013-09-10  9:31 ` [PATCH 02/50] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 03/50] sched, numa: Comment fixlets Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 04/50] mm: numa: Do not account for a hinting fault if we raced Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 05/50] mm: Wait for THP migrations to complete during NUMA hinting faults Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 06/50] mm: Prevent parallel splits during THP migration Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-16 12:36   ` Peter Zijlstra
2013-09-16 12:36     ` Peter Zijlstra
2013-09-16 13:39     ` Rik van Riel
2013-09-16 13:39       ` Rik van Riel
2013-09-16 14:54       ` Peter Zijlstra
2013-09-16 14:54         ` Peter Zijlstra
2013-09-16 16:11         ` Mel Gorman
2013-09-16 16:11           ` Mel Gorman
2013-09-16 16:37           ` Peter Zijlstra
2013-09-16 16:37             ` Peter Zijlstra
2013-09-10  9:31 ` [PATCH 08/50] mm: numa: Sanitize task_numa_fault() callsites Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 09/50] mm: numa: Do not migrate or account for hinting faults on the zero page Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 10/50] sched: numa: Mitigate chance that same task always updates PTEs Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 11/50] sched: numa: Continue PTE scanning even if migrate rate limited Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 12/50] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 13/50] sched: numa: Initialise numa_next_scan properly Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-16 15:18   ` Peter Zijlstra
2013-09-16 15:18     ` Peter Zijlstra
2013-09-16 15:40     ` Mel Gorman
2013-09-16 15:40       ` Mel Gorman
2013-09-10  9:31 ` [PATCH 15/50] sched: numa: Correct adjustment of numa_scan_period Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 16/50] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-16 16:35   ` Peter Zijlstra
2013-09-16 16:35     ` Peter Zijlstra
2013-09-17 17:00     ` Mel Gorman
2013-09-17 17:00       ` Mel Gorman
2013-09-10  9:31 ` [PATCH 18/50] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:31 ` [PATCH 19/50] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-09-10  9:31   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 20/50] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 21/50] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 22/50] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 23/50] sched: Resist moving tasks towards nodes with fewer hinting faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 24/50] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 25/50] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 26/50] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-12  2:10   ` Hillf Danton
2013-09-12  2:10     ` Hillf Danton
2013-09-13  8:11     ` Mel Gorman
2013-09-13  8:11       ` Mel Gorman
2013-09-10  9:32 ` [PATCH 28/50] sched: Remove check that skips small VMAs Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 29/50] sched: Set preferred NUMA node based on number of private faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 30/50] sched: Do not migrate memory immediately after switching node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 31/50] sched: Avoid overloading CPUs on a preferred NUMA node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 32/50] sched: Retry migration of tasks to CPU on a preferred node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 33/50] sched: numa: increment numa_migrate_seq when task runs in correct location Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-17  2:02   ` 答复: " 张天飞
2013-09-17  8:05     ` ????: " Mel Gorman
2013-09-17  8:05       ` Mel Gorman
2013-09-17  8:22       ` Figo.zhang
2013-09-10  9:32 ` [PATCH 35/50] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 36/50] stop_machine: Introduce stop_two_cpus() Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 37/50] sched: Introduce migrate_swap() Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-17 14:30   ` [PATCH] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
2013-09-17 14:30     ` Peter Zijlstra
2013-09-17 16:20     ` Mel Gorman
2013-09-17 16:20       ` Mel Gorman
2013-09-17 16:45       ` Peter Zijlstra
2013-09-17 16:45         ` Peter Zijlstra
2013-09-18 15:49         ` Peter Zijlstra
2013-09-18 15:49           ` Peter Zijlstra
2013-09-19 14:32           ` Peter Zijlstra
2013-09-19 14:32             ` Peter Zijlstra
2013-09-21 16:34             ` Oleg Nesterov
2013-09-21 16:34               ` Oleg Nesterov
2013-09-21 19:13               ` Oleg Nesterov
2013-09-21 19:13                 ` Oleg Nesterov
2013-09-23  9:29               ` Peter Zijlstra
2013-09-23  9:29                 ` Peter Zijlstra
2013-09-23 17:32                 ` Oleg Nesterov
2013-09-23 17:32                   ` Oleg Nesterov
2013-09-24 20:24                   ` Peter Zijlstra
2013-09-24 20:24                     ` Peter Zijlstra
2013-09-24 21:02                     ` Peter Zijlstra
2013-09-24 21:02                       ` Peter Zijlstra
2013-09-25 15:55                     ` Oleg Nesterov
2013-09-25 15:55                       ` Oleg Nesterov
2013-09-25 16:59                       ` Paul E. McKenney
2013-09-25 16:59                         ` Paul E. McKenney
2013-09-25 17:43                       ` Peter Zijlstra
2013-09-25 17:43                         ` Peter Zijlstra
2013-09-25 17:50                         ` Oleg Nesterov
2013-09-25 17:50                           ` Oleg Nesterov
2013-09-25 18:40                           ` Peter Zijlstra
2013-09-25 18:40                             ` Peter Zijlstra
2013-09-25 21:22                             ` Paul E. McKenney
2013-09-25 21:22                               ` Paul E. McKenney
2013-09-26 11:10                               ` Peter Zijlstra
2013-09-26 11:10                                 ` Peter Zijlstra
     [not found]                                 ` <20130926155321.GA4342@redhat.com>
2013-09-26 16:13                                   ` Peter Zijlstra
2013-09-26 16:13                                     ` Peter Zijlstra
2013-09-26 16:14                                     ` Oleg Nesterov
2013-09-26 16:14                                       ` Oleg Nesterov
2013-09-26 16:40                                       ` Peter Zijlstra
2013-09-26 16:40                                         ` Peter Zijlstra
2013-09-26 16:58                                 ` Oleg Nesterov
2013-09-26 16:58                                   ` Oleg Nesterov
2013-09-26 17:50                                   ` Peter Zijlstra
2013-09-26 17:50                                     ` Peter Zijlstra
2013-09-27 18:15                                     ` Oleg Nesterov
2013-09-27 18:15                                       ` Oleg Nesterov
2013-09-27 20:41                                       ` Peter Zijlstra
2013-09-27 20:41                                         ` Peter Zijlstra
2013-09-28 12:48                                         ` Oleg Nesterov
2013-09-28 12:48                                           ` Oleg Nesterov
2013-09-28 14:47                                           ` Peter Zijlstra
2013-09-28 14:47                                             ` Peter Zijlstra
2013-09-28 16:31                                             ` Oleg Nesterov
2013-09-28 16:31                                               ` Oleg Nesterov
2013-09-30 20:11                                               ` Rafael J. Wysocki
2013-09-30 20:11                                                 ` Rafael J. Wysocki
2013-10-01 17:11                                                 ` Srivatsa S. Bhat
2013-10-01 17:11                                                   ` Srivatsa S. Bhat
2013-10-01 17:36                                                   ` Peter Zijlstra
2013-10-01 17:36                                                     ` Peter Zijlstra
2013-10-01 17:45                                                     ` Oleg Nesterov
2013-10-01 17:45                                                       ` Oleg Nesterov
2013-10-01 17:56                                                       ` Peter Zijlstra
2013-10-01 17:56                                                         ` Peter Zijlstra
2013-10-01 18:07                                                         ` Oleg Nesterov
2013-10-01 18:07                                                           ` Oleg Nesterov
2013-10-01 19:05                                                           ` Paul E. McKenney
2013-10-01 19:05                                                             ` Paul E. McKenney
2013-10-02 12:16                                                             ` Oleg Nesterov
2013-10-02 12:16                                                               ` Oleg Nesterov
2013-10-02  9:08                                                           ` Peter Zijlstra
2013-10-02  9:08                                                             ` Peter Zijlstra
2013-10-02 12:13                                                             ` Oleg Nesterov
2013-10-02 12:13                                                               ` Oleg Nesterov
2013-10-02 12:25                                                               ` Peter Zijlstra
2013-10-02 12:25                                                                 ` Peter Zijlstra
2013-10-02 13:31                                                               ` Peter Zijlstra
2013-10-02 13:31                                                                 ` Peter Zijlstra
2013-10-02 14:00                                                                 ` Oleg Nesterov
2013-10-02 14:00                                                                   ` Oleg Nesterov
2013-10-02 15:17                                                                   ` Peter Zijlstra
2013-10-02 15:17                                                                     ` Peter Zijlstra
2013-10-02 16:31                                                                     ` Oleg Nesterov
2013-10-02 16:31                                                                       ` Oleg Nesterov
2013-10-02 17:52                                                                   ` Paul E. McKenney
2013-10-02 17:52                                                                     ` Paul E. McKenney
2013-10-01 19:03                                                         ` Srivatsa S. Bhat
2013-10-01 19:03                                                           ` Srivatsa S. Bhat
2013-10-01 18:14                                                     ` Srivatsa S. Bhat
2013-10-01 18:14                                                       ` Srivatsa S. Bhat
2013-10-01 18:56                                                       ` Srivatsa S. Bhat
2013-10-01 18:56                                                         ` Srivatsa S. Bhat
2013-10-02 10:14                                                       ` Srivatsa S. Bhat
2013-10-02 10:14                                                         ` Srivatsa S. Bhat
2013-09-28 20:46                                           ` Paul E. McKenney
2013-09-28 20:46                                             ` Paul E. McKenney
2013-10-01  3:56                                         ` Paul E. McKenney
2013-10-01  3:56                                           ` Paul E. McKenney
2013-10-01 14:14                                           ` Oleg Nesterov
2013-10-01 14:14                                             ` Oleg Nesterov
2013-10-01 14:45                                             ` Paul E. McKenney
2013-10-01 14:45                                               ` Paul E. McKenney
2013-10-01 14:48                                               ` Peter Zijlstra
2013-10-01 14:48                                                 ` Peter Zijlstra
2013-10-01 15:24                                                 ` Paul E. McKenney
2013-10-01 15:24                                                   ` Paul E. McKenney
2013-10-01 15:34                                                   ` Oleg Nesterov
2013-10-01 15:34                                                     ` Oleg Nesterov
2013-10-01 15:00                                               ` Oleg Nesterov
2013-10-01 15:00                                                 ` Oleg Nesterov
2013-09-29 13:56                                       ` Oleg Nesterov
2013-09-29 13:56                                         ` Oleg Nesterov
2013-10-01 15:38                                         ` Paul E. McKenney
2013-10-01 15:38                                           ` Paul E. McKenney
2013-10-01 15:40                                           ` Oleg Nesterov
2013-10-01 15:40                                             ` Oleg Nesterov
2013-10-01 20:40                                 ` Paul E. McKenney
2013-10-01 20:40                                   ` Paul E. McKenney
2013-09-23 14:50             ` Steven Rostedt
2013-09-23 14:50               ` Steven Rostedt
2013-09-23 14:54               ` Peter Zijlstra
2013-09-23 14:54                 ` Peter Zijlstra
2013-09-23 15:13                 ` Steven Rostedt
2013-09-23 15:13                   ` Steven Rostedt
2013-09-23 15:22                   ` Peter Zijlstra
2013-09-23 15:22                     ` Peter Zijlstra
2013-09-23 15:59                     ` Steven Rostedt
2013-09-23 15:59                       ` Steven Rostedt
2013-09-23 16:02                       ` Peter Zijlstra
2013-09-23 16:02                         ` Peter Zijlstra
2013-09-23 15:50                   ` Paul E. McKenney
2013-09-23 15:50                     ` Paul E. McKenney
2013-09-23 16:01                     ` Peter Zijlstra
2013-09-23 16:01                       ` Peter Zijlstra
2013-09-23 17:04                       ` Paul E. McKenney
2013-09-23 17:04                         ` Paul E. McKenney
2013-09-23 17:30                         ` Peter Zijlstra
2013-09-23 17:30                           ` Peter Zijlstra
2013-09-23 17:50             ` Oleg Nesterov
2013-09-23 17:50               ` Oleg Nesterov
2013-09-24 12:38               ` Peter Zijlstra
2013-09-24 12:38                 ` Peter Zijlstra
2013-09-24 14:42                 ` Paul E. McKenney
2013-09-24 14:42                   ` Paul E. McKenney
2013-09-24 16:09                   ` Peter Zijlstra
2013-09-24 16:09                     ` Peter Zijlstra
2013-09-24 16:31                     ` Oleg Nesterov
2013-09-24 16:31                       ` Oleg Nesterov
2013-09-24 21:09                     ` Paul E. McKenney
2013-09-24 21:09                       ` Paul E. McKenney
2013-09-24 16:03                 ` Oleg Nesterov
2013-09-24 16:03                   ` Oleg Nesterov
2013-09-24 16:43                   ` Steven Rostedt
2013-09-24 16:43                     ` Steven Rostedt
2013-09-24 17:06                     ` Oleg Nesterov
2013-09-24 17:06                       ` Oleg Nesterov
2013-09-24 17:47                       ` Paul E. McKenney
2013-09-24 17:47                         ` Paul E. McKenney
2013-09-24 18:00                         ` Oleg Nesterov
2013-09-24 18:00                           ` Oleg Nesterov
2013-09-24 20:35                           ` Peter Zijlstra
2013-09-24 20:35                             ` Peter Zijlstra
2013-09-25 15:16                             ` Oleg Nesterov
2013-09-25 15:16                               ` Oleg Nesterov
2013-09-25 15:35                               ` Peter Zijlstra
2013-09-25 15:35                                 ` Peter Zijlstra
2013-09-25 16:33                                 ` Oleg Nesterov
2013-09-25 16:33                                   ` Oleg Nesterov
2013-09-24 16:49                   ` Paul E. McKenney
2013-09-24 16:49                     ` Paul E. McKenney
2013-09-24 16:54                     ` Peter Zijlstra
2013-09-24 16:54                       ` Peter Zijlstra
2013-09-24 17:02                       ` Oleg Nesterov
2013-09-24 17:02                         ` Oleg Nesterov
2013-09-24 16:51                   ` Peter Zijlstra
2013-09-24 16:51                     ` Peter Zijlstra
2013-09-24 16:39                 ` Steven Rostedt
2013-09-24 16:39                   ` Steven Rostedt
2013-09-29 18:36     ` [RFC] introduce synchronize_sched_{enter,exit}() Oleg Nesterov
2013-09-29 18:36       ` Oleg Nesterov
2013-09-29 20:01       ` Paul E. McKenney
2013-09-29 20:01         ` Paul E. McKenney
2013-09-30 12:42         ` Oleg Nesterov
2013-09-30 12:42           ` Oleg Nesterov
2013-09-29 21:34       ` Steven Rostedt
2013-09-29 21:34         ` Steven Rostedt
2013-09-30 13:03         ` Oleg Nesterov
2013-09-30 13:03           ` Oleg Nesterov
2013-09-30 12:59       ` Peter Zijlstra
2013-09-30 12:59         ` Peter Zijlstra
2013-09-30 14:24         ` Peter Zijlstra
2013-09-30 14:24           ` Peter Zijlstra
2013-09-30 15:06           ` Peter Zijlstra
2013-09-30 15:06             ` Peter Zijlstra
2013-09-30 16:58             ` Oleg Nesterov
2013-09-30 16:58               ` Oleg Nesterov
2013-09-30 16:38         ` Oleg Nesterov
2013-09-30 16:38           ` Oleg Nesterov
2013-10-02 14:41       ` Peter Zijlstra
2013-10-02 14:41         ` Peter Zijlstra
2013-10-03  7:04         ` Ingo Molnar
2013-10-03  7:04           ` Ingo Molnar
2013-10-03  7:43           ` Peter Zijlstra
2013-10-03  7:43             ` Peter Zijlstra
2013-09-17 14:32   ` [PATCH 37/50] sched: Introduce migrate_swap() Peter Zijlstra
2013-09-17 14:32     ` Peter Zijlstra
2013-09-10  9:32 ` [PATCH 38/50] sched: numa: Use a system-wide search to find swap/migration candidates Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 39/50] sched: numa: Favor placing a task on the preferred node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 40/50] mm: numa: Change page last {nid,pid} into {cpu,pid} Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-12 12:42   ` Hillf Danton
2013-09-12 12:42     ` Hillf Danton
2013-09-12 14:40     ` Mel Gorman
2013-09-12 14:40       ` Mel Gorman
2013-09-12 12:45   ` Hillf Danton
2013-09-12 12:45     ` Hillf Danton
2013-09-10  9:32 ` [PATCH 42/50] sched: numa: Report a NUMA task group ID Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 43/50] mm: numa: Do not group on RO pages Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 44/50] sched: numa: stay on the same node if CLONE_VM Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 45/50] sched: numa: use group fault statistics in numa placement Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-20  9:55   ` Peter Zijlstra
2013-09-20  9:55     ` Peter Zijlstra
2013-09-20 12:31     ` Mel Gorman
2013-09-20 12:31       ` Mel Gorman
2013-09-20 12:36       ` Peter Zijlstra
2013-09-20 12:36         ` Peter Zijlstra
2013-09-20 13:31       ` Mel Gorman
2013-09-20 13:31         ` Mel Gorman
2013-09-10  9:32 ` [PATCH 47/50] sched: numa: add debugging Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 48/50] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 49/50] sched: numa: fix task or group comparison Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-10  9:32 ` [PATCH 50/50] sched: numa: Avoid migrating tasks that are placed on their preferred node Mel Gorman
2013-09-10  9:32   ` Mel Gorman
2013-09-11  2:03 ` [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7 Rik van Riel
2013-09-14  2:57 ` Bob Liu
2013-09-14  2:57   ` Bob Liu
2013-09-30 10:30   ` Mel Gorman
2013-09-30 10:30     ` Mel Gorman
