linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
@ 2020-02-24  9:52 Mel Gorman
  2020-02-24  9:52 ` [PATCH 01/13] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
                   ` (14 more replies)
  0 siblings, 15 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

The only differences in V6 are due to Vincent's latest patch series.

This builds on V5, which included the latest versions of Vincent's patches
addressing review feedback. Patches 4-9 are Vincent's work plus one
important performance fix. Vincent's patches were retested and, while
not presented in detail, the results were mostly an improvement.

Changelog since V5:
o Import Vincent's latest patch set

Changelog since V4:
o The V4 posting contained the completely wrong versions of the patches
  and is useless.

Changelog since V3:
o Remove stray tab						(Valentin)
o Update comment about allowing a move when src is imbalanced	(Hillf)
o Updated "sched/pelt: Add a new runnable average signal"	(Vincent)

Changelog since V2:
o Rebase on top of Vincent's series again
o Fix a missed rcu_read_unlock
o Reduce overhead of tracepoint

Changelog since V1:
o Rebase on top of Vincent's series and rework

Note: The baseline for this series is tip/sched/core as of February
	12th rebased on top of v5.6-rc1. The series includes patches from
	Vincent as I needed to add a fix and build on top of it. Vincent's
	series on its own introduces performance regressions for *some*
	but not *all* machines so it's easily missed. This series overall
	is close to performance-neutral with some gains depending on the
	machine. However, the end result does less work on NUMA balancing
	and the fact that both the NUMA balancer and the load balancer use
	similar logic makes it much easier to understand.

The NUMA balancer makes placement decisions on tasks that only partially
take the load balancer into account, and vice versa, but there are
inconsistencies. This can result in placement decisions that override
each other, leading to unnecessary migrations -- of both tasks and
pages. This series reconciles many of those decisions -- it is partially
Vincent's work, with some fixes and optimisations on top to merge our
two series.

The first patch is unrelated. It's picked up by tip but was not present in
the tree at the time of the fork. I'm including it here because I tested
with it.

The second and third patches are tracing only and were needed to get
sensible data out of ftrace with respect to task placement for NUMA
balancing. The NUMA balancer is *far* easier to analyse with these
patches and the traces informed how the series should be developed.

Patches 4-5 are Vincent's and use very similar code patterns and logic
between the NUMA balancer and the load balancer. Patch 6 is a fix to
Vincent's work that is necessary to avoid serious imbalances being
introduced by the NUMA balancer. Patches 7-9 are also Vincent's and while
I have not reviewed them closely myself, others have.

The rest of the series is a mix of optimisations and improvements, one
of which stops the NUMA balancer fighting with itself.

Note that this is not necessarily a universal performance win although
performance results are generally ok (small gains/losses depending on
the machine and workload). However, task migrations, page migrations,
variability and overall overhead are generally reduced.

The main reference workload I used was specjbb running one JVM per node
which typically would be expected to split evenly. It's an interesting
workload because the number of "warehouses" is not linearly related
to the number of running tasks due to the creation of GC threads
and other interfering activity. The mmtests configuration used is
jvm-specjbb2005-multi with two runs -- one of them with ftrace enabling
the relevant scheduler tracepoints.

An example of the headline performance of the series is below and the
tested kernels are

baseline-v3r1	Patches 1-3 for the tracing
loadavg-v3	Patches 1-5 (Add half of Vincent's work)
lbidle-v6	Patches 1-6 Vincent's work with a fix on top
classify-v6	Patches 1-9 Rest of Vincent's work
stopsearch-v6	All patches

                               5.6.0-rc1              5.6.0-rc1              5.6.0-rc1              5.6.0-rc1              5.6.0-rc1
                             baseline-v3             loadavg-v3              lbidle-v3            classify-v6          stopsearch-v6
Hmean     tput-1     43593.49 (   0.00%)    41616.85 (  -4.53%)    43657.25 (   0.15%)    38110.46 * -12.58%*    42213.29 (  -3.17%)
Hmean     tput-2     95692.84 (   0.00%)    93196.89 *  -2.61%*    92287.78 *  -3.56%*    89077.29 (  -6.91%)    96474.49 *   0.82%*
Hmean     tput-3    143813.12 (   0.00%)   134447.05 *  -6.51%*   134587.84 *  -6.41%*   133706.98 (  -7.03%)   144279.90 (   0.32%)
Hmean     tput-4    190702.67 (   0.00%)   176533.79 *  -7.43%*   182278.42 *  -4.42%*   181405.19 (  -4.88%)   189948.10 (  -0.40%)
Hmean     tput-5    230242.39 (   0.00%)   209059.51 *  -9.20%*   223219.06 (  -3.05%)   227188.16 (  -1.33%)   225220.39 (  -2.18%)
Hmean     tput-6    274868.74 (   0.00%)   246470.42 * -10.33%*   258387.09 *  -6.00%*   264252.76 (  -3.86%)   271429.49 (  -1.25%)
Hmean     tput-7    312281.15 (   0.00%)   284564.06 *  -8.88%*   296446.00 *  -5.07%*   302682.72 (  -3.07%)   309187.26 (  -0.99%)
Hmean     tput-8    347261.31 (   0.00%)   332019.39 *  -4.39%*   331202.25 *  -4.62%*   339469.52 (  -2.24%)   345504.60 (  -0.51%)
Hmean     tput-9    387336.25 (   0.00%)   352219.62 *  -9.07%*   370222.03 *  -4.42%*   367077.01 (  -5.23%)   381610.17 (  -1.48%)
Hmean     tput-10   421586.76 (   0.00%)   397304.22 (  -5.76%)   405458.01 (  -3.83%)   416689.66 (  -1.16%)   415549.97 (  -1.43%)
Hmean     tput-11   459422.43 (   0.00%)   398023.51 * -13.36%*   441999.08 (  -3.79%)   449912.39 (  -2.07%)   454458.04 (  -1.08%)
Hmean     tput-12   499087.97 (   0.00%)   400914.35 * -19.67%*   475755.59 (  -4.68%)   493678.32 (  -1.08%)   493936.79 (  -1.03%)
Hmean     tput-13   536335.59 (   0.00%)   406101.41 * -24.28%*   514858.97 (  -4.00%)   528496.01 (  -1.46%)   530662.68 (  -1.06%)
Hmean     tput-14   571542.75 (   0.00%)   478797.13 * -16.23%*   551716.00 (  -3.47%)   553771.29 (  -3.11%)   565915.55 (  -0.98%)
Hmean     tput-15   601412.81 (   0.00%)   534776.98 * -11.08%*   580105.28 (  -3.54%)   597513.89 (  -0.65%)   596192.34 (  -0.87%)
Hmean     tput-16   629817.55 (   0.00%)   407294.29 * -35.33%*   615606.40 (  -2.26%)   630044.12 (   0.04%)   627806.13 (  -0.32%)
Hmean     tput-17   667025.18 (   0.00%)   457416.34 * -31.42%*   626074.81 (  -6.14%)   659706.41 (  -1.10%)   658350.40 (  -1.30%)
Hmean     tput-18   688148.21 (   0.00%)   518534.45 * -24.65%*   663161.87 (  -3.63%)   675616.08 (  -1.82%)   682224.35 (  -0.86%)
Hmean     tput-19   705092.87 (   0.00%)   466530.37 * -33.83%*   689430.29 (  -2.22%)   691050.89 (  -1.99%)   705532.41 (   0.06%)
Hmean     tput-20   711481.44 (   0.00%)   564355.80 * -20.68%*   692170.67 (  -2.71%)   717866.36 (   0.90%)   716243.50 (   0.67%)
Hmean     tput-21   739790.92 (   0.00%)   508542.10 * -31.26%*   712348.91 (  -3.71%)   724666.68 (  -2.04%)   723361.87 (  -2.22%)
Hmean     tput-22   730593.57 (   0.00%)   540881.37 ( -25.97%)   709794.02 (  -2.85%)   727177.54 (  -0.47%)   721353.36 (  -1.26%)
Hmean     tput-23   738401.59 (   0.00%)   561474.46 * -23.96%*   702869.93 (  -4.81%)   720954.73 (  -2.36%)   720813.53 (  -2.38%)
Hmean     tput-24   731301.95 (   0.00%)   582929.73 * -20.29%*   704337.59 (  -3.69%)   717204.03 *  -1.93%*   714131.38 *  -2.35%*
Hmean     tput-25   734414.40 (   0.00%)   591635.13 ( -19.44%)   702334.30 (  -4.37%)   720272.39 (  -1.93%)   714245.12 (  -2.75%)
Hmean     tput-26   724774.17 (   0.00%)   701310.59 (  -3.24%)   700771.85 (  -3.31%)   718084.92 (  -0.92%)   712988.02 (  -1.63%)
Hmean     tput-27   713484.55 (   0.00%)   632795.43 ( -11.31%)   692213.36 (  -2.98%)   710432.96 (  -0.43%)   703087.86 (  -1.46%)
Hmean     tput-28   723111.86 (   0.00%)   697438.61 (  -3.55%)   695934.68 (  -3.76%)   708413.26 (  -2.03%)   703449.60 (  -2.72%)
Hmean     tput-29   714690.69 (   0.00%)   675820.16 (  -5.44%)   689400.90 (  -3.54%)   698436.85 (  -2.27%)   699981.24 (  -2.06%)
Hmean     tput-30   711106.03 (   0.00%)   699748.68 (  -1.60%)   688439.96 (  -3.19%)   698258.70 (  -1.81%)   691636.96 (  -2.74%)
Hmean     tput-31   701632.39 (   0.00%)   698807.56 (  -0.40%)   682588.20 (  -2.71%)   696608.99 (  -0.72%)   691015.36 (  -1.51%)
Hmean     tput-32   703479.77 (   0.00%)   679020.34 (  -3.48%)   674057.11 *  -4.18%*   690706.86 (  -1.82%)   684958.62 (  -2.63%)
Hmean     tput-33   691594.71 (   0.00%)   686583.04 (  -0.72%)   673382.64 (  -2.63%)   687319.97 (  -0.62%)   683367.65 (  -1.19%)
Hmean     tput-34   693435.51 (   0.00%)   685137.16 (  -1.20%)   674883.97 (  -2.68%)   684897.97 (  -1.23%)   674923.39 (  -2.67%)
Hmean     tput-35   688036.06 (   0.00%)   682612.92 (  -0.79%)   668159.93 (  -2.89%)   679301.53 (  -1.27%)   678117.69 (  -1.44%)
Hmean     tput-36   678957.95 (   0.00%)   670160.33 (  -1.30%)   662395.36 (  -2.44%)   672165.17 (  -1.00%)   668512.57 (  -1.54%)
Hmean     tput-37   679748.70 (   0.00%)   675428.41 (  -0.64%)   666970.33 (  -1.88%)   674127.70 (  -0.83%)   667644.78 (  -1.78%)
Hmean     tput-38   669969.62 (   0.00%)   670976.06 (   0.15%)   660499.74 (  -1.41%)   670848.38 (   0.13%)   666646.89 (  -0.50%)
Hmean     tput-39   669291.41 (   0.00%)   665367.66 (  -0.59%)   649337.71 (  -2.98%)   659685.61 (  -1.44%)   658818.08 (  -1.56%)
Hmean     tput-40   668074.80 (   0.00%)   672478.06 (   0.66%)   661273.87 (  -1.02%)   665147.36 (  -0.44%)   660279.43 (  -1.17%)

Note the regression with the first two patches of Vincent's work
(loadavg-v3), followed by lbidle-v3 which mostly restores the performance,
while the final version keeps things close to performance-neutral (showing
a mix but within noise). This is not universal as a different 2-socket
machine with fewer cores and older CPUs showed no difference. EPYC 1 and
EPYC 2 were both affected by the regression, as was a 4-socket Intel
box, but again, the full series is mostly performance-neutral for specjbb
while doing less NUMA balancing work.

While not presented here, the full series also shows that the throughput
measured by each JVM is less variable.

The high-level NUMA stats from /proc/vmstat look like this

                                      5.6.0-rc1      5.6.0-rc1      5.6.0-rc1      5.6.0-rc1      5.6.0-rc1
                                    baseline-v3     loadavg-v3      lbidle-v3    classify-v3  stopsearch-v3
Ops NUMA alloc hit                    878062.00      882981.00      957762.00      961630.00      880821.00
Ops NUMA alloc miss                        0.00           0.00           0.00           0.00           0.00
Ops NUMA interleave hit               225582.00      237785.00      242554.00      234671.00      234818.00
Ops NUMA alloc local                  764932.00      763850.00      835939.00      843950.00      763005.00
Ops NUMA base-page range updates     2517600.00     3707398.00     2889034.00     2442203.00     3303790.00
Ops NUMA PTE updates                 1754720.00     1672198.00     1569610.00     1356763.00     1591662.00
Ops NUMA PMD updates                    1490.00        3975.00        2577.00        2120.00        3344.00
Ops NUMA hint faults                 1678620.00     1586860.00     1475303.00     1285152.00     1512208.00
Ops NUMA hint local faults %         1461203.00     1389234.00     1181790.00     1085517.00     1411194.00
Ops NUMA hint local percent               87.05          87.55          80.10          84.47          93.32
Ops NUMA pages migrated                69473.00       62504.00      121893.00       80802.00       46266.00
Ops AutoNUMA cost                       8412.04        7961.44        7399.05        6444.39        7585.05

Overall, the local hint percentage is better (93.32% for the full series
versus 87.05% for the baseline) but crucially, it's achieved with far
fewer page migrations.

A separate run gathered information from ftrace and analysed it
offline. This is based on an earlier version of the series but the changes
are not significant enough to warrant a rerun as there are no changes in
the NUMA balancing optimisations.

                                             5.6.0-rc1       5.6.0-rc1
                                           baseline-v2   stopsearch-v2
Ops Migrate failed no CPU                      1871.00          689.00
Ops Migrate failed move to   idle                 0.00            0.00
Ops Migrate failed swap task fail               872.00          568.00
Ops Task Migrated swapped                      6702.00         3344.00
Ops Task Migrated swapped local NID               0.00            0.00
Ops Task Migrated swapped within group         1094.00          124.00
Ops Task Migrated idle CPU                    14409.00        14610.00
Ops Task Migrated idle CPU local NID              0.00            0.00
Ops Task Migrate retry                         2355.00         1074.00
Ops Task Migrate retry success                    0.00            0.00
Ops Task Migrate retry failed                  2355.00         1074.00
Ops Load Balance cross NUMA                 1248401.00      1261853.00

"Migrate failed no CPU" is the times when NUMA balancing did not
find a suitable page on a preferred node. This is increased because
the series avoids making decisions that the LB would override.

"Migrate failed swap task fail" is when migrate_swap fails and it
can fail for a lot of reasons.

"Task Migrated swapped" is lower which would would be a concern but in
this test, locality was higher unlike the test with tracing disabled.
This event triggers when two tasks are swapped to keep load neutral or
improved from the perspective of the load balancer. The series attempts
to swap tasks that both move to their preferred node.

"Task Migrated idle CPU" is similar and while the the series does try to
avoid NUMA Balancer and LB fighting each other, it also continues to
obey overall CPU load balancer.

"Task Migrate retry failed" happens when NUMA balancing makes multiple
attempts to place a task on a preferred node. It is slightly reduced here
but it would generally be expected to happen to maintain CPU load balance.
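
The counters above come from mmtests' own trace post-processing. Purely as
an illustration of the kind of counting involved, and not the actual
tooling (which also inspects the event fields), a minimal standalone
filter over a raw ftrace dump might look like this; the event names are
real, everything else is made up:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Scheduler events of interest for the NUMA balancing analysis */
	static const char *events[] = {
		"sched_move_numa", "sched_stick_numa", "sched_swap_numa",
	};
	unsigned long counts[3] = { 0, 0, 0 };
	char line[1024];
	int i;

	/* Feed it the raw trace, e.g. the contents of the trace file */
	while (fgets(line, sizeof(line), stdin)) {
		for (i = 0; i < 3; i++)
			if (strstr(line, events[i]))
				counts[i]++;
	}

	for (i = 0; i < 3; i++)
		printf("%-16s %lu\n", events[i], counts[i]);

	return 0;
}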

A variety of other workloads were evaluated and appear to be mostly
neutral or improved. netperf running on localhost shows gains between 1-8%
depending on the machine. hackbench is a mixed bag -- small regressions
on one machine around 1-2% depending on the group count but up to a 15%
gain on another machine. dbench looks generally ok with very small
performance gains. pgbench looks ok with small gains and losses, much of
which is within the noise. schbench (a Facebook workload that is sensitive
to wakeup latencies) is mostly good. The autonuma benchmark also generally
looks good; most differences are within the noise but with higher locality
success rates and fewer page migrations. Other long-lived workloads are
still running but I'm not expecting many surprises.

 include/linux/sched.h        |  31 ++-
 include/trace/events/sched.h |  49 ++--
 kernel/sched/core.c          |  13 -
 kernel/sched/debug.c         |  17 +-
 kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
 kernel/sched/pelt.c          |  59 ++--
 kernel/sched/sched.h         |  42 ++-
 7 files changed, 535 insertions(+), 302 deletions(-)

-- 
2.16.4



* [PATCH 01/13] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24  9:52 ` [PATCH 02/13] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

The following XFS commit:

  8ab39f11d974 ("xfs: prevent CIL push holdoff in log recovery")

changed the logic from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were increased.

The current pattern of the task migration is straight-forward. With XFS,
an IO issuer delegates work to xlog_cil_push_work() on an unbound kworker.
This runs on a nearby CPU and on completion, dbench wakes up on its old CPU
as it is still idle and no migration occurs. dbench then queues the real
IO on the blk_mq_requeue_work() work item which runs on a bound kworker
which is forced to run on the same CPU as dbench. When IO completes,
the bound kworker wakes dbench but as the kworker is a bound, real task,
the CPU is not considered idle and dbench gets migrated by
select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs
for a while but ultimately it starts a round-robin of all CPUs sharing
the same LLC. High-frequency migration on each IO completion has poor
performance overall. It has negative implications both in communication
costs and power management. mpstat confirmed that at low thread counts,
all CPUs sharing an LLC have low levels of activity.

Note that even if the CIL patch were reverted, there would still
be migrations but the impact is less noticeable. It turns out that
individually the scheduler, XFS, blk-mq and workqueues all made sensible
decisions but in combination, the overall effect was sub-optimal.

This patch special cases the IO issue/completion pattern and allows
a bound kworker waker and a task wakee to stack on the same CPU if
there is a strong chance they are directly related. The expectation
is that the kworker is likely going back to sleep shortly. This is not
guaranteed as the IO could be queued asynchronously but there is a very
strong relationship between the task and kworker in this case that would
justify stacking on the same CPU instead of migrating. There should be
few concerns about kworker starvation given that the special casing is
only when the kworker is the waker.

DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem

UMA machine with 8 cores sharing LLC
                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124           kworkerstack
Amean     1        22.63 (   0.00%)       20.54 *   9.23%*
Amean     2        25.56 (   0.00%)       23.40 *   8.44%*
Amean     4        28.63 (   0.00%)       27.85 *   2.70%*
Amean     8        37.66 (   0.00%)       37.68 (  -0.05%)
Amean     64      469.47 (   0.00%)      468.26 (   0.26%)
Stddev    1         1.00 (   0.00%)        0.72 (  28.12%)
Stddev    2         1.62 (   0.00%)        1.97 ( -21.54%)
Stddev    4         2.53 (   0.00%)        3.58 ( -41.19%)
Stddev    8         5.30 (   0.00%)        5.20 (   1.92%)
Stddev    64       86.36 (   0.00%)       94.53 (  -9.46%)

NUMA machine, 48 CPUs total, 24 CPUs share cache
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Amean     1         58.69 (   0.00%)       30.21 *  48.53%*
Amean     2         60.90 (   0.00%)       35.29 *  42.05%*
Amean     4         66.77 (   0.00%)       46.55 *  30.28%*
Amean     8         81.41 (   0.00%)       68.46 *  15.91%*
Amean     16       113.29 (   0.00%)      107.79 *   4.85%*
Amean     32       199.10 (   0.00%)      198.22 *   0.44%*
Amean     64       478.99 (   0.00%)      477.06 *   0.40%*
Amean     128     1345.26 (   0.00%)     1372.64 *  -2.04%*
Stddev    1          2.64 (   0.00%)        4.17 ( -58.08%)
Stddev    2          4.35 (   0.00%)        5.38 ( -23.73%)
Stddev    4          6.77 (   0.00%)        6.56 (   3.00%)
Stddev    8         11.61 (   0.00%)       10.91 (   6.04%)
Stddev    16        18.63 (   0.00%)       19.19 (  -3.01%)
Stddev    32        38.71 (   0.00%)       38.30 (   1.06%)
Stddev    64       100.28 (   0.00%)       91.24 (   9.02%)
Stddev    128      186.87 (   0.00%)      160.34 (  14.20%)

Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench than a throughput
metric as the benchmark makes many different system calls that are not
throughput-related.

The patch shows a 9.23% and 48.53% reduction in the time to process a load
file with the difference partially explained by the number of CPUs sharing
an LLC. In a separate run, task migrations were almost eliminated by the
patch for low client counts. In case people have issues with the metric
used for the benchmark, this is a comparison of the throughputs as
reported by dbench on the NUMA machine.

dbench4 Throughput (misleading but traditional)
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Hmean     1        321.41 (   0.00%)      617.82 *  92.22%*
Hmean     2        622.87 (   0.00%)     1066.80 *  71.27%*
Hmean     4       1134.56 (   0.00%)     1623.74 *  43.12%*
Hmean     8       1869.96 (   0.00%)     2212.67 *  18.33%*
Hmean     16      2673.11 (   0.00%)     2806.13 *   4.98%*
Hmean     32      3032.74 (   0.00%)     3039.54 (   0.22%)
Hmean     64      2514.25 (   0.00%)     2498.96 *  -0.61%*
Hmean     128     1778.49 (   0.00%)     1746.05 *  -1.82%*

Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 11 -----------
 kernel/sched/fair.c  | 14 ++++++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 377ec26e9159..e94819d573be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1447,17 +1447,6 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_SMP
 
-static inline bool is_per_cpu_kthread(struct task_struct *p)
-{
-	if (!(p->flags & PF_KTHREAD))
-		return false;
-
-	if (p->nr_cpus_allowed != 1)
-		return false;
-
-	return true;
-}
-
 /*
  * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7e11b1bb64c..ef3eb36ba5c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5970,6 +5970,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
+	/*
+	 * Allow a per-cpu kthread to stack with the wakee if the
+	 * kworker thread and the tasks previous CPUs are the same.
+	 * The assumption is that the wakee queued work for the
+	 * per-cpu kthread that is now complete and the wakeup is
+	 * essentially a sync wakeup. An obvious example of this
+	 * pattern is IO completions.
+	 */
+	if (is_per_cpu_kthread(current) &&
+	    prev == smp_processor_id() &&
+	    this_rq()->nr_running <= 1) {
+		return prev;
+	}
+
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	if (recent_used_cpu != prev &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 878910e8b299..2be96f9e5b1e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2484,3 +2484,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+#ifdef CONFIG_SMP
+static inline bool is_per_cpu_kthread(struct task_struct *p)
+{
+	if (!(p->flags & PF_KTHREAD))
+		return false;
+
+	if (p->nr_cpus_allowed != 1)
+		return false;
+
+	return true;
+}
+#endif
-- 
2.16.4



* [PATCH 02/13] sched/numa: Trace when no candidate CPU was found on the preferred node
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
  2020-02-24  9:52 ` [PATCH 01/13] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 03/13] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node. The case where no candidate CPU could be found is
not traced, which is an important gap. The tracepoint is not fired when
the task is not allowed to run on any CPU on the preferred node or when
the task is already running on the target CPU, but neither is an
interesting corner case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef3eb36ba5c4..d41a2b37694f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1848,8 +1848,10 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/* No better CPU than the current one was found. */
-	if (env.best_cpu == -1)
+	if (env.best_cpu == -1) {
+		trace_sched_stick_numa(p, env.src_cpu, -1);
 		return -EAGAIN;
+	}
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
-- 
2.16.4



* [PATCH 03/13] sched/numa: Distinguish between the different task_numa_migrate failure cases
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
  2020-02-24  9:52 ` [PATCH 01/13] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
  2020-02-24  9:52 ` [PATCH 02/13] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] sched/numa: Distinguish between the different task_numa_migrate() " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 04/13] sched/fair: Reorder enqueue/dequeue_task_fair path Mel Gorman
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node but, from the trace, it's not possible to tell the
difference between "no CPU found", "migration to idle CPU failed" and
"tasks could not be swapped". Extend the tracepoint accordingly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/trace/events/sched.h | 49 ++++++++++++++++++++++++--------------------
 kernel/sched/fair.c          |  6 +++---
 2 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..f5b75c5fef7e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -487,7 +487,11 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
-DECLARE_EVENT_CLASS(sched_move_task_template,
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+TRACE_EVENT(sched_move_numa,
 
 	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
 
@@ -519,23 +523,7 @@ DECLARE_EVENT_CLASS(sched_move_task_template,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
-/*
- * Tracks migration of tasks from one runqueue to another. Can be used to
- * detect if automatic NUMA balancing is bouncing between nodes
- */
-DEFINE_EVENT(sched_move_task_template, sched_move_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
-
-	TP_ARGS(tsk, src_cpu, dst_cpu)
-);
-
-DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
-
-	TP_ARGS(tsk, src_cpu, dst_cpu)
-);
-
-TRACE_EVENT(sched_swap_numa,
+DECLARE_EVENT_CLASS(sched_numa_pair_template,
 
 	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
 		 struct task_struct *dst_tsk, int dst_cpu),
@@ -561,11 +549,11 @@ TRACE_EVENT(sched_swap_numa,
 		__entry->src_ngid	= task_numa_group_id(src_tsk);
 		__entry->src_cpu	= src_cpu;
 		__entry->src_nid	= cpu_to_node(src_cpu);
-		__entry->dst_pid	= task_pid_nr(dst_tsk);
-		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
-		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
+		__entry->dst_pid	= dst_tsk ? task_pid_nr(dst_tsk) : 0;
+		__entry->dst_tgid	= dst_tsk ? task_tgid_nr(dst_tsk) : 0;
+		__entry->dst_ngid	= dst_tsk ? task_numa_group_id(dst_tsk) : 0;
 		__entry->dst_cpu	= dst_cpu;
-		__entry->dst_nid	= cpu_to_node(dst_cpu);
+		__entry->dst_nid	= dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1;
 	),
 
 	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
@@ -575,6 +563,23 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
+DEFINE_EVENT(sched_numa_pair_template, sched_stick_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
+);
+
+DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
+);
+
+
 /*
  * Tracepoint for waking a polling cpu without an IPI.
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d41a2b37694f..6c866fb2129c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1849,7 +1849,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 	/* No better CPU than the current one was found. */
 	if (env.best_cpu == -1) {
-		trace_sched_stick_numa(p, env.src_cpu, -1);
+		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
 		return -EAGAIN;
 	}
 
@@ -1858,7 +1858,7 @@ static int task_numa_migrate(struct task_struct *p)
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
-			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
+			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
 		return ret;
 	}
 
@@ -1866,7 +1866,7 @@ static int task_numa_migrate(struct task_struct *p)
 	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
-		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
+		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
 	put_task_struct(env.best_task);
 	return ret;
 }
-- 
2.16.4



* [PATCH 04/13] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (2 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 03/13] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2020-02-24  9:52 ` [PATCH 05/13] sched/numa: Replace runnable_load_avg by load_avg Mel Gorman
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

From: Vincent Guittot <vincent.guittot@linaro.org>

The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split into two distinct parts for a throttled cfs_rq, without any added
value but making the code less readable.

Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.

In addition, the same ordering of steps is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running

This reordering enables the use of h_nr_running in the PELT algorithm.

No functional or performance changes are expected, and none have been
noticed during tests.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c866fb2129c..4395951b1530 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5261,32 +5261,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto enqueue_throttle;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 	}
 
+enqueue_throttle:
 	if (!se) {
 		add_nr_running(rq, 1);
 		/*
@@ -5347,17 +5346,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto dequeue_throttle;
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5375,16 +5370,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 	}
 
+dequeue_throttle:
 	if (!se)
 		sub_nr_running(rq, 1);
 
-- 
2.16.4



* [PATCH 05/13] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (3 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 04/13] sched/fair: Reorder enqueue/dequeue_task_fair path Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2020-02-24  9:52 ` [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

From: Vincent Guittot <vincent.guittot@linaro.org>

Similarly to what has been done for the normal load balancer, we can
replace runnable_load_avg by load_avg in NUMA load balancing and track
other statistics like the utilization and the number of running tasks to
get a better view of the current state of a node.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4395951b1530..975d7c1554de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
+/*
+ * 'numa_type' describes the node at the moment of load balancing.
+ */
+enum numa_type {
+	/* The node has spare capacity that can be used to run more tasks.  */
+	node_has_spare = 0,
+	/*
+	 * The node is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
+	node_fully_busy,
+	/*
+	 * The node is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
+	node_overloaded
+};
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
 	unsigned long load;
-
+	unsigned long util;
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
+	unsigned int nr_running;
+	unsigned int weight;
+	enum numa_type node_type;
 };
 
-/*
- * XXX borrowed from update_sg_lb_stats
- */
-static void update_numa_stats(struct numa_stats *ns, int nid)
-{
-	int cpu;
-
-	memset(ns, 0, sizeof(*ns));
-	for_each_cpu(cpu, cpumask_of_node(nid)) {
-		struct rq *rq = cpu_rq(cpu);
-
-		ns->load += cpu_runnable_load(rq);
-		ns->compute_capacity += capacity_of(cpu);
-	}
-
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1521,6 +1518,47 @@ struct task_numa_env {
 	int best_cpu;
 };
 
+static unsigned long cpu_load(struct rq *rq);
+static unsigned long cpu_util(int cpu);
+
+static inline enum
+numa_type numa_classify(unsigned int imbalance_pct,
+			 struct numa_stats *ns)
+{
+	if ((ns->nr_running > ns->weight) &&
+	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
+		return node_overloaded;
+
+	if ((ns->nr_running < ns->weight) ||
+	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
+		return node_has_spare;
+
+	return node_fully_busy;
+}
+
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct task_numa_env *env,
+			      struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->load += cpu_load(rq);
+		ns->util += cpu_util(cpu);
+		ns->nr_running += rq->cfs.h_nr_running;
+		ns->compute_capacity += capacity_of(cpu);
+	}
+
+	ns->weight = cpumask_weight(cpumask_of_node(nid));
+
+	ns->node_type = numa_classify(env->imbalance_pct, ns);
+}
+
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
@@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
+
+	/* If dst node has spare capacity, there is no real load imbalance */
+	if (env->dst_stats.node_type == node_has_spare)
+		return false;
+
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
@@ -3687,11 +3730,6 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.runnable_load_avg;
-}
-
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
-- 
2.16.4



* [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (4 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 05/13] sched/numa: Replace runnable_load_avg by load_avg Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 07/13] sched/pelt: Remove unused runnable load average Mel Gorman
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

The standard load balancer generally tries to keep the number of running
tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
tasks to move if there is spare capacity but this causes a conflict and
utilisation between NUMA nodes gets badly skewed. This patch uses similar
logic between the NUMA balancer and load balancer when deciding if a task
migrating to its preferred node can use an idle CPU.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 975d7c1554de..c5e0f8b584e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1520,6 +1520,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
-
-	/* If dst node has spare capacity, there is no real load imbalance */
-	if (env->dst_stats.node_type == node_has_spare)
-		return false;
-
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1757,19 +1753,42 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
-	long src_load, dst_load, load;
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
-
 	/*
-	 * If the improvement from just moving env->p direction is better
-	 * than swapping tasks around, check if a move is possible.
+	 * If dst node has spare capacity, then check if there is an
+	 * imbalance that would be overruled by the load balancer.
 	 */
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
+	if (env->dst_stats.node_type == node_has_spare) {
+		unsigned int imbalance;
+		int src_running, dst_running;
+
+		/*
+		 * Would movement cause an imbalance? Note that if src has
+		 * more running tasks that the imbalance is ignored as the
+		 * move improves the imbalance from the perspective of the
+		 * CPU load balancer.
+		 * */
+		src_running = env->src_stats.nr_running - 1;
+		dst_running = env->dst_stats.nr_running + 1;
+		imbalance = max(0, dst_running - src_running);
+		imbalance = adjust_numa_imbalance(imbalance, src_running);
+
+		/* Use idle CPU if there is no imbalance */
+		if (!imbalance)
+			maymove = true;
+	} else {
+		long src_load, dst_load, load;
+		/*
+		 * If the improvement from just moving env->p direction is better
+		 * than swapping tasks around, check if a move is possible.
+		 */
+		load = task_h_load(env->p);
+		dst_load = env->dst_stats.load + load;
+		src_load = env->src_stats.load - load;
+		maymove = !load_too_imbalanced(src_load, dst_load, env);
+	}
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
@@ -8695,6 +8714,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+{
+	unsigned int imbalance_min;
+
+	/*
+	 * Allow a small imbalance based on a simple pair of communicating
+	 * tasks that remain local when the source domain is almost idle.
+	 */
+	imbalance_min = 2;
+	if (src_nr_running <= imbalance_min)
+		return 0;
+
+	return imbalance;
+}
+
 /**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
@@ -8791,24 +8825,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA) {
-			unsigned int imbalance_min;
-
-			/*
-			 * Compute an allowed imbalance based on a simple
-			 * pair of communicating tasks that should remain
-			 * local and ignore them.
-			 *
-			 * NOTE: Generally this would have been based on
-			 * the domain size and this was evaluated. However,
-			 * the benefit is similar across a range of workloads
-			 * and machines but scaling by the domain size adds
-			 * the risk that lower domains have to be rebalanced.
-			 */
-			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
-				env->imbalance = 0;
-		}
+		if (env->sd->flags & SD_NUMA)
+			env->imbalance = adjust_numa_imbalance(env->imbalance,
+						busiest->sum_nr_running);
 
 		return;
 	}
-- 
2.16.4



* [PATCH 07/13] sched/pelt: Remove unused runnable load average
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (5 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2020-02-24  9:52 ` [PATCH 08/13] sched/pelt: Add a new runnable average signal Mel Gorman
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

From: Vincent Guittot <vincent.guittot@linaro.org>

Now that runnable_load_avg is no longer used, we can remove it to make
space for a new signal.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched.h |   5 +-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |   8 ----
 kernel/sched/fair.c   | 130 +++++++-------------------------------------------
 kernel/sched/pelt.c   |  62 ++++++++++--------------
 kernel/sched/sched.h  |   7 +--
 6 files changed, 43 insertions(+), 171 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..037eaffabc24 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -357,7 +357,7 @@ struct util_est {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
@@ -401,11 +401,9 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
-	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -449,7 +447,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e94819d573be..8e6f38073ab3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,7 +761,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -774,7 +773,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 879d3ccf3806..cfecaad387c0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -402,11 +402,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,11 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -947,13 +942,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
-	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c5e0f8b584e6..1ea730fbb806 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -741,9 +741,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+		sa->load_avg = scale_load_down(se->load.weight);
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2898,25 +2896,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 } while (0)
 
 #ifdef CONFIG_SMP
-static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight += se->runnable_weight;
-
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
-}
-
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight -= se->runnable_weight;
-
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
-}
-
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -2932,28 +2911,22 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 #else
 static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
 		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2961,16 +2934,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
+	if (se->on_rq)
 		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
-	}
+
 }
 
 void reweight_task(struct task_struct *p, int prio)
@@ -2980,7 +2950,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -3092,50 +3062,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	 */
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
-
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3147,7 +3073,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
 	if (!gcfs_rq)
 		return;
@@ -3156,16 +3082,15 @@ static void update_cfs_group(struct sched_entity *se)
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3290,11 +3215,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial and simply copies the running
- * sum over (but still wrong, because the group entity and group rq do not have
- * their PELT windows aligned).
+ * Per the above update_tg_cfs_util() is trivial  * and simply copies the
+ * running sum over (but still wrong, because the group entity and group rq do
+ * not have their PELT windows aligned).
  *
- * However, update_tg_cfs_runnable() is more complex. So we have:
+ * However, update_tg_cfs_load() is more complex. So we have:
  *
  *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
  *
@@ -3375,11 +3300,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 }
 
 static inline void
-update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3427,20 +3352,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-
-	if (se->on_rq) {
-		delta_sum = runnable_load_sum -
-				se_weight(se) * se->avg.runnable_load_sum;
-		delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3468,7 +3379,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
-	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
 	trace_pelt_se_tp(se);
@@ -3613,8 +3524,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
@@ -4075,14 +3984,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	update_cfs_group(se);
-	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4159,13 +4066,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -7650,9 +7555,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b79b360..3eb0ed333dcb 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,8 +121,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -148,8 +146,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -186,7 +182,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, int running)
 {
 	u64 delta;
 
@@ -222,7 +218,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		runnable = running = 0;
+		running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -231,14 +227,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+___update_load_avg(struct sched_avg *sa, unsigned long load)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -246,7 +242,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -254,17 +249,13 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  * sched_entity:
  *
  *   task:
- *     se_runnable() == se_weight()
+ *     se_weight()   = se->load.weight
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
  *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
- *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *   load_sum := runnable
+ *   load_avg = se_weight(se) * load_sum
  *
  * XXX collapse load_sum and runnable_load_sum
  *
@@ -272,15 +263,12 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+	if (___update_load_sum(now, &se->avg, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
 	}
@@ -290,10 +278,9 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -306,10 +293,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1, 1);
+		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
 		return 1;
 	}
@@ -322,20 +308,19 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
  *
- *   load_avg and runnable_load_avg are not supported and meaningless.
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1, 1);
+		___update_load_avg(&rq->avg_rt, 1);
 		trace_pelt_rt_tp(rq);
 		return 1;
 	}
@@ -348,18 +333,19 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1, 1);
+		___update_load_avg(&rq->avg_dl, 1);
 		trace_pelt_dl_tp(rq);
 		return 1;
 	}
@@ -373,7 +359,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
@@ -401,16 +389,14 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * rq->clock += delta with delta >= running
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
-				0,
 				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
-				1,
 				1,
 				1);
 
 	if (ret) {
-		___update_load_avg(&rq->avg_irq, 1, 1);
+		___update_load_avg(&rq->avg_irq, 1);
 		trace_pelt_irq_tp(rq);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2be96f9e5b1e..4f653ef89f1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -489,7 +489,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
@@ -688,8 +687,10 @@ struct dl_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
+
 #else
 #define entity_is_task(se)	1
+
 #endif
 
 #ifdef CONFIG_SMP
@@ -701,10 +702,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
 
 static inline bool sched_asym_prefer(int a, int b)
 {
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 08/13] sched/pelt: Add a new runnable average signal
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (6 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 07/13] sched/pelt: Remove unused runnable load average Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2020-02-24  9:52 ` [PATCH 09/13] sched/fair: Take into account runnable_avg to classify group Mel Gorman
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

From: Vincent Guittot <vincent.guittot@linaro.org>

Now that runnable_load_avg has been removed, we can replace it with a new
signal that highlights the runnable pressure on a cfs_rq. This signal
tracks the waiting time of tasks on the rq and can help to better define
the state of rqs.

Currently, only util_avg is used to define the state of an rq:
  An rq with more than around 80% utilization and more than one task is
  considered overloaded.

But the util_avg signal of an rq can become temporarily low after a task
has migrated onto another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg because it includes the waiting time, and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply
adds the waiting time to the running time. The runnable_avg of a cfs_rq
will be the sum of its sched entities' runnable_avg, and the runnable_avg
of a group entity will follow that of its runqueue, similarly to util_avg.
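
To see why the two signals diverge under contention, here is a minimal
userspace sketch (not kernel code; the constants are simplified stand-ins
for those in kernel/sched/pelt.c) that approximates the PELT sums for a
task that is runnable 100% of the time but running only half of it:

  #include <stdio.h>

  #define PERIOD_US	1024
  #define DECAY		0.97857206	/* y, with y^32 = 0.5 as in PELT */
  #define MAX_SUM	(PERIOD_US / (1.0 - DECAY))

  int main(void)
  {
  	double runnable_sum = 0.0, util_sum = 0.0;

  	for (int ms = 0; ms < 1000; ms++) {
  		/* runnable in every period: waiting or running */
  		runnable_sum = runnable_sum * DECAY + PERIOD_US;
  		/* running in only every other period */
  		util_sum = util_sum * DECAY + ((ms & 1) ? PERIOD_US : 0);
  	}

  	printf("runnable_avg ~ %.0f\n", 1024.0 * runnable_sum / MAX_SUM);
  	printf("util_avg     ~ %.0f\n", 1024.0 * util_sum / MAX_SUM);
  	return 0;
  }

The runnable figure converges near full capacity (~1024) while the util
figure settles around half of it, which is the gap the rq classification
can exploit.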

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched.h | 26 ++++++++++-------
 kernel/sched/debug.c  |  9 ++++--
 kernel/sched/fair.c   | 77 ++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/pelt.c   | 39 ++++++++++++++++++--------
 kernel/sched/sched.h  | 22 ++++++++++++++-
 5 files changed, 142 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 037eaffabc24..2e9199bf947b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,28 +356,30 @@ struct util_est {
 } __attribute__((__aligned__(sizeof(u64))));
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series
+ * The load/runnable/util_avg accumulates an infinite geometric series
  * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
  *   load_avg = runnable% * scale_load_down(load)
  *
- * where runnable% is the time ratio that a sched_entity is runnable.
- * For cfs_rq, it is the aggregated load_avg of all runnable and
- * blocked sched_entities.
+ * [runnable_avg definition]
+ *
+ *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
  *
  * [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
  *
- * where running% is the time ratio that a sched_entity is running on
- * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
- * and blocked sched_entities.
+ * where runnable% is the time ratio that a sched_entity is runnable and
+ * running% the time ratio that a sched_entity is running.
+ *
+ * For cfs_rq, they are the aggregated values of all runnable and blocked
+ * sched_entities.
  *
- * load_avg and util_avg don't direcly factor frequency scaling and CPU
- * capacity scaling. The scaling is done through the rq_clock_pelt that
- * is used for computing those signals (see update_rq_clock_pelt())
+ * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that is used
+ * for computing those signals (see update_rq_clock_pelt())
  *
  * N.B., the above ratios (runnable% and running%) themselves are in the
  * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
@@ -401,9 +403,11 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -467,6 +471,8 @@ struct sched_entity {
 	struct cfs_rq			*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq			*my_q;
+	/* cached value of my_q->h_nr_running */
+	unsigned long			runnable_weight;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfecaad387c0..8331bc04aea2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -405,6 +405,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,6 +525,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_avg",
+			cfs_rq->avg.runnable_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
+			cfs_rq->removed.runnable_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
@@ -944,8 +947,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ea730fbb806..24fbbb588df2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -794,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 		}
 	}
 
+	sa->runnable_avg = cpu_scale;
+
 	if (p->sched_class != &fair_sched_class) {
 		/*
 		 * For !fair tasks do:
@@ -3215,9 +3217,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial  * and simply copies the
- * running sum over (but still wrong, because the group entity and group rq do
- * not have their PELT windows aligned).
+ * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial
+ * and simply copies the running/runnable sum over (but still wrong, because
+ * the group entity and group rq do not have their PELT windows aligned).
  *
  * However, update_tg_cfs_load() is more complex. So we have:
  *
@@ -3299,6 +3301,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static inline void
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+{
+	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
+	/* Set new sched_entity's runnable */
+	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
+	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq runnable */
+	add_positive(&cfs_rq->avg.runnable_avg, delta);
+	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
+}
+
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
@@ -3379,6 +3407,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
@@ -3449,7 +3478,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3460,7 +3489,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
+		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3472,7 +3501,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
+		r = removed_runnable;
+		sub_positive(&sa->runnable_avg, r);
+		sub_positive(&sa->runnable_sum, r * divider);
+
+		/*
+		 * removed_runnable is the unweighted version of removed_load so we
+		 * can use it to estimate removed_load_sum.
+		 */
+		add_tg_cfs_propagate(cfs_rq,
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
 
 		decayed = 1;
 	}
@@ -3518,6 +3556,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 */
 	se->avg.util_sum = se->avg.util_avg * divider;
 
+	se->avg.runnable_sum = se->avg.runnable_avg * divider;
+
 	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
@@ -3527,6 +3567,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
+	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
@@ -3548,6 +3590,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
+	sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
@@ -3654,10 +3698,15 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
-	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
+	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
+static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->avg.runnable_avg;
+}
+
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
@@ -3984,11 +4033,13 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	se_update_runnable(se);
 	update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -4066,11 +4117,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
+	se_update_runnable(se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5241,6 +5294,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running++;
@@ -5338,6 +5392,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
@@ -5410,6 +5465,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable(struct rq *rq)
+{
+	return cfs_rq_runnable_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7555,6 +7615,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
+	if (cfs_rq->avg.runnable_sum)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3eb0ed333dcb..c40d57a2a248 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, int running)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,6 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_sum =
+			decay_load(sa->runnable_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -146,6 +148,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -182,7 +186,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, int running)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -218,7 +222,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		running = 0;
+		runnable = running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -227,7 +231,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, running))
+	if (!accumulate_sum(delta, sa, load, runnable, running))
 		return 0;
 
 	return 1;
@@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_avg = div_u64(sa->runnable_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -250,24 +255,30 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  *   task:
  *     se_weight()   = se->load.weight
+ *     se_runnable() = !!on_rq
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = grq->h_nr_running
+ *
+ *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
+ *   runnable_avg = runnable_sum
  *
  *   load_sum := runnable
  *   load_avg = se_weight(se) * load_sum
  *
- * XXX collapse load_sum and runnable_load_sum
- *
  * cfq_rq:
  *
+ *   runnable_sum = \Sum se->avg.runnable_sum
+ *   runnable_avg = \Sum se->avg.runnable_avg
+ *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0)) {
+	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
 		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
@@ -278,7 +289,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+				cfs_rq->curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
@@ -293,6 +305,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->h_nr_running,
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
@@ -310,13 +323,14 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
+				running,
 				running,
 				running)) {
 
@@ -335,13 +349,14 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
+				running,
 				running,
 				running)) {
 
@@ -361,7 +376,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
@@ -389,9 +404,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * rq->clock += delta with delta >= running
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
+				0,
 				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
+				1,
 				1,
 				1);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4f653ef89f1f..5f6f5de03764 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -527,7 +527,7 @@ struct cfs_rq {
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
-		unsigned long	runnable_sum;
+		unsigned long	runnable_avg;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -688,9 +688,29 @@ struct dl_rq {
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
 
+static inline void se_update_runnable(struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		se->runnable_weight = se->my_q->h_nr_running;
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return !!se->on_rq;
+	else
+		return se->runnable_weight;
+}
+
 #else
 #define entity_is_task(se)	1
 
+static inline void se_update_runnable(struct sched_entity *se) {}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return !!se->on_rq;
+}
 #endif
 
 #ifdef CONFIG_SMP
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 09/13] sched/fair: Take into account runnable_avg to classify group
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (7 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 08/13] sched/pelt: Add a new runnable average signal Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2020-02-24  9:52 ` [PATCH 10/13] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

From: Vincent Guittot <vincent.guittot@linaro.org>

Take into account the new runnable_avg signal to classify a group and to
mitigate the volatility of util_avg in the face of intensive migration or
new tasks with random utilization.
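
As a rough illustration of the new check (plain userspace C, not kernel
code, and assuming the common default imbalance_pct of 117), a group
starts tripping the runnable test once the summed runnable_avg of its
CPUs exceeds roughly 117% of the group capacity:

  #include <stdio.h>

  /* Mirrors the (capacity * imbalance_pct) < (runnable * 100) comparison */
  static int runnable_exceeds_capacity(unsigned long capacity,
  				     unsigned long runnable,
  				     unsigned int imbalance_pct)
  {
  	return (capacity * imbalance_pct) < (runnable * 100);
  }

  int main(void)
  {
  	unsigned long capacity = 1024;		/* one CPU's worth of capacity */
  	unsigned int imbalance_pct = 117;	/* assumed default */

  	for (unsigned long runnable = 1000; runnable <= 1400; runnable += 100)
  		printf("runnable=%lu -> %s\n", runnable,
  		       runnable_exceeds_capacity(capacity, runnable,
  						 imbalance_pct) ?
  		       "over capacity" : "has capacity");
  	return 0;
  }

With these numbers the cut-over sits just below 1199, i.e. between the
1100 and 1200 rows of the output.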

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24fbbb588df2..8ce9a04e7efb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5470,6 +5470,24 @@ static unsigned long cpu_runnable(struct rq *rq)
 	return cfs_rq_runnable_avg(&rq->cfs);
 }
 
+static unsigned long cpu_runnable_without(struct rq *rq, struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq;
+	unsigned int runnable;
+
+	/* Task has no contribution or is new */
+	if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
+		return cpu_runnable(rq);
+
+	cfs_rq = &rq->cfs;
+	runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+
+	/* Discount task's runnable from CPU's runnable */
+	lsub_positive(&runnable, p->se.avg.runnable_avg);
+
+	return runnable;
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7753,7 +7771,8 @@ struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
-	unsigned long group_util; /* Total utilization of the group */
+	unsigned long group_util; /* Total utilization over the CPUs of the group */
+	unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
 	unsigned int sum_nr_running; /* Nr of tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
@@ -7974,6 +7993,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return false;
+
 	if ((sgs->group_capacity * 100) >
 			(sgs->group_util * imbalance_pct))
 		return true;
@@ -7999,6 +8022,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 			(sgs->group_util * imbalance_pct))
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return true;
+
 	return false;
 }
 
@@ -8093,6 +8120,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
+		sgs->group_runnable += cpu_runnable(rq);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
@@ -8368,6 +8396,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 		sgs->group_load += cpu_load_without(rq, p);
 		sgs->group_util += cpu_util_without(i, p);
+		sgs->group_runnable += cpu_runnable_without(rq, p);
 		local = task_running_on_cpu(i, p);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running - local;
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 10/13] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (8 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 09/13] sched/fair: Take into account runnable_avg to classify group Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] sched/numa: Prefer using an idle CPU " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 11/13] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

task_numa_find_cpu can scan a node multiple times. Minimally it scans to
gather statistics and later to find a suitable target. In some cases, the
second scan will simply pick an idle CPU if the load is not imbalanced.

This patch caches information on an idle core while gathering statistics
and uses it immediately if load is not imbalanced to avoid a second scan
of the node runqueues. Preference is given to an idle core rather than an
idle SMT sibling to avoid packing HT siblings due to linearly scanning the
node cpumask.

As a side-effect, even when the second scan is necessary, the importance
of using select_idle_sibling is much reduced because information on idle
CPUs is cached and can be reused.

Note that this patch actually makes it harder to move to an idle CPU
as multiple tasks can race for the same idle CPU when checking
numa_migrate_on. This is addressed in the next patch.
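
The core/SMT preference can be pictured with a small userspace sketch
(purely illustrative; it assumes a made-up 4-CPU machine where CPUs (0,1)
and (2,3) are SMT siblings, and pick_idle_cpu()/sibling_of[] are
hypothetical names): the first idle CPU seen is remembered, but a CPU
whose sibling is also idle wins.

  #include <stdio.h>
  #include <stdbool.h>

  #define NR_CPUS 4

  static const int sibling_of[NR_CPUS] = { 1, 0, 3, 2 };

  static int pick_idle_cpu(const bool idle[NR_CPUS])
  {
  	int idle_cpu = -1, idle_core = -1;

  	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
  		if (!idle[cpu])
  			continue;
  		if (idle_cpu < 0)
  			idle_cpu = cpu;		/* first idle SMT sibling */
  		if (idle_core < 0 && idle[sibling_of[cpu]])
  			idle_core = cpu;	/* whole core is idle */
  	}
  	return idle_core >= 0 ? idle_core : idle_cpu;
  }

  int main(void)
  {
  	/* CPU1 is idle but shares a core with busy CPU0; core (2,3) is idle */
  	bool idle[NR_CPUS] = { false, true, true, true };

  	printf("picked CPU %d\n", pick_idle_cpu(idle));
  	return 0;
  }

Here CPU1 is found first but CPU2 is chosen because its whole core is
idle, which is the preference update_numa_stats() caches via idle_core.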

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 119 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 102 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce9a04e7efb..e285aa0cdd4f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1500,8 +1500,29 @@ struct numa_stats {
 	unsigned int nr_running;
 	unsigned int weight;
 	enum numa_type node_type;
+	int idle_cpu;
 };
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+
+		if (!idle_cpu(sibling))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
+/* Forward declarations of select_idle_sibling helpers */
+static inline bool test_idle_cores(int cpu, bool def);
+
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1537,15 +1558,39 @@ numa_type numa_classify(unsigned int imbalance_pct,
 	return node_fully_busy;
 }
 
+static inline int numa_idle_core(int idle_core, int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (!static_branch_likely(&sched_smt_present) ||
+	    idle_core >= 0 || !test_idle_cores(cpu, false))
+		return idle_core;
+
+	/*
+	 * Prefer cores instead of packing HT siblings
+	 * and triggering future load balancing.
+	 */
+	if (is_core_idle(cpu))
+		idle_core = cpu;
+#endif
+
+	return idle_core;
+}
+
 /*
- * XXX borrowed from update_sg_lb_stats
+ * Gather all necessary information to make NUMA balancing placement
+ * decisions that are compatible with standard load balancer. This
+ * borrows code and logic from update_sg_lb_stats but sharing a
+ * common implementation is impractical.
  */
 static void update_numa_stats(struct task_numa_env *env,
-			      struct numa_stats *ns, int nid)
+			      struct numa_stats *ns, int nid,
+			      bool find_idle)
 {
-	int cpu;
+	int cpu, idle_core = -1;
 
 	memset(ns, 0, sizeof(*ns));
+	ns->idle_cpu = -1;
+
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
@@ -1553,11 +1598,25 @@ static void update_numa_stats(struct task_numa_env *env,
 		ns->util += cpu_util(cpu);
 		ns->nr_running += rq->cfs.h_nr_running;
 		ns->compute_capacity += capacity_of(cpu);
+
+		if (find_idle && !rq->nr_running && idle_cpu(cpu)) {
+			if (READ_ONCE(rq->numa_migrate_on) ||
+			    !cpumask_test_cpu(cpu, env->p->cpus_ptr))
+				continue;
+
+			if (ns->idle_cpu == -1)
+				ns->idle_cpu = cpu;
+
+			idle_core = numa_idle_core(idle_core, cpu);
+		}
 	}
 
 	ns->weight = cpumask_weight(cpumask_of_node(nid));
 
 	ns->node_type = numa_classify(env->imbalance_pct, ns);
+
+	if (idle_core >= 0)
+		ns->idle_cpu = idle_core;
 }
 
 static void task_numa_assign(struct task_numa_env *env,
@@ -1566,7 +1625,7 @@ static void task_numa_assign(struct task_numa_env *env,
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	/* Bail out if run-queue part of active NUMA balance. */
-	if (xchg(&rq->numa_migrate_on, 1))
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
 		return;
 
 	/*
@@ -1730,19 +1789,39 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 assign:
-	/*
-	 * One idle CPU per node is evaluated for a task numa move.
-	 * Call select_idle_sibling to maybe find a better one.
-	 */
+	/* Evaluate an idle CPU for a task numa move. */
 	if (!cur) {
+		int cpu = env->dst_stats.idle_cpu;
+
+		/* Nothing cached so current CPU went idle since the search. */
+		if (cpu < 0)
+			cpu = env->dst_cpu;
+
 		/*
-		 * select_idle_siblings() uses an per-CPU cpumask that
-		 * can be used from IRQ context.
+		 * If the CPU is no longer truly idle and the previous best CPU
+		 * is, keep using it.
 		 */
-		local_irq_disable();
-		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+		if (!idle_cpu(cpu) && env->best_cpu >= 0 &&
+		    idle_cpu(env->best_cpu)) {
+			cpu = env->best_cpu;
+		}
+
+		/*
+		 * Use select_idle_sibling if the previously found idle CPU is
+		 * not idle any more.
+		 */
+		if (!idle_cpu(cpu)) {
+			/*
+			 * select_idle_siblings() uses an per-CPU cpumask that
+			 * can be used from IRQ context.
+			 */
+			local_irq_disable();
+			cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
-		local_irq_enable();
+			local_irq_enable();
+		}
+
+		env->dst_cpu = cpu;
 	}
 
 	task_numa_assign(env, cur, imp);
@@ -1776,8 +1855,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		imbalance = adjust_numa_imbalance(imbalance, src_running);
 
 		/* Use idle CPU if there is no imbalance */
-		if (!imbalance)
+		if (!imbalance) {
 			maymove = true;
+			if (env->dst_stats.idle_cpu >= 0) {
+				env->dst_cpu = env->dst_stats.idle_cpu;
+				task_numa_assign(env, NULL, 0);
+				return;
+			}
+		}
 	} else {
 		long src_load, dst_load, load;
 		/*
@@ -1850,10 +1935,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env, &env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid, false);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid, true);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1886,7 +1971,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid, true);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 11/13] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (9 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 10/13] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 12/13] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

Multiple tasks can attempt to select an idle CPU but fail because
numa_migrate_on is already set and the migration fails. Instead of failing,
scan for an alternative idle CPU. select_idle_sibling is not used because
it requires IRQs to be disabled and it ignores numa_migrate_on, allowing
multiple tasks to stack. This scan may still fail due to races even when
there are idle candidate CPUs but, if this occurs, it's better that a task
stays on an available CPU than moves to a contended one.
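
The claim-before-migrate pattern the scan relies on can be sketched with
C11 atomics standing in for xchg() on rq->numa_migrate_on (hypothetical
userspace code, for illustration only): only one of several racing
claimants gets a given CPU and the losers wrap around to the next
candidate, much like the for_each_cpu_wrap() loop in task_numa_assign().

  #include <stdio.h>
  #include <stdatomic.h>

  #define NR_CPUS 4

  static atomic_int numa_migrate_on[NR_CPUS];

  static int claim_idle_cpu(int start)
  {
  	for (int i = 0; i < NR_CPUS; i++) {
  		int cpu = (start + i) % NR_CPUS;

  		if (!atomic_exchange(&numa_migrate_on[cpu], 1))
  			return cpu;		/* claimed this CPU */
  	}
  	return -1;				/* every candidate was taken */
  }

  int main(void)
  {
  	printf("first claim:  CPU %d\n", claim_idle_cpu(2));
  	printf("second claim: CPU %d\n", claim_idle_cpu(2));
  	return 0;
  }

The second claimant loses the race for CPU 2 and falls back to CPU 3
instead of stacking on the same runqueue.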

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e285aa0cdd4f..ee14cd9815ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1624,15 +1624,34 @@ static void task_numa_assign(struct task_numa_env *env,
 {
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
-	/* Bail out if run-queue part of active NUMA balance. */
-	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
+	/* Check if run-queue part of active NUMA balance. */
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) {
+		int cpu;
+		int start = env->dst_cpu;
+
+		/* Find alternative idle CPU. */
+		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) {
+			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
+			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+				continue;
+			}
+
+			env->dst_cpu = cpu;
+			rq = cpu_rq(env->dst_cpu);
+			if (!xchg(&rq->numa_migrate_on, 1))
+				goto assign;
+		}
+
+		/* Failed to find an alternative idle CPU */
 		return;
+	}
 
+assign:
 	/*
 	 * Clear previous best_cpu/rq numa-migrate flag, since task now
 	 * found a better CPU to move/swap.
 	 */
-	if (env->best_cpu != -1) {
+	if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
 	}
@@ -1806,21 +1825,6 @@ static void task_numa_compare(struct task_numa_env *env,
 			cpu = env->best_cpu;
 		}
 
-		/*
-		 * Use select_idle_sibling if the previously found idle CPU is
-		 * not idle any more.
-		 */
-		if (!idle_cpu(cpu)) {
-			/*
-			 * select_idle_siblings() uses an per-CPU cpumask that
-			 * can be used from IRQ context.
-			 */
-			local_irq_disable();
-			cpu = select_idle_sibling(env->p, env->src_cpu,
-						   env->dst_cpu);
-			local_irq_enable();
-		}
-
 		env->dst_cpu = cpu;
 	}
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 12/13] sched/numa: Bias swapping tasks based on their preferred node
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (10 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 11/13] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-02-24  9:52 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

When swapping tasks for NUMA balancing, it is preferred that tasks move
to or remain on their preferred node. When considering an imbalance,
encourage tasks to move to their preferred node and discourage tasks from
moving away from their preferred node.
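
As a worked example of the bias (illustration only, with a made-up fault
differential of 1600): a candidate already sitting on its preferred node
is discounted by 1/16th, while one that would move back to its preferred
node gains 1/8th.

  #include <stdio.h>

  int main(void)
  {
  	long imp = 1600;	/* assumed fault differential */

  	printf("cur already on its preferred node: imp -> %ld\n",
  	       imp - imp / 16);
  	printf("cur moving to its preferred node:  imp -> %ld\n",
  	       imp + imp / 8);
  	return 0;
  }

So an otherwise equal swap candidate scores 1500 when it is already where
it wants to be and 1800 when the swap would bring it home.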

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee14cd9815ec..63f22b1a5f0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1741,18 +1741,27 @@ static void task_numa_compare(struct task_numa_env *env,
 			goto unlock;
 	}
 
+	/* Skip this swap candidate if cannot move to the source cpu. */
+	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
+		goto unlock;
+
+	/*
+	 * Skip this swap candidate if it is not moving to its preferred
+	 * node and the best task is.
+	 */
+	if (env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid &&
+	    cur->numa_preferred_nid != env->src_nid) {
+		goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
 	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
-	 */
-	/* Skip this swap candidate if cannot move to the source cpu */
-	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
-		goto unlock;
-
-	/*
+	 *
 	 * If dst and source tasks are in the same NUMA group, or not
 	 * in any group then look only at task weights.
 	 */
@@ -1779,12 +1788,34 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
+	/* Discourage picking a task already on its preferred node */
+	if (cur->numa_preferred_nid == env->dst_nid)
+		imp -= imp / 16;
+
+	/*
+	 * Encourage picking a task that moves to its preferred node.
+	 * This potentially makes imp larger than it's maximum of
+	 * 1998 (see SMALLIMP and task_weight for why) but in this
+	 * case, it does not matter.
+	 */
+	if (cur->numa_preferred_nid == env->src_nid)
+		imp += imp / 8;
+
 	if (maymove && moveimp > imp && moveimp > env->best_imp) {
 		imp = moveimp;
 		cur = NULL;
 		goto assign;
 	}
 
+	/*
+	 * Prefer swapping with a task moving to its preferred node over a
+	 * task that is not.
+	 */
+	if (env->best_task && cur->numa_preferred_nid == env->src_nid &&
+	    env->best_task->numa_preferred_nid != env->src_nid) {
+		goto assign;
+	}
+
 	/*
 	 * If the NUMA importance is less than SMALLIMP,
 	 * task migration might only result in ping pong
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (11 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 12/13] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
@ 2020-02-24  9:52 ` Mel Gorman
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-02-24 15:16 ` [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Ingo Molnar
  2020-03-09 19:12 ` Phil Auld
  14 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-24  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

When domains are imbalanced or overloaded, all CPUs on the target domain
are scanned and compared with task_numa_compare. In some circumstances,
a candidate is found that is an obvious win.

o A task can move to an idle CPU and an idle CPU is found
o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.
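
Reduced to a userspace sketch (hypothetical code; task_numa_compare_stub()
stands in for task_numa_compare() returning the new "stop" indication),
the shape of the change is simply an early break out of the per-CPU scan:

  #include <stdio.h>
  #include <stdbool.h>

  static bool task_numa_compare_stub(int cpu)
  {
  	/* pretend CPU 2 is an idle CPU the task may simply move to */
  	return cpu == 2;
  }

  int main(void)
  {
  	int evaluated = 0;

  	for (int cpu = 0; cpu < 8; cpu++) {
  		evaluated++;
  		if (task_numa_compare_stub(cpu))
  			break;	/* obvious win: stop the search early */
  	}
  	printf("evaluated %d of 8 CPUs\n", evaluated);
  	return 0;
  }

Only three of the eight CPUs are evaluated, which is where the saving in
search cost comes from.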

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63f22b1a5f0b..3f51586365f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	long moveimp = imp;
 	long load;
+	bool stopsearch = false;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.
 	 */
-	if (cur == env->p)
+	if (cur == env->p) {
+		stopsearch = true;
 		goto unlock;
+	}
 
 	if (!cur) {
 		if (maymove && moveimp >= env->best_imp)
@@ -1860,8 +1863,27 @@ static void task_numa_compare(struct task_numa_env *env,
 	}
 
 	task_numa_assign(env, cur, imp);
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
+		stopsearch = true;
+
+	/*
+	 * If a swap candidate must be identified and the current best task
+	 * moves its preferred node then stop the search.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		stopsearch = true;
+	}
 unlock:
 	rcu_read_unlock();
+
+	return stopsearch;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1916,7 +1938,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (12 preceding siblings ...)
  2020-02-24  9:52 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
@ 2020-02-24 15:16 ` Ingo Molnar
  2020-02-25 11:59   ` Mel Gorman
  2020-03-09 19:12 ` Phil Auld
  14 siblings, 1 reply; 86+ messages in thread
From: Ingo Molnar @ 2020-02-24 15:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> The only differences in V6 are due to Vincent's latest patch series.
> 
> This is V5 which includes the latest versions of Vincent's patch
> addressing review feedback. Patches 4-9 are Vincent's work plus one
> important performance fix. Vincent's patches were retested and while
> not presented in detail, it was mostly an improvement.
> 
> Changelog since V5:
> o Import Vincent's latest patch set

>  include/linux/sched.h        |  31 ++-
>  include/trace/events/sched.h |  49 ++--
>  kernel/sched/core.c          |  13 -
>  kernel/sched/debug.c         |  17 +-
>  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
>  kernel/sched/pelt.c          |  59 ++--
>  kernel/sched/sched.h         |  42 ++-
>  7 files changed, 535 insertions(+), 302 deletions(-)

Applied to tip:sched/core for v5.7 inclusion, thanks Mel and Vincent!

Please base future iterations on top of a0f03b617c3b (current 
sched/core).

Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found
  2020-02-24  9:52 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Valentin Schneider,
	Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a0f03b617c3b2644d3d47bf7d9e60aed01bd5b10
Gitweb:        https://git.kernel.org/tip/a0f03b617c3b2644d3d47bf7d9e60aed01bd5b10
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:23 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:40 +01:00

sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found

When domains are imbalanced or overloaded, all CPUs on the target domain
are searched and compared with task_numa_compare(). In some circumstances,
a candidate is found that is an obvious win.

 o A task can move to an idle CPU and an idle CPU is found
 o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-14-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1ac01..fcc9686 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	long moveimp = imp;
 	long load;
+	bool stopsearch = false;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.
 	 */
-	if (cur == env->p)
+	if (cur == env->p) {
+		stopsearch = true;
 		goto unlock;
+	}
 
 	if (!cur) {
 		if (maymove && moveimp >= env->best_imp)
@@ -1860,8 +1863,27 @@ assign:
 	}
 
 	task_numa_assign(env, cur, imp);
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
+		stopsearch = true;
+
+	/*
+	 * If a swap candidate must be identified and the current best task
+	 * moves its preferred node then stop the search.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		stopsearch = true;
+	}
 unlock:
 	rcu_read_unlock();
+
+	return stopsearch;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1916,7 +1938,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks
  2020-02-24  9:52 ` [PATCH 10/13] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Valentin Schneider,
	Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ff7db0bf24db919f69121bf5df8f3cb6d79f49af
Gitweb:        https://git.kernel.org/tip/ff7db0bf24db919f69121bf5df8f3cb6d79f49af
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:20 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:38 +01:00

sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks

task_numa_find_cpu() can scan a node multiple times. At a minimum, it scans
to gather statistics and scans again later to find a suitable target. In
some cases, the second scan will simply pick an idle CPU if the load is not
imbalanced.

This patch caches information on an idle core while gathering statistics
and uses it immediately if load is not imbalanced to avoid a second scan
of the node runqueues. Preference is given to an idle core rather than an
idle SMT sibling to avoid packing HT siblings due to linearly scanning the
node cpumask.

As a side-effect, even when the second scan is necessary, the importance
of using select_idle_sibling is much reduced because information on idle
CPUs is cached and can be reused.

Note that this patch actually makes it harder to move to an idle CPU
as multiple tasks can race for the same idle CPU due to a race checking
numa_migrate_on. This is addressed in the next patch.
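
A minimal, self-contained sketch of the caching idea in plain C (the arrays
and flags below are made up for illustration; the real logic is in
update_numa_stats() and numa_idle_core() in the hunks that follow):

/* While gathering statistics, remember one idle CPU, preferring an idle core. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_info {
	int id;
	bool idle;
	bool core_idle;		/* all SMT siblings idle as well */
};

int main(void)
{
	struct cpu_info node[] = {
		{ 0, false, false },
		{ 1, true,  false },	/* idle, but its sibling is busy */
		{ 2, true,  true },	/* whole core idle: preferred */
		{ 3, false, false },
	};
	int idle_cpu = -1, idle_core = -1;
	unsigned int nr_running = 0;

	for (unsigned int i = 0; i < sizeof(node) / sizeof(node[0]); i++) {
		nr_running += node[i].idle ? 0 : 1;	/* statistics gathering */

		if (node[i].idle && idle_cpu == -1)
			idle_cpu = node[i].id;
		if (node[i].idle && node[i].core_idle && idle_core == -1)
			idle_core = node[i].id;
	}

	if (idle_core >= 0)
		idle_cpu = idle_core;	/* prefer the idle core over a lone sibling */

	printf("nr_running=%u, cached idle CPU=%d\n", nr_running, idle_cpu);
	return 0;
}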

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-11-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 119 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 102 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87521ac..2da21f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1500,8 +1500,29 @@ struct numa_stats {
 	unsigned int nr_running;
 	unsigned int weight;
 	enum numa_type node_type;
+	int idle_cpu;
 };
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+
+		if (!idle_cpu(cpu))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
+/* Forward declarations of select_idle_sibling helpers */
+static inline bool test_idle_cores(int cpu, bool def);
+
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1537,15 +1558,39 @@ numa_type numa_classify(unsigned int imbalance_pct,
 	return node_fully_busy;
 }
 
+static inline int numa_idle_core(int idle_core, int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (!static_branch_likely(&sched_smt_present) ||
+	    idle_core >= 0 || !test_idle_cores(cpu, false))
+		return idle_core;
+
+	/*
+	 * Prefer cores instead of packing HT siblings
+	 * and triggering future load balancing.
+	 */
+	if (is_core_idle(cpu))
+		idle_core = cpu;
+#endif
+
+	return idle_core;
+}
+
 /*
- * XXX borrowed from update_sg_lb_stats
+ * Gather all necessary information to make NUMA balancing placement
+ * decisions that are compatible with standard load balancer. This
+ * borrows code and logic from update_sg_lb_stats but sharing a
+ * common implementation is impractical.
  */
 static void update_numa_stats(struct task_numa_env *env,
-			      struct numa_stats *ns, int nid)
+			      struct numa_stats *ns, int nid,
+			      bool find_idle)
 {
-	int cpu;
+	int cpu, idle_core = -1;
 
 	memset(ns, 0, sizeof(*ns));
+	ns->idle_cpu = -1;
+
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
@@ -1553,11 +1598,25 @@ static void update_numa_stats(struct task_numa_env *env,
 		ns->util += cpu_util(cpu);
 		ns->nr_running += rq->cfs.h_nr_running;
 		ns->compute_capacity += capacity_of(cpu);
+
+		if (find_idle && !rq->nr_running && idle_cpu(cpu)) {
+			if (READ_ONCE(rq->numa_migrate_on) ||
+			    !cpumask_test_cpu(cpu, env->p->cpus_ptr))
+				continue;
+
+			if (ns->idle_cpu == -1)
+				ns->idle_cpu = cpu;
+
+			idle_core = numa_idle_core(idle_core, cpu);
+		}
 	}
 
 	ns->weight = cpumask_weight(cpumask_of_node(nid));
 
 	ns->node_type = numa_classify(env->imbalance_pct, ns);
+
+	if (idle_core >= 0)
+		ns->idle_cpu = idle_core;
 }
 
 static void task_numa_assign(struct task_numa_env *env,
@@ -1566,7 +1625,7 @@ static void task_numa_assign(struct task_numa_env *env,
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	/* Bail out if run-queue part of active NUMA balance. */
-	if (xchg(&rq->numa_migrate_on, 1))
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
 		return;
 
 	/*
@@ -1730,19 +1789,39 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 assign:
-	/*
-	 * One idle CPU per node is evaluated for a task numa move.
-	 * Call select_idle_sibling to maybe find a better one.
-	 */
+	/* Evaluate an idle CPU for a task numa move. */
 	if (!cur) {
+		int cpu = env->dst_stats.idle_cpu;
+
+		/* Nothing cached so current CPU went idle since the search. */
+		if (cpu < 0)
+			cpu = env->dst_cpu;
+
 		/*
-		 * select_idle_siblings() uses an per-CPU cpumask that
-		 * can be used from IRQ context.
+		 * If the CPU is no longer truly idle and the previous best CPU
+		 * is, keep using it.
 		 */
-		local_irq_disable();
-		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+		if (!idle_cpu(cpu) && env->best_cpu >= 0 &&
+		    idle_cpu(env->best_cpu)) {
+			cpu = env->best_cpu;
+		}
+
+		/*
+		 * Use select_idle_sibling if the previously found idle CPU is
+		 * not idle any more.
+		 */
+		if (!idle_cpu(cpu)) {
+			/*
+			 * select_idle_siblings() uses an per-CPU cpumask that
+			 * can be used from IRQ context.
+			 */
+			local_irq_disable();
+			cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
-		local_irq_enable();
+			local_irq_enable();
+		}
+
+		env->dst_cpu = cpu;
 	}
 
 	task_numa_assign(env, cur, imp);
@@ -1776,8 +1855,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		imbalance = adjust_numa_imbalance(imbalance, src_running);
 
 		/* Use idle CPU if there is no imbalance */
-		if (!imbalance)
+		if (!imbalance) {
 			maymove = true;
+			if (env->dst_stats.idle_cpu >= 0) {
+				env->dst_cpu = env->dst_stats.idle_cpu;
+				task_numa_assign(env, NULL, 0);
+				return;
+			}
+		}
 	} else {
 		long src_load, dst_load, load;
 		/*
@@ -1850,10 +1935,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env, &env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid, false);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid, true);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1886,7 +1971,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid, true);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Bias swapping tasks based on their preferred node
  2020-02-24  9:52 ` [PATCH 12/13] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Valentin Schneider,
	Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     88cca72c9673e631b63eca7a1dba4a9722a3f414
Gitweb:        https://git.kernel.org/tip/88cca72c9673e631b63eca7a1dba4a9722a3f414
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:22 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:39 +01:00

sched/numa: Bias swapping tasks based on their preferred node

When swapping tasks for NUMA balancing, it is preferred that tasks move
to or remain on their preferred node. When considering an imbalance,
encourage tasks to move to their preferred node and discourage tasks from
moving away from their preferred node.
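
A small worked example of the bias arithmetic (the -1/16 and +1/8
adjustments come from the hunk below; the base importance value is made up):

#include <stdio.h>

int main(void)
{
	long imp = 1600;	/* hypothetical fault differential */

	/* candidate already sits on its preferred node: discourage picking it */
	long discouraged = imp - imp / 16;	/* 1600 - 100 = 1500 */

	/* candidate would move to its preferred node: encourage picking it */
	long encouraged = imp + imp / 8;	/* 1600 + 200 = 1800 */

	printf("base=%ld discouraged=%ld encouraged=%ld\n",
	       imp, discouraged, encouraged);
	return 0;
}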

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-13-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 050c1b1..8c1ac01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1741,18 +1741,27 @@ static void task_numa_compare(struct task_numa_env *env,
 			goto unlock;
 	}
 
+	/* Skip this swap candidate if cannot move to the source cpu. */
+	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
+		goto unlock;
+
+	/*
+	 * Skip this swap candidate if it is not moving to its preferred
+	 * node and the best task is.
+	 */
+	if (env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid &&
+	    cur->numa_preferred_nid != env->src_nid) {
+		goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
 	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
-	 */
-	/* Skip this swap candidate if cannot move to the source cpu */
-	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
-		goto unlock;
-
-	/*
+	 *
 	 * If dst and source tasks are in the same NUMA group, or not
 	 * in any group then look only at task weights.
 	 */
@@ -1779,6 +1788,19 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
+	/* Discourage picking a task already on its preferred node */
+	if (cur->numa_preferred_nid == env->dst_nid)
+		imp -= imp / 16;
+
+	/*
+	 * Encourage picking a task that moves to its preferred node.
+	 * This potentially makes imp larger than it's maximum of
+	 * 1998 (see SMALLIMP and task_weight for why) but in this
+	 * case, it does not matter.
+	 */
+	if (cur->numa_preferred_nid == env->src_nid)
+		imp += imp / 8;
+
 	if (maymove && moveimp > imp && moveimp > env->best_imp) {
 		imp = moveimp;
 		cur = NULL;
@@ -1786,6 +1808,15 @@ static void task_numa_compare(struct task_numa_env *env,
 	}
 
 	/*
+	 * Prefer swapping with a task moving to its preferred node over a
+	 * task that is not.
+	 */
+	if (env->best_task && cur->numa_preferred_nid == env->src_nid &&
+	    env->best_task->numa_preferred_nid != env->src_nid) {
+		goto assign;
+	}
+
+	/*
 	 * If the NUMA importance is less than SMALLIMP,
 	 * task migration might only result in ping pong
 	 * of tasks and also hurt performance due to cache

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance
  2020-02-24  9:52 ` [PATCH 11/13] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Valentin Schneider,
	Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5fb52dd93a2fe9a738f730de9da108bd1f6c30d0
Gitweb:        https://git.kernel.org/tip/5fb52dd93a2fe9a738f730de9da108bd1f6c30d0
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:21 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:39 +01:00

sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance

Multiple tasks can attempt to select an idle CPU but fail because
numa_migrate_on is already set and the migration fails. Instead of failing,
scan for an alternative idle CPU. select_idle_sibling is not used because
it requires IRQs to be disabled and it ignores numa_migrate_on, allowing
multiple tasks to stack. This scan may still fail due to races even when
idle candidate CPUs exist but, if that occurs, it is better for a task to
stay on an available CPU than to move to a contended one.
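
A minimal, self-contained sketch of the fallback scan in plain C (the arrays
below stand in for runqueues and cpumasks; the real code wraps around the
node with for_each_cpu_wrap() and claims a runqueue with xchg() on
numa_migrate_on, as in the hunk that follows):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static bool idle[NR_CPUS]    = { true, true, false, true };
static bool claimed[NR_CPUS] = { false, true, false, false };	/* CPU 1 taken */

static bool try_claim(int cpu)
{
	if (claimed[cpu])
		return false;
	claimed[cpu] = true;	/* stands in for xchg(&rq->numa_migrate_on, 1) */
	return true;
}

int main(void)
{
	int dst = 1;	/* first choice, already part of another NUMA balance */

	if (!try_claim(dst)) {
		for (int i = 1; i < NR_CPUS; i++) {
			int cpu = (dst + i) % NR_CPUS;	/* wrap around the node */

			if (!idle[cpu])
				continue;
			if (try_claim(cpu)) {
				dst = cpu;
				break;
			}
		}
	}
	printf("migrating to CPU %d\n", dst);
	return 0;
}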

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-12-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2da21f4..050c1b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1624,15 +1624,34 @@ static void task_numa_assign(struct task_numa_env *env,
 {
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
-	/* Bail out if run-queue part of active NUMA balance. */
-	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
+	/* Check if run-queue part of active NUMA balance. */
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) {
+		int cpu;
+		int start = env->dst_cpu;
+
+		/* Find alternative idle CPU. */
+		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) {
+			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
+			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+				continue;
+			}
+
+			env->dst_cpu = cpu;
+			rq = cpu_rq(env->dst_cpu);
+			if (!xchg(&rq->numa_migrate_on, 1))
+				goto assign;
+		}
+
+		/* Failed to find an alternative idle CPU */
 		return;
+	}
 
+assign:
 	/*
 	 * Clear previous best_cpu/rq numa-migrate flag, since task now
 	 * found a better CPU to move/swap.
 	 */
-	if (env->best_cpu != -1) {
+	if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
 	}
@@ -1806,21 +1825,6 @@ assign:
 			cpu = env->best_cpu;
 		}
 
-		/*
-		 * Use select_idle_sibling if the previously found idle CPU is
-		 * not idle any more.
-		 */
-		if (!idle_cpu(cpu)) {
-			/*
-			 * select_idle_siblings() uses an per-CPU cpumask that
-			 * can be used from IRQ context.
-			 */
-			local_irq_disable();
-			cpu = select_idle_sibling(env->p, env->src_cpu,
-						   env->dst_cpu);
-			local_irq_enable();
-		}
-
 		env->dst_cpu = cpu;
 	}
 

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/pelt: Remove unused runnable load average
  2020-02-24  9:52 ` [PATCH 07/13] sched/pelt: Remove unused runnable load average Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Vincent Guittot
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Mel Gorman, Ingo Molnar, Dietmar Eggemann,
	Peter Zijlstra, Juri Lelli, Valentin Schneider, Phil Auld,
	Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0dacee1bfa70e171be3a12a30414c228453048d2
Gitweb:        https://git.kernel.org/tip/0dacee1bfa70e171be3a12a30414c228453048d2
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 24 Feb 2020 09:52:17 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:36 +01:00

sched/pelt: Remove unused runnable load average

Now that runnable_load_avg is no longer used, we can remove it to make
space for a new signal.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net
---
 include/linux/sched.h |   5 +--
 kernel/sched/core.c   |   2 +-
 kernel/sched/debug.c  |   8 +---
 kernel/sched/fair.c   | 130 +++++------------------------------------
 kernel/sched/pelt.c   |  62 +++++++-------------
 kernel/sched/sched.h  |   7 +--
 6 files changed, 43 insertions(+), 171 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0427849..037eaff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -357,7 +357,7 @@ struct util_est {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
@@ -401,11 +401,9 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
-	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -449,7 +447,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e94819d..8e6f380 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,7 +761,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -774,7 +773,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 879d3cc..cfecaad 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -402,11 +402,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,11 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
-	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -947,13 +942,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
-	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a3c66f..b0fb3d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -741,9 +741,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+		sa->load_avg = scale_load_down(se->load.weight);
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2899,25 +2897,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_SMP
 static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight += se->runnable_weight;
-
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
-}
-
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	cfs_rq->runnable_weight -= se->runnable_weight;
-
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
-}
-
-static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	cfs_rq->avg.load_avg += se->avg.load_avg;
@@ -2932,28 +2911,22 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 #else
 static inline void
-enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
-dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
 		account_entity_dequeue(cfs_rq, se);
-		dequeue_runnable_load_avg(cfs_rq, se);
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2961,16 +2934,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
+	if (se->on_rq)
 		account_entity_enqueue(cfs_rq, se);
-		enqueue_runnable_load_avg(cfs_rq, se);
-	}
+
 }
 
 void reweight_task(struct task_struct *p, int prio)
@@ -2980,7 +2950,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -3092,50 +3062,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	 */
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
-
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3147,7 +3073,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
 	if (!gcfs_rq)
 		return;
@@ -3156,16 +3082,15 @@ static void update_cfs_group(struct sched_entity *se)
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3290,11 +3215,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial and simply copies the running
- * sum over (but still wrong, because the group entity and group rq do not have
- * their PELT windows aligned).
+ * Per the above update_tg_cfs_util() is trivial  * and simply copies the
+ * running sum over (but still wrong, because the group entity and group rq do
+ * not have their PELT windows aligned).
  *
- * However, update_tg_cfs_runnable() is more complex. So we have:
+ * However, update_tg_cfs_load() is more complex. So we have:
  *
  *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
  *
@@ -3375,11 +3300,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 }
 
 static inline void
-update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3427,20 +3352,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-
-	if (se->on_rq) {
-		delta_sum = runnable_load_sum -
-				se_weight(se) * se->avg.runnable_load_sum;
-		delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3468,7 +3379,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
-	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
 	trace_pelt_se_tp(se);
@@ -3612,8 +3523,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
@@ -4074,14 +3983,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	update_cfs_group(se);
-	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4158,13 +4065,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
-	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -7649,9 +7554,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b7..3eb0ed3 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,8 +121,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -148,8 +146,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -186,7 +182,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, int running)
 {
 	u64 delta;
 
@@ -222,7 +218,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		runnable = running = 0;
+		running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -231,14 +227,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+___update_load_avg(struct sched_avg *sa, unsigned long load)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -246,7 +242,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -254,17 +249,13 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  * sched_entity:
  *
  *   task:
- *     se_runnable() == se_weight()
+ *     se_weight()   = se->load.weight
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
  *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
- *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *   load_sum := runnable
+ *   load_avg = se_weight(se) * load_sum
  *
  * XXX collapse load_sum and runnable_load_sum
  *
@@ -272,15 +263,12 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+	if (___update_load_sum(now, &se->avg, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
 	}
@@ -290,10 +278,9 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
 		trace_pelt_se_tp(se);
 		return 1;
@@ -306,10 +293,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1, 1);
+		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
 		return 1;
 	}
@@ -322,9 +308,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
  *
- *   load_avg and runnable_load_avg are not supported and meaningless.
+ *   load_avg is not supported and meaningless.
  *
  */
 
@@ -332,10 +318,9 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
 				running,
-				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1, 1);
+		___update_load_avg(&rq->avg_rt, 1);
 		trace_pelt_rt_tp(rq);
 		return 1;
 	}
@@ -348,7 +333,9 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
@@ -356,10 +343,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
 				running,
-				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1, 1);
+		___update_load_avg(&rq->avg_dl, 1);
 		trace_pelt_dl_tp(rq);
 		return 1;
 	}
@@ -373,7 +359,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
+ *   runnable_sum = util_sum
+ *
+ *   load_avg is not supported and meaningless.
  *
  */
 
@@ -402,15 +390,13 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
 				0,
-				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
 				1,
-				1,
 				1);
 
 	if (ret) {
-		___update_load_avg(&rq->avg_irq, 1, 1);
+		___update_load_avg(&rq->avg_irq, 1);
 		trace_pelt_irq_tp(rq);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12bf82d..ce27e58 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -489,7 +489,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
@@ -688,8 +687,10 @@ struct dl_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
+
 #else
 #define entity_is_task(se)	1
+
 #endif
 
 #ifdef CONFIG_SMP
@@ -701,10 +702,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
 
 static inline bool sched_asym_prefer(int a, int b)
 {

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/pelt: Add a new runnable average signal
  2020-02-24  9:52 ` [PATCH 08/13] sched/pelt: Add a new runnable average signal Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Vincent Guittot
  2020-02-24 16:01     ` Valentin Schneider
  0 siblings, 1 reply; 86+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Mel Gorman, Ingo Molnar, Dietmar Eggemann,
	Peter Zijlstra, Juri Lelli, Valentin Schneider, Phil Auld,
	Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9f68395333ad7f5bfe2f83473fed363d4229f11c
Gitweb:        https://git.kernel.org/tip/9f68395333ad7f5bfe2f83473fed363d4229f11c
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 24 Feb 2020 09:52:18 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:36 +01:00

sched/pelt: Add a new runnable average signal

Now that runnable_load_avg has been removed, we can replace it with a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
tracks the waiting time of tasks on a rq and can help to better define the
state of rqs.

Currently, only util_avg is used to define the state of a rq:
  A rq with more than around 80% utilization and more than one task is
  considered overloaded.

But the util_avg signal of a rq can become temporarily low after a task
has migrated onto another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time, and we can use
this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply
adds the waiting time to the running time. The runnable_avg of a cfs_rq
will be the \Sum of the se's runnable_avg, and the runnable_avg of a group
entity will follow the one of the rq, similarly to util_avg.
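
A small worked example of why the new signal helps (numbers are made up;
only the definitions of runnable% and running% above are assumed): with two
always-runnable tasks sharing one CPU, each runs about half the time but is
runnable all of the time, so the summed runnable_avg exposes contention that
util_avg, capped near capacity, cannot.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

int main(void)
{
	/* two tasks, each running half the time and waiting the other half */
	long running_pct[2]  = { 50, 50 };
	long runnable_pct[2] = { 100, 100 };

	long util = 0, runnable = 0;
	for (int i = 0; i < 2; i++) {
		util     += running_pct[i]  * SCHED_CAPACITY_SCALE / 100;
		runnable += runnable_pct[i] * SCHED_CAPACITY_SCALE / 100;
	}

	printf("util_avg ~= %ld (about one CPU), runnable_avg ~= %ld (about two CPUs)\n",
	       util, runnable);
	return 0;
}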

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net
---
 include/linux/sched.h | 26 ++++++++------
 kernel/sched/debug.c  |  9 +++--
 kernel/sched/fair.c   | 77 ++++++++++++++++++++++++++++++++++++++----
 kernel/sched/pelt.c   | 39 +++++++++++++++------
 kernel/sched/sched.h  | 22 +++++++++++-
 5 files changed, 142 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 037eaff..2e9199b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,28 +356,30 @@ struct util_est {
 } __attribute__((__aligned__(sizeof(u64))));
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series
+ * The load/runnable/util_avg accumulates an infinite geometric series
  * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
  *   load_avg = runnable% * scale_load_down(load)
  *
- * where runnable% is the time ratio that a sched_entity is runnable.
- * For cfs_rq, it is the aggregated load_avg of all runnable and
- * blocked sched_entities.
+ * [runnable_avg definition]
+ *
+ *   runnable_avg = runnable% * SCHED_CAPACITY_SCALE
  *
  * [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
  *
- * where running% is the time ratio that a sched_entity is running on
- * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
- * and blocked sched_entities.
+ * where runnable% is the time ratio that a sched_entity is runnable and
+ * running% the time ratio that a sched_entity is running.
+ *
+ * For cfs_rq, they are the aggregated values of all runnable and blocked
+ * sched_entities.
  *
- * load_avg and util_avg don't direcly factor frequency scaling and CPU
- * capacity scaling. The scaling is done through the rq_clock_pelt that
- * is used for computing those signals (see update_rq_clock_pelt())
+ * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that is used
+ * for computing those signals (see update_rq_clock_pelt())
  *
  * N.B., the above ratios (runnable% and running%) themselves are in the
  * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
@@ -401,9 +403,11 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
@@ -467,6 +471,8 @@ struct sched_entity {
 	struct cfs_rq			*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq			*my_q;
+	/* cached value of my_q->h_nr_running */
+	unsigned long			runnable_weight;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cfecaad..8331bc0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -405,6 +405,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
+	P(se->avg.runnable_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -524,6 +525,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_avg",
+			cfs_rq->avg.runnable_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
+	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_avg",
+			cfs_rq->removed.runnable_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
@@ -944,8 +947,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
+	P(se.avg.runnable_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
+	P(se.avg.runnable_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0fb3d6..49b36d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -794,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 		}
 	}
 
+	sa->runnable_avg = cpu_scale;
+
 	if (p->sched_class != &fair_sched_class) {
 		/*
 		 * For !fair tasks do:
@@ -3215,9 +3217,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- * Per the above update_tg_cfs_util() is trivial  * and simply copies the
- * running sum over (but still wrong, because the group entity and group rq do
- * not have their PELT windows aligned).
+ * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial
+ * and simply copies the running/runnable sum over (but still wrong, because
+ * the group entity and group rq do not have their PELT windows aligned).
  *
  * However, update_tg_cfs_load() is more complex. So we have:
  *
@@ -3300,6 +3302,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 }
 
 static inline void
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
+{
+	long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
+	/* Set new sched_entity's runnable */
+	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
+	se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq runnable */
+	add_positive(&cfs_rq->avg.runnable_avg, delta);
+	cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX;
+}
+
+static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
@@ -3379,6 +3407,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
 	update_tg_cfs_util(cfs_rq, se, gcfs_rq);
+	update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
 	update_tg_cfs_load(cfs_rq, se, gcfs_rq);
 
 	trace_pelt_cfs_tp(cfs_rq);
@@ -3449,7 +3478,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3460,7 +3489,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
+		swap(cfs_rq->removed.runnable_avg, removed_runnable);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3472,7 +3501,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
+		r = removed_runnable;
+		sub_positive(&sa->runnable_avg, r);
+		sub_positive(&sa->runnable_sum, r * divider);
+
+		/*
+		 * removed_runnable is the unweighted version of removed_load so we
+		 * can use it to estimate removed_load_sum.
+		 */
+		add_tg_cfs_propagate(cfs_rq,
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
 
 		decayed = 1;
 	}
@@ -3517,6 +3555,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 */
 	se->avg.util_sum = se->avg.util_avg * divider;
 
+	se->avg.runnable_sum = se->avg.runnable_avg * divider;
+
 	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
@@ -3526,6 +3566,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
+	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
+	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;
 
 	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
 
@@ -3547,6 +3589,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	dequeue_load_avg(cfs_rq, se);
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+	sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
+	sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
 
 	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
 
@@ -3653,10 +3697,15 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	++cfs_rq->removed.nr;
 	cfs_rq->removed.util_avg	+= se->avg.util_avg;
 	cfs_rq->removed.load_avg	+= se->avg.load_avg;
-	cfs_rq->removed.runnable_sum	+= se->avg.load_sum; /* == runnable_sum */
+	cfs_rq->removed.runnable_avg	+= se->avg.runnable_avg;
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
+static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->avg.runnable_avg;
+}
+
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;
@@ -3983,11 +4032,13 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
+	se_update_runnable(se);
 	update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -4065,11 +4116,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Subtract its load from the cfs_rq->runnable_avg.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG);
+	se_update_runnable(se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
 
@@ -5240,6 +5293,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running++;
@@ -5337,6 +5391,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
+		se_update_runnable(se);
 		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
@@ -5409,6 +5464,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
 	return load;
 }
 
+static unsigned long cpu_runnable(struct rq *rq)
+{
+	return cfs_rq_runnable_avg(&rq->cfs);
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7554,6 +7614,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
+	if (cfs_rq->avg.runnable_sum)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3eb0ed3..c40d57a 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, int running)
+	       unsigned long load, unsigned long runnable, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -121,6 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_sum =
+			decay_load(sa->runnable_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -146,6 +148,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -182,7 +186,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, int running)
+		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
 
@@ -218,7 +222,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Also see the comment in accumulate_sum().
 	 */
 	if (!load)
-		running = 0;
+		runnable = running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -227,7 +231,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, running))
+	if (!accumulate_sum(delta, sa, load, runnable, running))
 		return 0;
 
 	return 1;
@@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_avg = div_u64(sa->runnable_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -250,24 +255,30 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
  *
  *   task:
  *     se_weight()   = se->load.weight
+ *     se_runnable() = !!on_rq
  *
  *   group: [ see update_cfs_group() ]
  *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = grq->h_nr_running
+ *
+ *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
+ *   runnable_avg = runnable_sum
  *
  *   load_sum := runnable
  *   load_avg = se_weight(se) * load_sum
  *
- * XXX collapse load_sum and runnable_load_sum
- *
  * cfq_rq:
  *
+ *   runnable_sum = \Sum se->avg.runnable_sum
+ *   runnable_avg = \Sum se->avg.runnable_avg
+ *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0)) {
+	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
 		___update_load_avg(&se->avg, se_weight(se));
 		trace_pelt_se_tp(se);
 		return 1;
@@ -278,7 +289,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+				cfs_rq->curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
@@ -293,6 +305,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
+				cfs_rq->h_nr_running,
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
@@ -310,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
@@ -318,6 +331,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
 				running,
+				running,
 				running)) {
 
 		___update_load_avg(&rq->avg_rt, 1);
@@ -335,7 +349,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
@@ -343,6 +357,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
 				running,
+				running,
 				running)) {
 
 		___update_load_avg(&rq->avg_dl, 1);
@@ -361,7 +376,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *   util_sum = cpu_scale * load_sum
  *   runnable_sum = util_sum
  *
- *   load_avg is not supported and meaningless.
+ *   load_avg and runnable_avg are not supported and meaningless.
  *
  */
 
@@ -390,9 +405,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 */
 	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
 				0,
+				0,
 				0);
 	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
 				1,
+				1,
 				1);
 
 	if (ret) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce27e58..2a0caf3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -527,7 +527,7 @@ struct cfs_rq {
 		int		nr;
 		unsigned long	load_avg;
 		unsigned long	util_avg;
-		unsigned long	runnable_sum;
+		unsigned long	runnable_avg;
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -688,9 +688,29 @@ struct dl_rq {
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
 
+static inline void se_update_runnable(struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		se->runnable_weight = se->my_q->h_nr_running;
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return !!se->on_rq;
+	else
+		return se->runnable_weight;
+}
+
 #else
 #define entity_is_task(se)	1
 
+static inline void se_update_runnable(struct sched_entity *se) {}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return !!se->on_rq;
+}
 #endif
 
 #ifdef CONFIG_SMP

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/fair: Take into account runnable_avg to classify group
  2020-02-24  9:52 ` [PATCH 09/13] sched/fair: Take into account runnable_avg to classify group Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Vincent Guittot
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Mel Gorman, Ingo Molnar, Dietmar Eggemann,
	Peter Zijlstra, Juri Lelli, Steven Rostedt, Valentin Schneider,
	Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     070f5e860ee2bf588c99ef7b4c202451faa48236
Gitweb:        https://git.kernel.org/tip/070f5e860ee2bf588c99ef7b4c202451faa48236
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 24 Feb 2020 09:52:19 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:37 +01:00

sched/fair: Take into account runnable_avg to classify group

Take into account the new runnable_avg signal to classify a group and to
mitigate the volatility of util_avg in the face of intensive migrations or
new tasks with random utilization.
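
Stripped of the surrounding load-balancer state, the new checks compare a
group's summed runnable time against its capacity scaled by the imbalance
threshold. A minimal userspace sketch of that comparison (the helper name
and the values in main() are made up for illustration; the inequality
mirrors the group_has_capacity()/group_is_overloaded() hunks below):

  #include <stdio.h>
  #include <stdbool.h>

  /* True when runnable time exceeds the capacity allowed by imbalance_pct */
  static bool group_runnable_exceeds_capacity(unsigned long group_capacity,
                                              unsigned long group_runnable,
                                              unsigned int imbalance_pct)
  {
          return (group_capacity * imbalance_pct) < (group_runnable * 100);
  }

  int main(void)
  {
          /* capacity 2048, runnable 2600, illustrative threshold of 117% */
          printf("%d\n", group_runnable_exceeds_capacity(2048, 2600, 117));
          return 0;
  }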

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-10-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49b36d6..87521ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5469,6 +5469,24 @@ static unsigned long cpu_runnable(struct rq *rq)
 	return cfs_rq_runnable_avg(&rq->cfs);
 }
 
+static unsigned long cpu_runnable_without(struct rq *rq, struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq;
+	unsigned int runnable;
+
+	/* Task has no contribution or is new */
+	if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
+		return cpu_runnable(rq);
+
+	cfs_rq = &rq->cfs;
+	runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+
+	/* Discount task's runnable from CPU's runnable */
+	lsub_positive(&runnable, p->se.avg.runnable_avg);
+
+	return runnable;
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -7752,7 +7770,8 @@ struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
 	unsigned long group_capacity;
-	unsigned long group_util; /* Total utilization of the group */
+	unsigned long group_util; /* Total utilization over the CPUs of the group */
+	unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
 	unsigned int sum_nr_running; /* Nr of tasks running in the group */
 	unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
 	unsigned int idle_cpus;
@@ -7973,6 +7992,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	if (sgs->sum_nr_running < sgs->group_weight)
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return false;
+
 	if ((sgs->group_capacity * 100) >
 			(sgs->group_util * imbalance_pct))
 		return true;
@@ -7998,6 +8021,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 			(sgs->group_util * imbalance_pct))
 		return true;
 
+	if ((sgs->group_capacity * imbalance_pct) <
+			(sgs->group_runnable * 100))
+		return true;
+
 	return false;
 }
 
@@ -8092,6 +8119,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util(i);
+		sgs->group_runnable += cpu_runnable(rq);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
@@ -8367,6 +8395,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 		sgs->group_load += cpu_load_without(rq, p);
 		sgs->group_util += cpu_util_without(i, p);
+		sgs->group_runnable += cpu_runnable_without(rq, p);
 		local = task_running_on_cpu(i, p);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running - local;
 

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity
  2020-02-24  9:52 ` [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Valentin Schneider, Phil Auld,
	Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fb86f5b2119245afd339280099b4e9417cc0b03a
Gitweb:        https://git.kernel.org/tip/fb86f5b2119245afd339280099b4e9417cc0b03a
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:16 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:35 +01:00

sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity

The standard load balancer generally tries to keep the number of running
tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
tasks to move if there is spare capacity, but this conflicts with the load
balancer's decisions and utilisation between NUMA nodes gets badly skewed.
This patch uses similar logic in the NUMA balancer and the load balancer
when deciding whether a task migrating to its preferred node can use an
idle CPU.
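
The spare-capacity path added to task_numa_find_cpu() boils down to a
small piece of arithmetic on task counts. The sketch below is plain
userspace C with invented example values; adjust_numa_imbalance() follows
the helper introduced further down in the diff:

  #include <stdio.h>

  static long adjust_numa_imbalance(int imbalance, int src_nr_running)
  {
          unsigned int imbalance_min = 2;

          /* Allow a small imbalance when the source domain is almost idle */
          if (src_nr_running <= imbalance_min)
                  return 0;

          return imbalance;
  }

  int main(void)
  {
          int src_nr_running = 3, dst_nr_running = 2;

          /* Pretend the task moves: source loses a runner, destination gains one */
          int src_running = src_nr_running - 1;
          int dst_running = dst_nr_running + 1;
          int imbalance = dst_running - src_running;

          if (imbalance < 0)
                  imbalance = 0;

          /* 0 means an idle CPU on the destination node may be used */
          printf("imbalance = %ld\n", adjust_numa_imbalance(imbalance, src_running));
          return 0;
  }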

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-7-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 81 +++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc3d651..7a3c66f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1520,6 +1520,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
-
-	/* If dst node has spare capacity, there is no real load imbalance */
-	if (env->dst_stats.node_type == node_has_spare)
-		return false;
-
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1757,19 +1753,42 @@ unlock:
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
-	long src_load, dst_load, load;
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
-
 	/*
-	 * If the improvement from just moving env->p direction is better
-	 * than swapping tasks around, check if a move is possible.
+	 * If dst node has spare capacity, then check if there is an
+	 * imbalance that would be overruled by the load balancer.
 	 */
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
+	if (env->dst_stats.node_type == node_has_spare) {
+		unsigned int imbalance;
+		int src_running, dst_running;
+
+		/*
+		 * Would movement cause an imbalance? Note that if src has
+		 * more running tasks that the imbalance is ignored as the
+		 * move improves the imbalance from the perspective of the
+		 * CPU load balancer.
+		 * */
+		src_running = env->src_stats.nr_running - 1;
+		dst_running = env->dst_stats.nr_running + 1;
+		imbalance = max(0, dst_running - src_running);
+		imbalance = adjust_numa_imbalance(imbalance, src_running);
+
+		/* Use idle CPU if there is no imbalance */
+		if (!imbalance)
+			maymove = true;
+	} else {
+		long src_load, dst_load, load;
+		/*
+		 * If the improvement from just moving env->p direction is better
+		 * than swapping tasks around, check if a move is possible.
+		 */
+		load = task_h_load(env->p);
+		dst_load = env->dst_stats.load + load;
+		src_load = env->src_stats.load - load;
+		maymove = !load_too_imbalanced(src_load, dst_load, env);
+	}
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
@@ -8694,6 +8713,21 @@ next_group:
 	}
 }
 
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+{
+	unsigned int imbalance_min;
+
+	/*
+	 * Allow a small imbalance based on a simple pair of communicating
+	 * tasks that remain local when the source domain is almost idle.
+	 */
+	imbalance_min = 2;
+	if (src_nr_running <= imbalance_min)
+		return 0;
+
+	return imbalance;
+}
+
 /**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
@@ -8790,24 +8824,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA) {
-			unsigned int imbalance_min;
-
-			/*
-			 * Compute an allowed imbalance based on a simple
-			 * pair of communicating tasks that should remain
-			 * local and ignore them.
-			 *
-			 * NOTE: Generally this would have been based on
-			 * the domain size and this was evaluated. However,
-			 * the benefit is similar across a range of workloads
-			 * and machines but scaling by the domain size adds
-			 * the risk that lower domains have to be rebalanced.
-			 */
-			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
-				env->imbalance = 0;
-		}
+		if (env->sd->flags & SD_NUMA)
+			env->imbalance = adjust_numa_imbalance(env->imbalance,
+						busiest->sum_nr_running);
 
 		return;
 	}

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Replace runnable_load_avg by load_avg
  2020-02-24  9:52 ` [PATCH 05/13] sched/numa: Replace runnable_load_avg by load_avg Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Vincent Guittot
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Mel Gorman, Ingo Molnar, Dietmar Eggemann,
	Peter Zijlstra, Juri Lelli, Valentin Schneider, Phil Auld,
	Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6499b1b2dd1b8d404a16b9fbbf1af6b9b3c1d83d
Gitweb:        https://git.kernel.org/tip/6499b1b2dd1b8d404a16b9fbbf1af6b9b3c1d83d
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 24 Feb 2020 09:52:15 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:34 +01:00

sched/numa: Replace runnable_load_avg by load_avg

Similarly to what has been done for the normal load balancer, we can
replace runnable_load_avg by load_avg in NUMA load balancing and track
other statistics, such as the utilization and the number of running tasks,
to get a better view of the current state of a node.
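
The node classification that replaces the old runnable_load_avg comparison
is a pair of threshold checks on the new statistics. A trimmed, standalone
sketch (the struct is cut down and the values in main() are invented; the
comparisons follow numa_classify() in the diff below):

  #include <stdio.h>

  enum numa_type { node_has_spare = 0, node_fully_busy, node_overloaded };

  struct numa_stats {
          unsigned long util;             /* summed cpu_util() of the node */
          unsigned long compute_capacity; /* summed capacity_of() of the node */
          unsigned int nr_running;        /* summed rq->cfs.h_nr_running */
          unsigned int weight;            /* number of CPUs in the node */
  };

  static enum numa_type numa_classify(unsigned int imbalance_pct,
                                      struct numa_stats *ns)
  {
          if ((ns->nr_running > ns->weight) &&
              ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
                  return node_overloaded;

          if ((ns->nr_running < ns->weight) ||
              ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
                  return node_has_spare;

          return node_fully_busy;
  }

  int main(void)
  {
          struct numa_stats ns = { .util = 1500, .compute_capacity = 4096,
                                   .nr_running = 3, .weight = 4 };

          printf("%d\n", numa_classify(117, &ns)); /* prints 0: node_has_spare */
          return 0;
  }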

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-6-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++--------------
 1 file changed, 70 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6c7f8b..bc3d651 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
+/*
+ * 'numa_type' describes the node at the moment of load balancing.
+ */
+enum numa_type {
+	/* The node has spare capacity that can be used to run more tasks.  */
+	node_has_spare = 0,
+	/*
+	 * The node is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
+	node_fully_busy,
+	/*
+	 * The node is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
+	node_overloaded
+};
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
 	unsigned long load;
-
+	unsigned long util;
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
+	unsigned int nr_running;
+	unsigned int weight;
+	enum numa_type node_type;
 };
 
-/*
- * XXX borrowed from update_sg_lb_stats
- */
-static void update_numa_stats(struct numa_stats *ns, int nid)
-{
-	int cpu;
-
-	memset(ns, 0, sizeof(*ns));
-	for_each_cpu(cpu, cpumask_of_node(nid)) {
-		struct rq *rq = cpu_rq(cpu);
-
-		ns->load += cpu_runnable_load(rq);
-		ns->compute_capacity += capacity_of(cpu);
-	}
-
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 
@@ -1521,6 +1518,47 @@ struct task_numa_env {
 	int best_cpu;
 };
 
+static unsigned long cpu_load(struct rq *rq);
+static unsigned long cpu_util(int cpu);
+
+static inline enum
+numa_type numa_classify(unsigned int imbalance_pct,
+			 struct numa_stats *ns)
+{
+	if ((ns->nr_running > ns->weight) &&
+	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
+		return node_overloaded;
+
+	if ((ns->nr_running < ns->weight) ||
+	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
+		return node_has_spare;
+
+	return node_fully_busy;
+}
+
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct task_numa_env *env,
+			      struct numa_stats *ns, int nid)
+{
+	int cpu;
+
+	memset(ns, 0, sizeof(*ns));
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		ns->load += cpu_load(rq);
+		ns->util += cpu_util(cpu);
+		ns->nr_running += rq->cfs.h_nr_running;
+		ns->compute_capacity += capacity_of(cpu);
+	}
+
+	ns->weight = cpumask_weight(cpumask_of_node(nid));
+
+	ns->node_type = numa_classify(env->imbalance_pct, ns);
+}
+
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
@@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
 
+
+	/* If dst node has spare capacity, there is no real load imbalance */
+	if (env->dst_stats.node_type == node_has_spare)
+		return false;
+
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
 	 *
@@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env, &env.src_stats, env.src_nid);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env.dst_stats, env.dst_nid);
+	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env.dst_stats, env.dst_nid);
+			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
@@ -3686,11 +3729,6 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->avg.runnable_load_avg;
-}
-
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 {
 	return cfs_rq->avg.load_avg;

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Distinguish between the different task_numa_migrate() failure cases
  2020-02-24  9:52 ` [PATCH 03/13] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b2b2042b204796190af7c20069ab790a614c36d0
Gitweb:        https://git.kernel.org/tip/b2b2042b204796190af7c20069ab790a614c36d0
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:13 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:33 +01:00

sched/numa: Distinguish between the different task_numa_migrate() failure cases

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node, but from the trace it is not possible to tell the
difference between "no CPU found", "migration to idle CPU failed" and
"tasks could not be swapped". Extend the tracepoint accordingly.

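With the extra fields, the three cases can be told apart when
post-processing the trace. A small sketch of such a decoder (the helper is
illustrative; the field mapping follows the TP_fast_assign() change below,
where a NULL destination task is reported as dst_pid=0):

  #include <stdio.h>

  static const char *stick_numa_reason(int dst_pid, int dst_cpu)
  {
          if (dst_cpu < 0)
                  return "no CPU found";
          if (dst_pid == 0)
                  return "migration to idle CPU failed";
          return "tasks could not be swapped";
  }

  int main(void)
  {
          /* Example field values as they might appear in the trace output */
          printf("%s\n", stick_numa_reason(0, -1));
          printf("%s\n", stick_numa_reason(0, 12));
          printf("%s\n", stick_numa_reason(4321, 12));
          return 0;
  }
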
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-4-mgorman@techsingularity.net
---
 include/trace/events/sched.h | 49 +++++++++++++++++++----------------
 kernel/sched/fair.c          |  6 ++--
 2 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e..9c3ebb7 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -487,7 +487,11 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
-DECLARE_EVENT_CLASS(sched_move_task_template,
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes.
+ */
+TRACE_EVENT(sched_move_numa,
 
 	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
 
@@ -519,23 +523,7 @@ DECLARE_EVENT_CLASS(sched_move_task_template,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
-/*
- * Tracks migration of tasks from one runqueue to another. Can be used to
- * detect if automatic NUMA balancing is bouncing between nodes
- */
-DEFINE_EVENT(sched_move_task_template, sched_move_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
-
-	TP_ARGS(tsk, src_cpu, dst_cpu)
-);
-
-DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
-
-	TP_ARGS(tsk, src_cpu, dst_cpu)
-);
-
-TRACE_EVENT(sched_swap_numa,
+DECLARE_EVENT_CLASS(sched_numa_pair_template,
 
 	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
 		 struct task_struct *dst_tsk, int dst_cpu),
@@ -561,11 +549,11 @@ TRACE_EVENT(sched_swap_numa,
 		__entry->src_ngid	= task_numa_group_id(src_tsk);
 		__entry->src_cpu	= src_cpu;
 		__entry->src_nid	= cpu_to_node(src_cpu);
-		__entry->dst_pid	= task_pid_nr(dst_tsk);
-		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
-		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
+		__entry->dst_pid	= dst_tsk ? task_pid_nr(dst_tsk) : 0;
+		__entry->dst_tgid	= dst_tsk ? task_tgid_nr(dst_tsk) : 0;
+		__entry->dst_ngid	= dst_tsk ? task_numa_group_id(dst_tsk) : 0;
 		__entry->dst_cpu	= dst_cpu;
-		__entry->dst_nid	= cpu_to_node(dst_cpu);
+		__entry->dst_nid	= dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1;
 	),
 
 	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
@@ -575,6 +563,23 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
+DEFINE_EVENT(sched_numa_pair_template, sched_stick_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
+);
+
+DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
+
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
+
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
+);
+
+
 /*
  * Tracepoint for waking a polling cpu without an IPI.
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f524ce3..5d9c23c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1849,7 +1849,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 	/* No better CPU than the current one was found. */
 	if (env.best_cpu == -1) {
-		trace_sched_stick_numa(p, env.src_cpu, -1);
+		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
 		return -EAGAIN;
 	}
 
@@ -1858,7 +1858,7 @@ static int task_numa_migrate(struct task_struct *p)
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
-			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
+			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
 		return ret;
 	}
 
@@ -1866,7 +1866,7 @@ static int task_numa_migrate(struct task_struct *p)
 	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
-		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
+		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
 	put_task_struct(env.best_task);
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/numa: Trace when no candidate CPU was found on the preferred node
  2020-02-24  9:52 ` [PATCH 02/13] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Phil Auld, Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f22aef4afb0d6cc22e408d8254cf6d02d7982ef1
Gitweb:        https://git.kernel.org/tip/f22aef4afb0d6cc22e408d8254cf6d02d7982ef1
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Mon, 24 Feb 2020 09:52:12 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:32 +01:00

sched/numa: Trace when no candidate CPU was found on the preferred node

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node. The case where no candidate CPU could be found is
not traced, which is an important gap. The tracepoint is still not fired
when the task is not allowed to run on any CPU on the preferred node or
when the task is already running on the target CPU, but neither is an
interesting corner case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-3-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f38ff5a..f524ce3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1848,8 +1848,10 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/* No better CPU than the current one was found. */
-	if (env.best_cpu == -1)
+	if (env.best_cpu == -1) {
+		trace_sched_stick_numa(p, env.src_cpu, -1);
 		return -EAGAIN;
+	}
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: sched/core] sched/fair: Reorder enqueue/dequeue_task_fair path
  2020-02-24  9:52 ` [PATCH 04/13] sched/fair: Reorder enqueue/dequeue_task_fair path Mel Gorman
@ 2020-02-24 15:20   ` tip-bot2 for Vincent Guittot
  0 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2020-02-24 15:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Mel Gorman, Ingo Molnar, Dietmar Eggemann,
	Peter Zijlstra, Juri Lelli, Valentin Schneider, Phil Auld,
	Hillf Danton, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6d4d22468dae3d8757af9f8b81b848a76ef4409d
Gitweb:        https://git.kernel.org/tip/6d4d22468dae3d8757af9f8b81b848a76ef4409d
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 24 Feb 2020 09:52:14 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 24 Feb 2020 11:36:34 +01:00

sched/fair: Reorder enqueue/dequeue_task_fair path

The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split into two distinct parts for a throttled cfs_rq, which adds no
value but makes the code less readable.

Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.

In addition, the same ordering of steps is used when updating a cfs_rq:

 - update_load_avg
 - update_cfs_group
 - update *h_nr_running

This reordering enables the use of h_nr_running in the PELT algorithm.

No functional or performance changes are expected, and none were noticed
during tests.
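
As a toy illustration of that ordering (the struct and helpers below are
stand-ins written for this example, not the kernel code), each cfs_rq
visited on the enqueue path is updated like this:

  #include <stdio.h>

  struct toy_cfs_rq {
          unsigned int h_nr_running;
  };

  static void update_load_avg(struct toy_cfs_rq *cfs_rq)  { /* PELT update would go here */ }
  static void update_cfs_group(struct toy_cfs_rq *cfs_rq) { /* weight update would go here */ }

  static void enqueue_one(struct toy_cfs_rq *cfs_rq)
  {
          update_load_avg(cfs_rq);   /* 1. load tracking */
          update_cfs_group(cfs_rq);  /* 2. group weight  */
          cfs_rq->h_nr_running++;    /* 3. running count */
  }

  int main(void)
  {
          struct toy_cfs_rq rq = { 0 };

          enqueue_one(&rq);
          printf("h_nr_running = %u\n", rq.h_nr_running);
          return 0;
  }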

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d9c23c..a6c7f8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5260,32 +5260,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto enqueue_throttle;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto enqueue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 	}
 
+enqueue_throttle:
 	if (!se) {
 		add_nr_running(rq, 1);
 		/*
@@ -5346,17 +5345,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
+		if (cfs_rq_throttled(cfs_rq))
+			goto dequeue_throttle;
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5374,16 +5369,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			break;
+			goto dequeue_throttle;
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		update_cfs_group(se);
+
+		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 	}
 
+dequeue_throttle:
 	if (!se)
 		sub_nr_running(rq, 1);
 

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [tip: sched/core] sched/pelt: Add a new runnable average signal
  2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
@ 2020-02-24 16:01     ` Valentin Schneider
  2020-02-24 16:34       ` Mel Gorman
  2020-02-25  8:23       ` Vincent Guittot
  0 siblings, 2 replies; 86+ messages in thread
From: Valentin Schneider @ 2020-02-24 16:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-tip-commits, Vincent Guittot, Mel Gorman, Ingo Molnar,
	Dietmar Eggemann, Peter Zijlstra, Juri Lelli, Phil Auld,
	Hillf Danton, x86


tip-bot2 for Vincent Guittot writes:

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the fork time initialization thing being sorted out, the rest of the
runnable series can claim my

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

but I doubt any of that is worth the hassle since it's in tip already. Just
figured I'd mention it, being in Cc and all :-)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [tip: sched/core] sched/pelt: Add a new runnable average signal
  2020-02-24 16:01     ` Valentin Schneider
@ 2020-02-24 16:34       ` Mel Gorman
  2020-02-25  8:23       ` Vincent Guittot
  1 sibling, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-24 16:34 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-tip-commits, Vincent Guittot, Ingo Molnar,
	Dietmar Eggemann, Peter Zijlstra, Juri Lelli, Phil Auld,
	Hillf Danton, x86

On Mon, Feb 24, 2020 at 04:01:04PM +0000, Valentin Schneider wrote:
> 
> tip-bot2 for Vincent Guittot writes:
> 
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> With the fork time initialization thing being sorted out, the rest of the
> runnable series can claim my
> 
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> but I doubt any of that is worth the hassle since it's in tip already. Just
> figured I'd mention it, being in Cc and all :-)

Whether the tag gets included or not, it's nice to have definite
confirmation that you're ok with this version!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [tip: sched/core] sched/pelt: Add a new runnable average signal
  2020-02-24 16:01     ` Valentin Schneider
  2020-02-24 16:34       ` Mel Gorman
@ 2020-02-25  8:23       ` Vincent Guittot
  1 sibling, 0 replies; 86+ messages in thread
From: Vincent Guittot @ 2020-02-25  8:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, linux-tip-commits, Mel Gorman, Ingo Molnar,
	Dietmar Eggemann, Peter Zijlstra, Juri Lelli, Phil Auld,
	Hillf Danton, x86

On Mon, 24 Feb 2020 at 17:01, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> tip-bot2 for Vincent Guittot writes:
>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> With the fork time initialization thing being sorted out, the rest of the
> runnable series can claim my
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

Thanks

>
> but I doubt any of that is worth the hassle since it's in tip already. Just
> figured I'd mention it, being in Cc and all :-)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-24 15:16 ` [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Ingo Molnar
@ 2020-02-25 11:59   ` Mel Gorman
  2020-02-25 13:28     ` Vincent Guittot
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-02-25 11:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML

On Mon, Feb 24, 2020 at 04:16:41PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > The only differences in V6 are due to Vincent's latest patch series.
> > 
> > This is V5 which includes the latest versions of Vincent's patch
> > addressing review feedback. Patches 4-9 are Vincent's work plus one
> > important performance fix. Vincent's patches were retested and while
> > not presented in detail, it was mostly an improvement.
> > 
> > Changelog since V5:
> > o Import Vincent's latest patch set
> 
> >  include/linux/sched.h        |  31 ++-
> >  include/trace/events/sched.h |  49 ++--
> >  kernel/sched/core.c          |  13 -
> >  kernel/sched/debug.c         |  17 +-
> >  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
> >  kernel/sched/pelt.c          |  59 ++--
> >  kernel/sched/sched.h         |  42 ++-
> >  7 files changed, 535 insertions(+), 302 deletions(-)
> 
> Applied to tip:sched/core for v5.7 inclusion, thanks Mel and Vincent!
> 

Thanks!

> Please base future iterations on top of a0f03b617c3b (current 
> sched/core).
> 

Will do.

However I noticed that "sched/fair: Fix find_idlest_group() to handle
CPU affinity" did not make it to tip/sched/core. Peter seemed to think it
was fine. Was it rejected or is it just sitting in Peter's queue somewhere?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-25 11:59   ` Mel Gorman
@ 2020-02-25 13:28     ` Vincent Guittot
  2020-02-25 14:24       ` Mel Gorman
  0 siblings, 1 reply; 86+ messages in thread
From: Vincent Guittot @ 2020-02-25 13:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML

On Tue, 25 Feb 2020 at 12:59, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Feb 24, 2020 at 04:16:41PM +0100, Ingo Molnar wrote:
> >
> > * Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > > The only differences in V6 are due to Vincent's latest patch series.
> > >
> > > This is V5 which includes the latest versions of Vincent's patch
> > > addressing review feedback. Patches 4-9 are Vincent's work plus one
> > > important performance fix. Vincent's patches were retested and while
> > > not presented in detail, it was mostly an improvement.
> > >
> > > Changelog since V5:
> > > o Import Vincent's latest patch set
> >
> > >  include/linux/sched.h        |  31 ++-
> > >  include/trace/events/sched.h |  49 ++--
> > >  kernel/sched/core.c          |  13 -
> > >  kernel/sched/debug.c         |  17 +-
> > >  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
> > >  kernel/sched/pelt.c          |  59 ++--
> > >  kernel/sched/sched.h         |  42 ++-
> > >  7 files changed, 535 insertions(+), 302 deletions(-)
> >
> > Applied to tip:sched/core for v5.7 inclusion, thanks Mel and Vincent!
> >
>
> Thanks!
>
> > Please base future iterations on top of a0f03b617c3b (current
> > sched/core).
> >
>
> Will do.
>
> However I noticed that "sched/fair: Fix find_idlest_group() to handle
> CPU affinity" did not make it to tip/sched/core. Peter seemed to think it
> was fine. Was it rejected or is it just sitting in Peter's queue somewhere?

The patch has already reached mainline through tip/sched-urgent-for-linus

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-25 13:28     ` Vincent Guittot
@ 2020-02-25 14:24       ` Mel Gorman
  2020-02-25 14:53         ` Vincent Guittot
  2020-02-27  9:09         ` Ingo Molnar
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-25 14:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML

On Tue, Feb 25, 2020 at 02:28:16PM +0100, Vincent Guittot wrote:
> >
> > Will do.
> >
> > However I noticed that "sched/fair: Fix find_idlest_group() to handle
> > CPU affinity" did not make it to tip/sched/core. Peter seemed to think it
> > was fine. Was it rejected or is it just sitting in Peter's queue somewhere?
> 
> The patch has already reached mainline through tip/sched-urgent-for-linus
> 

Bah, I pasted the wrong subject. I am thinking of your
patch "sched/fair: fix statistics for find_idlest_group" --
https://lore.kernel.org/lkml/20200218144534.4564-1-vincent.guittot@linaro.org/
It still appears to be relevant or did I miss something else?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-25 14:24       ` Mel Gorman
@ 2020-02-25 14:53         ` Vincent Guittot
  2020-02-27  9:09         ` Ingo Molnar
  1 sibling, 0 replies; 86+ messages in thread
From: Vincent Guittot @ 2020-02-25 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML

On Tue, 25 Feb 2020 at 15:24, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Feb 25, 2020 at 02:28:16PM +0100, Vincent Guittot wrote:
> > >
> > > Will do.
> > >
> > > However I noticed that "sched/fair: Fix find_idlest_group() to handle
> > > CPU affinity" did not make it to tip/sched/core. Peter seemed to think it
> > > was fine. Was it rejected or is it just sitting in Peter's queue somewhere?
> >
> > The patch has already reached mainline through tip/sched-urgent-for-linus
> >
>
> Bah, I pasted the wrong subject. I am thinking of your
> patch "sched/fair: fix statistics for find_idlest_group" --
> https://lore.kernel.org/lkml/20200218144534.4564-1-vincent.guittot@linaro.org/
> It still appears to be relevant or did I miss something else?

you're right, it's not yet in tip/sched/core but most probably in Peter's queue

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-25 14:24       ` Mel Gorman
  2020-02-25 14:53         ` Vincent Guittot
@ 2020-02-27  9:09         ` Ingo Molnar
  1 sibling, 0 replies; 86+ messages in thread
From: Ingo Molnar @ 2020-02-27  9:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> On Tue, Feb 25, 2020 at 02:28:16PM +0100, Vincent Guittot wrote:
> > >
> > > Will do.
> > >
> > > However I noticed that "sched/fair: Fix find_idlest_group() to handle
> > > CPU affinity" did not make it to tip/sched/core. Peter seemed to think it
> > > was fine. Was it rejected or is it just sitting in Peter's queue somewhere?
> > 
> > The patch has already reached mainline through tip/sched-urgent-for-linus
> > 
> 
> Bah, I pasted the wrong subject. I am thinking of your
> patch "sched/fair: fix statistics for find_idlest_group" --
> https://lore.kernel.org/lkml/20200218144534.4564-1-vincent.guittot@linaro.org/
> It still appears to be relevant or did I miss something else?

Applied to tip:sched/urgent now, thanks guys!

	Ingo

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
                   ` (13 preceding siblings ...)
  2020-02-24 15:16 ` [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Ingo Molnar
@ 2020-03-09 19:12 ` Phil Auld
  2020-03-09 20:36   ` Mel Gorman
  14 siblings, 1 reply; 86+ messages in thread
From: Phil Auld @ 2020-03-09 19:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Valentin Schneider,
	Hillf Danton, LKML

Hi Mel,

On Mon, Feb 24, 2020 at 09:52:10AM +0000 Mel Gorman wrote:
> The only differences in V6 are due to Vincent's latest patch series.
> 
> This is V5 which includes the latest versions of Vincent's patch
> addressing review feedback. Patches 4-9 are Vincent's work plus one
> important performance fix. Vincent's patches were retested and while
> not presented in detail, it was mostly an improvement.
> 
> Changelog since V5:
> o Import Vincent's latest patch set
> 
> Changelog since V4:
> o The V4 send was the completely wrong versions of the patches and
>   is useless.
> 
> Changelog since V3:
> o Remove stray tab						(Valentin)
> o Update comment about allowing a move when src is imbalanced	(Hillf)
> o Updated "sched/pelt: Add a new runnable average signal"	(Vincent)
> 
> Changelog since V2:
> o Rebase on top of Vincent's series again
> o Fix a missed rcu_read_unlock
> o Reduce overhead of tracepoint
> 
> Changelog since V1:
> o Rebase on top of Vincent's series and rework
> 
> Note: The baseline for this series is tip/sched/core as of February
> 	12th rebased on top of v5.6-rc1. The series includes patches from
> 	Vincent as I needed to add a fix and build on top of it. Vincent's
> 	series on its own introduces performance regressions for *some*
> 	but not *all* machines so it's easily missed. This series overall
> 	is close to performance-neutral with some gains depending on the
> 	machine. However, the end result does less work on NUMA balancing
> 	and the fact that both the NUMA balancer and load balancer uses
> 	similar logic makes it much easier to understand.
> 
> The NUMA balancer makes placement decisions on tasks that partially
> take the load balancer into account and vice versa but there are
> inconsistencies. This can result in placement decisions that override
> each other leading to unnecessary migrations -- both task placement
> and page placement. This series reconciles many of the decisions --
> partially Vincent's work with some fixes and optimisations on top to
> merge our two series.
> 
> The first patch is unrelated. It's picked up by tip but was not present in
> the tree at the time of the fork. I'm including it here because I tested
> with it.
> 
> The second and third patches are tracing only and was needed to get
> sensible data out of ftrace with respect to task placement for NUMA
> balancing. The NUMA balancer is *far* easier to analyse with the
> patches and informed how the series should be developed.
> 
> Patches 4-5 are Vincent's and use very similar code patterns and logic
> between NUMA and load balancer. Patch 6 is a fix to Vincent's work that
> is necessary to avoid serious imbalances being introduced by the NUMA
> balancer. Patches 7-9 are also Vincents and while I have not reviewed
> them closely myself, others have.
> 
> The rest of the series are a mix of optimisations and improvements, one
> of which stops the NUMA balancer fighting with itself.
> 
> Note that this is not necessarily a universal performance win although
> performance results are generally ok (small gains/losses depending on
> the machine and workload). However, task migrations, page migrations,
> variability and overall overhead are generally reduced.
> 
> The main reference workload I used was specjbb running one JVM per node
> which typically would be expected to split evenly. It's an interesting
> workload because the number of "warehouses" does not linearly related
> to the number of running tasks due to the creation of GC threads
> and other interfering activity. The mmtests configuration used is
> jvm-specjbb2005-multi with two runs -- one with ftrace enabling relevant
> scheduler tracepoints.
> 
> An example of the headline performance of the series is below and the
> tested kernels are
> 
> baseline-v3r1	Patches 1-3 for the tracing
> loadavg-v3	Patches 1-5 (Add half of Vincent's work)
> lbidle-v6	Patches 1-6 Vincent's work with a fix on top
> classify-v6	Patches 1-9 Rest of Vincent's work
> stopsearch-v6	All patches
> 
>                                5.6.0-rc1              5.6.0-rc1              5.6.0-rc1              5.6.0-rc1              5.6.0-rc1
>                              baseline-v3             loadavg-v3              lbidle-v3            classify-v6          stopsearch-v6
> Hmean     tput-1     43593.49 (   0.00%)    41616.85 (  -4.53%)    43657.25 (   0.15%)    38110.46 * -12.58%*    42213.29 (  -3.17%)
> Hmean     tput-2     95692.84 (   0.00%)    93196.89 *  -2.61%*    92287.78 *  -3.56%*    89077.29 (  -6.91%)    96474.49 *   0.82%*
> Hmean     tput-3    143813.12 (   0.00%)   134447.05 *  -6.51%*   134587.84 *  -6.41%*   133706.98 (  -7.03%)   144279.90 (   0.32%)
> Hmean     tput-4    190702.67 (   0.00%)   176533.79 *  -7.43%*   182278.42 *  -4.42%*   181405.19 (  -4.88%)   189948.10 (  -0.40%)
> Hmean     tput-5    230242.39 (   0.00%)   209059.51 *  -9.20%*   223219.06 (  -3.05%)   227188.16 (  -1.33%)   225220.39 (  -2.18%)
> Hmean     tput-6    274868.74 (   0.00%)   246470.42 * -10.33%*   258387.09 *  -6.00%*   264252.76 (  -3.86%)   271429.49 (  -1.25%)
> Hmean     tput-7    312281.15 (   0.00%)   284564.06 *  -8.88%*   296446.00 *  -5.07%*   302682.72 (  -3.07%)   309187.26 (  -0.99%)
> Hmean     tput-8    347261.31 (   0.00%)   332019.39 *  -4.39%*   331202.25 *  -4.62%*   339469.52 (  -2.24%)   345504.60 (  -0.51%)
> Hmean     tput-9    387336.25 (   0.00%)   352219.62 *  -9.07%*   370222.03 *  -4.42%*   367077.01 (  -5.23%)   381610.17 (  -1.48%)
> Hmean     tput-10   421586.76 (   0.00%)   397304.22 (  -5.76%)   405458.01 (  -3.83%)   416689.66 (  -1.16%)   415549.97 (  -1.43%)
> Hmean     tput-11   459422.43 (   0.00%)   398023.51 * -13.36%*   441999.08 (  -3.79%)   449912.39 (  -2.07%)   454458.04 (  -1.08%)
> Hmean     tput-12   499087.97 (   0.00%)   400914.35 * -19.67%*   475755.59 (  -4.68%)   493678.32 (  -1.08%)   493936.79 (  -1.03%)
> Hmean     tput-13   536335.59 (   0.00%)   406101.41 * -24.28%*   514858.97 (  -4.00%)   528496.01 (  -1.46%)   530662.68 (  -1.06%)
> Hmean     tput-14   571542.75 (   0.00%)   478797.13 * -16.23%*   551716.00 (  -3.47%)   553771.29 (  -3.11%)   565915.55 (  -0.98%)
> Hmean     tput-15   601412.81 (   0.00%)   534776.98 * -11.08%*   580105.28 (  -3.54%)   597513.89 (  -0.65%)   596192.34 (  -0.87%)
> Hmean     tput-16   629817.55 (   0.00%)   407294.29 * -35.33%*   615606.40 (  -2.26%)   630044.12 (   0.04%)   627806.13 (  -0.32%)
> Hmean     tput-17   667025.18 (   0.00%)   457416.34 * -31.42%*   626074.81 (  -6.14%)   659706.41 (  -1.10%)   658350.40 (  -1.30%)
> Hmean     tput-18   688148.21 (   0.00%)   518534.45 * -24.65%*   663161.87 (  -3.63%)   675616.08 (  -1.82%)   682224.35 (  -0.86%)
> Hmean     tput-19   705092.87 (   0.00%)   466530.37 * -33.83%*   689430.29 (  -2.22%)   691050.89 (  -1.99%)   705532.41 (   0.06%)
> Hmean     tput-20   711481.44 (   0.00%)   564355.80 * -20.68%*   692170.67 (  -2.71%)   717866.36 (   0.90%)   716243.50 (   0.67%)
> Hmean     tput-21   739790.92 (   0.00%)   508542.10 * -31.26%*   712348.91 (  -3.71%)   724666.68 (  -2.04%)   723361.87 (  -2.22%)
> Hmean     tput-22   730593.57 (   0.00%)   540881.37 ( -25.97%)   709794.02 (  -2.85%)   727177.54 (  -0.47%)   721353.36 (  -1.26%)
> Hmean     tput-23   738401.59 (   0.00%)   561474.46 * -23.96%*   702869.93 (  -4.81%)   720954.73 (  -2.36%)   720813.53 (  -2.38%)
> Hmean     tput-24   731301.95 (   0.00%)   582929.73 * -20.29%*   704337.59 (  -3.69%)   717204.03 *  -1.93%*   714131.38 *  -2.35%*
> Hmean     tput-25   734414.40 (   0.00%)   591635.13 ( -19.44%)   702334.30 (  -4.37%)   720272.39 (  -1.93%)   714245.12 (  -2.75%)
> Hmean     tput-26   724774.17 (   0.00%)   701310.59 (  -3.24%)   700771.85 (  -3.31%)   718084.92 (  -0.92%)   712988.02 (  -1.63%)
> Hmean     tput-27   713484.55 (   0.00%)   632795.43 ( -11.31%)   692213.36 (  -2.98%)   710432.96 (  -0.43%)   703087.86 (  -1.46%)
> Hmean     tput-28   723111.86 (   0.00%)   697438.61 (  -3.55%)   695934.68 (  -3.76%)   708413.26 (  -2.03%)   703449.60 (  -2.72%)
> Hmean     tput-29   714690.69 (   0.00%)   675820.16 (  -5.44%)   689400.90 (  -3.54%)   698436.85 (  -2.27%)   699981.24 (  -2.06%)
> Hmean     tput-30   711106.03 (   0.00%)   699748.68 (  -1.60%)   688439.96 (  -3.19%)   698258.70 (  -1.81%)   691636.96 (  -2.74%)
> Hmean     tput-31   701632.39 (   0.00%)   698807.56 (  -0.40%)   682588.20 (  -2.71%)   696608.99 (  -0.72%)   691015.36 (  -1.51%)
> Hmean     tput-32   703479.77 (   0.00%)   679020.34 (  -3.48%)   674057.11 *  -4.18%*   690706.86 (  -1.82%)   684958.62 (  -2.63%)
> Hmean     tput-33   691594.71 (   0.00%)   686583.04 (  -0.72%)   673382.64 (  -2.63%)   687319.97 (  -0.62%)   683367.65 (  -1.19%)
> Hmean     tput-34   693435.51 (   0.00%)   685137.16 (  -1.20%)   674883.97 (  -2.68%)   684897.97 (  -1.23%)   674923.39 (  -2.67%)
> Hmean     tput-35   688036.06 (   0.00%)   682612.92 (  -0.79%)   668159.93 (  -2.89%)   679301.53 (  -1.27%)   678117.69 (  -1.44%)
> Hmean     tput-36   678957.95 (   0.00%)   670160.33 (  -1.30%)   662395.36 (  -2.44%)   672165.17 (  -1.00%)   668512.57 (  -1.54%)
> Hmean     tput-37   679748.70 (   0.00%)   675428.41 (  -0.64%)   666970.33 (  -1.88%)   674127.70 (  -0.83%)   667644.78 (  -1.78%)
> Hmean     tput-38   669969.62 (   0.00%)   670976.06 (   0.15%)   660499.74 (  -1.41%)   670848.38 (   0.13%)   666646.89 (  -0.50%)
> Hmean     tput-39   669291.41 (   0.00%)   665367.66 (  -0.59%)   649337.71 (  -2.98%)   659685.61 (  -1.44%)   658818.08 (  -1.56%)
> Hmean     tput-40   668074.80 (   0.00%)   672478.06 (   0.66%)   661273.87 (  -1.02%)   665147.36 (  -0.44%)   660279.43 (  -1.17%)
> 
> Note the regression with the first two patches of Vincent's work
> (loadavg-v3) followed by lbidle-v3 which mostly restores the performance
> and the final version keeping things close to performance neutral (showing
> a mix but within noise). This is not universal as a different 2-socket
> machine with fewer cores and older CPUs showed no difference. EPYC 1 and
> EPYC 2 were both affected by the regression as well as a 4-socket Intel
> box but again, the full series is mostly performance neutral for specjbb
> but with less NUMA balancing work.
> 
> While not presented here, the full series also shows that the throughput
> measured by each JVM is less variable.
> 
> The high-level NUMA stats from /proc/vmstat look like this
> 
>                                       5.6.0-rc1      5.6.0-rc1      5.6.0-rc1      5.6.0-rc1      5.6.0-rc1
>                                     baseline-v3     loadavg-v3      lbidle-v3    classify-v3  stopsearch-v3
> Ops NUMA alloc hit                    878062.00      882981.00      957762.00      961630.00      880821.00
> Ops NUMA alloc miss                        0.00           0.00           0.00           0.00           0.00
> Ops NUMA interleave hit               225582.00      237785.00      242554.00      234671.00      234818.00
> Ops NUMA alloc local                  764932.00      763850.00      835939.00      843950.00      763005.00
> Ops NUMA base-page range updates     2517600.00     3707398.00     2889034.00     2442203.00     3303790.00
> Ops NUMA PTE updates                 1754720.00     1672198.00     1569610.00     1356763.00     1591662.00
> Ops NUMA PMD updates                    1490.00        3975.00        2577.00        2120.00        3344.00
> Ops NUMA hint faults                 1678620.00     1586860.00     1475303.00     1285152.00     1512208.00
> Ops NUMA hint local faults %         1461203.00     1389234.00     1181790.00     1085517.00     1411194.00
> Ops NUMA hint local percent               87.05          87.55          80.10          84.47          93.32
> Ops NUMA pages migrated                69473.00       62504.00      121893.00       80802.00       46266.00
> Ops AutoNUMA cost                       8412.04        7961.44        7399.05        6444.39        7585.05
> 
> Overall, the local hints percentage is slightly better but crucially,
> it's done with much less page migrations.
> 
> A separate run gathered information from ftrace and analysed it
> offline. This is based on an earlier version of the series but the changes
> are not significant enough to warrant a rerun as there are no changes in
> the NUMA balancing optimisations.
> 
>                                              5.6.0-rc1       5.6.0-rc1
>                                            baseline-v2   stopsearch-v2
> Ops Migrate failed no CPU                      1871.00          689.00
> Ops Migrate failed move to   idle                 0.00            0.00
> Ops Migrate failed swap task fail               872.00          568.00
> Ops Task Migrated swapped                      6702.00         3344.00
> Ops Task Migrated swapped local NID               0.00            0.00
> Ops Task Migrated swapped within group         1094.00          124.00
> Ops Task Migrated idle CPU                    14409.00        14610.00
> Ops Task Migrated idle CPU local NID              0.00            0.00
> Ops Task Migrate retry                         2355.00         1074.00
> Ops Task Migrate retry success                    0.00            0.00
> Ops Task Migrate retry failed                  2355.00         1074.00
> Ops Load Balance cross NUMA                 1248401.00      1261853.00
> 
> "Migrate failed no CPU" is the times when NUMA balancing did not
> find a suitable page on a preferred node. This is increased because
> the series avoids making decisions that the LB would override.
> 
> "Migrate failed swap task fail" is when migrate_swap fails and it
> can fail for a lot of reasons.
> 
> "Task Migrated swapped" is lower which would would be a concern but in
> this test, locality was higher unlike the test with tracing disabled.
> This event triggers when two tasks are swapped to keep load neutral or
> improved from the perspective of the load balancer. The series attempts
> to swap tasks that both move to their preferred node.
> 
> "Task Migrated idle CPU" is similar and while the the series does try to
> avoid NUMA Balancer and LB fighting each other, it also continues to
> obey overall CPU load balancer.
> 
> "Task Migrate retry failed" happens when NUMA balancing makes multiple
> attempts to place a task on a preferred node. It is slightly reduced here
> but it would generally be expected to happen to maintain CPU load balance.
> 
> A variety of other workloads were evaluated and appear to be mostly
> neutral or improved. netperf running on localhost shows gains between 1-8%
> depending on the machine. hackbench is a mixed bag -- small regressions
> on one machine around 1-2% depending on the group count but up to 15%
> gain on another machine. dbench looks generally ok, very small performance
> gains. pgbench looks ok, small gains and losses, much of which is within
> the noise. schbench (Facebook workload that is sensitive to wakeup
> latencies) is mostly good.  The autonuma benchmark also generally looks
> good, most differences are within the noise but with higher locality
> success rates and fewer page migrations. Other long lived workloads are
> still running but I'm not expecting many surprises.
> 
>  include/linux/sched.h        |  31 ++-
>  include/trace/events/sched.h |  49 ++--
>  kernel/sched/core.c          |  13 -
>  kernel/sched/debug.c         |  17 +-
>  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
>  kernel/sched/pelt.c          |  59 ++--
>  kernel/sched/sched.h         |  42 ++-
>  7 files changed, 535 insertions(+), 302 deletions(-)
> 
> -- 
> 2.16.4
> 

Our Perf team has been comparing tip/sched/core (v5.6-rc3 + lb_numa series) 
with upstream v5.6-rc3 and has noticed some regressions. 

Here's a summary of what Jirka Hladky reported to me:

---
We see following problems when comparing 5.6.0_rc3.tip_lb_numa+-5 against
5.6.0-0.rc3.1.elrdy:

  • performance drop by 20% - 40% across almost all benchmarks especially 
    for low and medium thread counts and especially on 4 NUMA and 8 NUMA nodes
    servers
  • 2 NUMA nodes servers are affected as well, but performance drop is less
    significant (10-20%)
  • servers with just one NUMA node are NOT affected
  • we see big load imbalances between different NUMA nodes
---

The actual data reports are on an intranet web page so they are harder to 
share. I can create PDFs or screenshots but I didn't want to just blast 
those to the list. I'd be happy to send some directly if you are interested. 

Some data in text format I can easily include shows imbalances across the
numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
pull and see the data in text. The regressions can be seen in other tests
as well.

For example:

5.6.0_rc3.tip_lb_numa+
sp.C.x_008_02  - CPU load average across the individual NUMA nodes 
(timestep is 5 seconds)
# NUMA | AVR | Utilization over time in percentage
  0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
  1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
  2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
  3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
  4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
  5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
  6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
  7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0

5.6.0-0.rc3.1.elrdy
sp.C.x_008_01  - CPU load average across the individual NUMA nodes 
(timestep is 5 seconds)
# NUMA | AVR | Utilization over time in percentage
  0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
  1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
  2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
  3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
  4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
  5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
  6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
  7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5


mops/s sp_C_x
kernel      		threads	8 	16 	32 	48 	56 	64
5.6.0-0.rc3.1.elrdy 	mean 	22819.8 39955.3 34301.6 31086.8 30316.2 30089.2
			max 	23911.4 47185.6 37995.9 33742.6 33048.0 30853.4
			min 	20799.3 36518.0 31459.4 27559.9 28565.7 29306.3
			stdev 	1096.7 	3965.3 	2350.2 	2134.7 	1871.1 	612.1
			first_q 22780.4 37263.7 32081.7 29955.8 28609.8 29632.0
			median 	22936.7 37577.6 34866.0 32020.8 29299.9 29906.3
			third_q 23671.0 41231.7 35105.1 32154.7 32057.8 30748.0
5.6.0_rc3.tip_lb_numa 	mean 	12111.1 24712.6 30147.8 32560.7 31040.4 28219.4
			max 	17772.9 28934.4 33828.3 33659.3 32275.3 30434.9
			min 	9869.9 	18407.9 25092.7 31512.9 29964.3 25652.8
			stdev 	2868.4 	3673.6 	2877.6 	832.2 	765.8 	1800.6
			first_q 10763.4 23346.1 29608.5 31827.2 30609.4 27008.8
			median 	10855.0 25415.4 30804.4 32462.1 31061.8 27992.6
			third_q 11294.5 27459.2 31405.0 33341.8 31291.2 30007.9
Comparison 		mean 	-47 	-38 	-12 	5 	2 	-6
			median 	-53 	-32 	-12 	1 	6 	-6


On 5.6.0-rc3.tip-lb_numa+ we see:

  • BIG fluctuation in runtime
  • NAS running up to 2x slower than on 5.6.0-0.rc3.1.elrdy

$ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 125.73
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 87.54
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 86.93
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 165.98
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 114.78

On the other hand, runtime on 5.6.0-0.rc3.1.elrdy is stable:
$ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 59.83
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 67.72
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 63.62
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 55.01
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 65.20

It looks like things are moving around a lot but not getting balanced
as well across the numa nodes. I have a couple of nice heat maps that 
show this if you want to see them. 


Thanks,
Phil

-- 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-09 19:12 ` Phil Auld
@ 2020-03-09 20:36   ` Mel Gorman
  2020-03-12  9:54     ` Mel Gorman
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-03-09 20:36 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Valentin Schneider,
	Hillf Danton, LKML

On Mon, Mar 09, 2020 at 03:12:34PM -0400, Phil Auld wrote:
> > A variety of other workloads were evaluated and appear to be mostly
> > neutral or improved. netperf running on localhost shows gains between 1-8%
> > depending on the machine. hackbench is a mixed bag -- small regressions
> > on one machine around 1-2% depending on the group count but up to 15%
> > gain on another machine. dbench looks generally ok, very small performance
> > gains. pgbench looks ok, small gains and losses, much of which is within
> > the noise. schbench (Facebook workload that is sensitive to wakeup
> > latencies) is mostly good.  The autonuma benchmark also generally looks
> > good, most differences are within the noise but with higher locality
> > success rates and fewer page migrations. Other long lived workloads are
> > still running but I'm not expecting many surprises.
> > 
> >  include/linux/sched.h        |  31 ++-
> >  include/trace/events/sched.h |  49 ++--
> >  kernel/sched/core.c          |  13 -
> >  kernel/sched/debug.c         |  17 +-
> >  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
> >  kernel/sched/pelt.c          |  59 ++--
> >  kernel/sched/sched.h         |  42 ++-
> >  7 files changed, 535 insertions(+), 302 deletions(-)
> > 
> > -- 
> > 2.16.4
> > 
> 
> Our Perf team was been comparing tip/sched/core (v5.6-rc3 + lb_numa series) 
> with upstream v5.6-rc3 and has noticed some regressions. 
> 
> Here's a summary of what Jirka Hladky reported to me:
> 
> ---
> We see following problems when comparing 5.6.0_rc3.tip_lb_numa+-5 against
> 5.6.0-0.rc3.1.elrdy:
> 
>   • performance drop by 20% - 40% across almost all benchmarks especially 
>     for low and medium thread counts and especially on 4 NUMA and 8 NUMA nodes
>     servers
>   • 2 NUMA nodes servers are affected as well, but performance drop is less
>     significant (10-20%)
>   • servers with just one NUMA node are NOT affected
>   • we see big load imbalances between different NUMA nodes
> ---
> 

UMA being unaffected is not a surprise; the rest obviously is.

> The actual data reports are on an intranet web page so they are harder to 
> share. I can create PDFs or screenshots but I didn't want to just blast 
> those to the list. I'd be happy to send some direclty if you are interested. 
> 

Send them to me privately please.

> Some data in text format I can easily include shows imbalances across the
> numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
> pull and see the data in text. The regressions can be seen in other tests
> as well.
> 

What was the value for x?

I ask because I ran NAS across a variety of machines for C class in two
configurations -- one using as many CPUs as possible and one running
with a third of the available CPUs for both MPI and OMP. Generally there
were small gains and losses across multiple kernels but often within the
noise or within a few percent of each other.

The largest machine I had available was 4 sockets.

The other curiosity is that you used C class. On bigger machines, that
is very short-lived to the point of being almost useless. Is D class
similarly affected?

> For example:
> 
> 5.6.0_rc3.tip_lb_numa+
> sp.C.x_008_02  - CPU load average across the individual NUMA nodes 
> (timestep is 5 seconds)
> # NUMA | AVR | Utilization over time in percentage
>   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
>   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
>   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
>   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
>   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
>   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
>   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
>   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> 
> 5.6.0-0.rc3.1.elrdy
> sp.C.x_008_01  - CPU load average across the individual NUMA nodes 
> (timestep is 5 seconds)
> # NUMA | AVR | Utilization over time in percentage
>   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
>   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
>   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
>   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
>   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
>   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
>   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
>   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> 

A critical difference with the series is that large imbalances shouldn't
happen but prior to the series the NUMA balancing would keep trying to
move tasks to a node with load balancing moving them back. That should
not happen any more but there are cases where it's actually faster to
have the fight between NUMA balancing and load balancing. Ideally a
degree of imbalance would be allowed but I haven't found a way of doing
that without side effects.

Generally the utilisation looks low in either kernel, making me think
the value for x is relatively small.

> 
> mops/s sp_C_x
> kernel      		threads	8 	16 	32 	48 	56 	64
> 5.6.0-0.rc3.1.elrdy 	mean 	22819.8 39955.3 34301.6 31086.8 30316.2 30089.2
> 			max 	23911.4 47185.6 37995.9 33742.6 33048.0 30853.4
> 			min 	20799.3 36518.0 31459.4 27559.9 28565.7 29306.3
> 			stdev 	1096.7 	3965.3 	2350.2 	2134.7 	1871.1 	612.1
> 			first_q 22780.4 37263.7 32081.7 29955.8 28609.8 29632.0
> 			median 	22936.7 37577.6 34866.0 32020.8 29299.9 29906.3
> 			third_q 23671.0 41231.7 35105.1 32154.7 32057.8 30748.0
> 5.6.0_rc3.tip_lb_numa 	mean 	12111.1 24712.6 30147.8 32560.7 31040.4 28219.4
> 			max 	17772.9 28934.4 33828.3 33659.3 32275.3 30434.9
> 			min 	9869.9 	18407.9 25092.7 31512.9 29964.3 25652.8
> 			stdev 	2868.4 	3673.6 	2877.6 	832.2 	765.8 	1800.6
> 			first_q 10763.4 23346.1 29608.5 31827.2 30609.4 27008.8
> 			median 	10855.0 25415.4 30804.4 32462.1 31061.8 27992.6
> 			third_q 11294.5 27459.2 31405.0 33341.8 31291.2 30007.9
> Comparison 		mean 	-47 	-38 	-12 	5 	2 	-6
> 			median 	-53 	-32 	-12 	1 	6 	-6
> 

So I *think* this is observing the difference in imbalance. The range
for 8 threads is massive but it stabilises when the thread count is
higher. When fewer threads are used than a single NUMA node can hold, it
can be beneficial to run everything on one node but the load balancer
doesn't let that happen and the NUMA balancer no longer fights it.

> On 5.6.0-rc3.tip-lb_numa+ we see:
> 
>   • BIG fluctuation in runtime
>   • NAS running up to 2x slower than on 5.6.0-0.rc3.1.elrdy
> 
> $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 125.73
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 87.54
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 86.93
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 165.98
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 114.78
> 
> On the other hand, runtime on 5.6.0-0.rc3.1.elrdy is stable:
> $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 59.83
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 67.72
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 63.62
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 55.01
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 65.20
> 
> It looks like things are moving around a lot but not getting balanced
> as well across the numa nodes. I have a couple of nice heat maps that 
> show this if you want to see them. 
> 

I'd like to see the heatmap. I just looked at the ones I had for NAS and
I'm not seeing a bad pattern with either all CPUs used or a third. What
I'm looking for is a pattern showing higher utilisation on one node over
another in the baseline kernel and showing relatively even utilisation
in tip/sched/core.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-09 20:36   ` Mel Gorman
@ 2020-03-12  9:54     ` Mel Gorman
  2020-03-12 12:17       ` Jirka Hladky
       [not found]       ` <CAE4VaGA4q4_qfC5qe3zaLRfiJhvMaSb2WADgOcQeTwmPvNat+A@mail.gmail.com>
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-03-12  9:54 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Valentin Schneider,
	Hillf Danton, Jirka Hladky, LKML

On Mon, Mar 09, 2020 at 08:36:25PM +0000, Mel Gorman wrote:
> > The actual data reports are on an intranet web page so they are harder to 
> > share. I can create PDFs or screenshots but I didn't want to just blast 
> > those to the list. I'd be happy to send some direclty if you are interested. 
> > 
> 
> Send them to me privately please.
> 
> > Some data in text format I can easily include shows imbalances across the
> > numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
> > pull and see the data in text. The regressions can be seen in other tests
> > as well.
> > 
> 
> What was the value for x?
> 
> I ask because I ran NAS across a variety of machines for C class in two
> configurations -- one using as many CPUs as possible and one running
> with a third of the available CPUs for both MPI and OMP. Generally there
> were small gains and losses across multiple kernels but often within the
> noise or within a few percent of each other.
> 

On re-examining the case, this pattern matches. There are some obvious
corner cases for large machines that have low utilisation. With the
old behaviour, load balancing would spread load evenly across all
available NUMA nodes while NUMA balancing would constantly adjust it
for locality. The old load balancer does this even if a task starts
with all of its memory local to one node.

The degree where it causes the most problems appears to be roughly for
task counts lower than 2 * NR_NODES as per the small imbalance allowed by
adjust_numa_imbalance but the actual distribution is variable. It's not
always 2 per node, sometimes it can be a little higher depending on when
idle balancing happens and other machine activity. This is not universal
as other machine sizes and workloads are fine with the new behaviour and
generally benefit.
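
For reference, the small imbalance allowed between NUMA nodes boils down
to a check along the lines of the sketch below -- a simplified version of
adjust_numa_imbalance, not necessarily the exact code that ends up in
tip, but enough to show why roughly two tasks per node can stack before
the load balancer intervenes.

  /*
   * Simplified sketch: tolerate a small imbalance when the source
   * domain is nearly idle so that a pair of communicating tasks can
   * stay on the same node instead of being spread apart.
   */
  static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
  {
  	unsigned int imbalance_min = 2;

  	if (src_nr_running <= imbalance_min)
  		return 0;

  	return imbalance;
  }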

The problem is particularly visible when the only active tasks in the
system have set numa_preferred_nid because as far as the load balancer and
NUMA balancer are concerned, there is no reason to force the SP workload
to spread wide.

> The largest machine I had available was 4 sockets.
> 
> The other curiousity is that you used C class. On bigger machines, that
> is very short lived to the point of being almost useless. Is D class
> similarly affected?
> 

I expect D class to be similarly affected because the same pattern holds
-- tasks stay on CPUs local to their memory even though more memory
bandwidth may be available on remote nodes.

> > 5.6.0_rc3.tip_lb_numa+
> > sp.C.x_008_02  - CPU load average across the individual NUMA nodes 
> > (timestep is 5 seconds)
> > # NUMA | AVR | Utilization over time in percentage
> >   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
> >   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
> >   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
> >   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
> >   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
> >   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
> >   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
> >   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> > 
> > 5.6.0-0.rc3.1.elrdy
> > sp.C.x_008_01  - CPU load average across the individual NUMA nodes 
> > (timestep is 5 seconds)
> > # NUMA | AVR | Utilization over time in percentage
> >   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
> >   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
> >   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
> >   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
> >   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
> >   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
> >   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
> >   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> > 
> 
> A critical difference with the series is that large imbalances shouldn't
> happen but prior to the series the NUMA balancing would keep trying to
> move tasks to a node with load balancing moving them back. That should
> not happen any more but there are cases where it's actually faster to
> have the fight between NUMA balancing and load balancing. Ideally a
> degree of imbalance would be allowed but I haven't found a way of doing
> that without side effects.
> 

So this is what's happening -- at low utilisation, tasks are staying local
to their memory. For a lot of cases, this is a good thing -- communicating
tasks stay local for example and tasks that are not completely memory
bound benefit. Machines that have sufficient local memory bandwidth also
appear to benefit.

sp.C appears to be a significant corner case when the degree of
parallelisation is lower than the number of NUMA nodes in the system
and, of the NAS workloads, bt is also mildly affected. In each case,
memory was almost completely local and there was low NUMA activity but
performance suffered. This is the BT case:

                            5.6.0-rc3              5.6.0-rc3
                              vanilla     schedcore-20200227
Min       bt.C      176.05 (   0.00%)      185.03 (  -5.10%)
Amean     bt.C      178.62 (   0.00%)      185.54 *  -3.88%*
Stddev    bt.C        4.26 (   0.00%)        0.60 (  85.95%)
CoeffVar  bt.C        2.38 (   0.00%)        0.32 (  86.47%)
Max       bt.C      186.09 (   0.00%)      186.48 (  -0.21%)
BAmean-50 bt.C      176.18 (   0.00%)      185.08 (  -5.06%)
BAmean-95 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
BAmean-99 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)

Note the spread in performance. tip/sched/core looks worse than average but
its coefficient of variance was just 0.32% versus 2.38% with the vanilla
kernel. The vanilla kernel is a lot less stable in terms of performance
due to the fighting between CPU Load and NUMA Balancing.

A heatmap of the CPU usage per LLC showed 4 tasks running on 2 nodes
with two nodes idle -- there was almost no other system activity that
would allow the load balancer to balance on tasks that are unconcerned
with locality. The vanilla case was interesting -- of the 5 iterations,
4 spread with one task on 4 nodes but one iteration stacked 4 tasks on
2 nodes so it's not even consistent.  The NUMA activity looked like this
for the overall workload.

Ops NUMA alloc hit                   3450166.00     2406738.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                 1047975.00       41131.00
Ops NUMA base-page range updates    15864254.00    16283456.00
Ops NUMA PTE updates                15148478.00    15563584.00
Ops NUMA PMD updates                    1398.00        1406.00
Ops NUMA hint faults                15128332.00    15535357.00
Ops NUMA hint local faults %        12253847.00    14471269.00
Ops NUMA hint local percent               81.00          93.15
Ops NUMA pages migrated               993033.00           4.00
Ops AutoNUMA cost                      75771.58       77790.77

PTE hinting was more or less the same but look at the locality. 81%
local for the baseline vanilla kernel and 93.15% for what's in
tip/sched/core. The baseline kernel migrates almost 1 million pages over
15 minutes (5 iterations) and tip/sched/core migrates ... 4 pages.

Looking at the faults over time, the baseline kernel initially faults
with pages local, drops to 80% shortly after starting and then starts
climbing back up again as pages get migrated. Initially the number of
hints the baseline kernel traps is extremely high and drops as pages
migrate

Most others were almost neutral with the impact of the series more
obvious in some than others. is.C is really short-lived, for example, but
locality of faults went from 43% to 95% local.

sp.C was by far the most obvious impact

                            5.6.0-rc3              5.6.0-rc3
                              vanilla     schedcore-20200227
Min       sp.C      141.52 (   0.00%)      173.61 ( -22.68%)
Amean     sp.C      141.87 (   0.00%)      174.00 * -22.65%*
Stddev    sp.C        0.26 (   0.00%)        0.25 (   5.06%)
CoeffVar  sp.C        0.18 (   0.00%)        0.14 (  22.59%)
Max       sp.C      142.10 (   0.00%)      174.25 ( -22.62%)
BAmean-50 sp.C      141.59 (   0.00%)      173.79 ( -22.74%)
BAmean-95 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
BAmean-99 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)

That's a big hit in terms of performance and it looks less
variable. Looking at the NUMA stats

Ops NUMA alloc hit                   3100836.00     2161667.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                  915700.00       98531.00
Ops NUMA base-page range updates    12178032.00    13483382.00
Ops NUMA PTE updates                11809904.00    12792182.00
Ops NUMA PMD updates                     719.00        1350.00
Ops NUMA hint faults                11791912.00    12782512.00
Ops NUMA hint local faults %         9345987.00    11467427.00
Ops NUMA hint local percent               79.26          89.71
Ops NUMA pages migrated               871805.00       21505.00
Ops AutoNUMA cost                      59061.37       64007.35

Note the locality -- 79.26% to 89.71% but the vanilla kernel migrated 871K
pages and the new kernel migrates 21K. Looking at migrations over time,
I can see that the vanilla kernel migrates 180K pages in the first 10
seconds of each iteration while tip/sched/core migrated few enough that
it's not even clear on the graph. The workload was long-lived enough that
the initial disruption was less visible when running for long enough.

The problem is that there is nothing unique that the kernel measures that
I can think of that uniquely identifies that SP should spread wide and
migrate early to move its shared pages from other processes that are less
memory bound or communicating heavily. The state is simply not maintained
and it cannot be inferred from the runqueue or task state. From both a
locality point of view and available CPUs, leaving SP alone makes sense
but we do not detect that memory bandwidth is an issue. In other cases, the
cost of migrations alone would damage performance and SP is an exception as
it's long-lived enough to benefit once the first few seconds have passed.

I experimented with a few different approaches but without being able to
detect the bandwidth, it was a case that SP can be improved but almost
everything else suffers. For example, SP on 2-socket degrades when spread
too quickly on machines with enough memory bandwidth so with tip/sched/core
SP either benefits or suffers depending on the machine. Basic communicating
tasks degrade 4-8% depending on the machine and exact workload when moving
back to the vanilla kernel and that is fairly universal AFAIS.

So I think that the new behaviour generally is more sane -- do not
excessively fight between memory and CPU balancing but if there are
suggestions on how to distinguish between tasks that should spread wide
and evenly regardless of initial memory locality then I'm all ears.
I do not think migrating like crazy hoping it happens to work out and
having CPU Load and NUMA Balancing using very different criteria for
evaluation is a better approach.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-12  9:54     ` Mel Gorman
@ 2020-03-12 12:17       ` Jirka Hladky
       [not found]       ` <CAE4VaGA4q4_qfC5qe3zaLRfiJhvMaSb2WADgOcQeTwmPvNat+A@mail.gmail.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-12 12:17 UTC (permalink / raw)
  To: LKML

Hi Mel,

thanks a lot for analyzing it!

My big concern is that the performance drop for low thread counts
(roughly up to 2x the number of NUMA nodes) is not just a rare corner
case, but it might be more common. We see the drop for the following
benchmarks/tests, especially on 8 NUMA node servers. However, four
and even 2 NUMA node servers are affected as well.

Numbers show the average performance drop (median of runtime collected
from 5 subsequent runs) compared to the vanilla kernel.

2x AMD 7351 (EPYC Naples), 8 NUMA nodes
===================================
NAS: sp_C test: -50%, peak perf. drop with 8 threads
NAS: mg_D: -10% with 16 threads
SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
SPECjvm2008: compress and cr_signverify tests: -10% (peak drop with 8 threads)
SPECjbb2005: -10% for 16 threads

4x INTEL Xeon GOLD-6126 with Sub-NUMA clustering enabled, 8 NUMA nodes
=============================================================
NAS: sp_C test: -35%, peak perf. drop with 16 threads
SPECjvm2008: co_sunflow, compress and cr_signverify tests: -10% (peak
drop with 8 threads)
SPECjbb2005: -10% for 24 threads

So far, I have run only a limited number of our tests. I can run our
full testing suite next week when required. Please let me know.

Thanks!
Jirka


On Thu, Mar 12, 2020 at 10:54 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Mar 09, 2020 at 08:36:25PM +0000, Mel Gorman wrote:
> > > The actual data reports are on an intranet web page so they are harder to
> > > share. I can create PDFs or screenshots but I didn't want to just blast
> > > those to the list. I'd be happy to send some direclty if you are interested.
> > >
> >
> > Send them to me privately please.
> >
> > > Some data in text format I can easily include shows imbalances across the
> > > numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
> > > pull and see the data in text. The regressions can be seen in other tests
> > > as well.
> > >
> >
> > What was the value for x?
> >
> > I ask because I ran NAS across a variety of machines for C class in two
> > configurations -- one using as many CPUs as possible and one running
> > with a third of the available CPUs for both MPI and OMP. Generally there
> > were small gains and losses across multiple kernels but often within the
> > noise or within a few percent of each other.
> >
>
> On re-examining the case, this pattern matches. There are some corner cases
> for large machines that have low utilisation that are obvious. With the
> old behaviour, load balancing would even load evenly all available NUMA
> nodes while NUMA balancing would constantly adjust it for locality. The
> old load balancer does this even if a task starts with all of its memory
> local to one node.
>
> The degree where it causes the most problems appears to be roughly for
> task counts lower than 2 * NR_NODES as per the small imbalance allowed by
> adjust_numa_imbalance but the actual distribution is variable. It's not
> always 2 per node, sometimes it can be a little higher depending on when
> idle balancing happens and other machine activity. This is not universal
> as other machine sizes and workloads are fine with the new behaviour and
> generally benefit.
>
> The problem is particularly visible when the only active tasks in the
> system have set numa_preferred_nid because as far as the load balancer and
> NUMA balancer is concerned, there is no reason to force the SP workload
> to spread wide.
>
> > The largest machine I had available was 4 sockets.
> >
> > The other curiousity is that you used C class. On bigger machines, that
> > is very short lived to the point of being almost useless. Is D class
> > similarly affected?
> >
>
> I expect D class to be similarly affected because the same pattern holds
> -- tasks say on CPUs local to their memory even though more memory
> bandwidth may be available on remote nodes.
>
> > > 5.6.0_rc3.tip_lb_numa+
> > > sp.C.x_008_02  - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > >   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
> > >   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
> > >   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
> > >   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
> > >   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
> > >   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
> > >   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
> > >   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> > >
> > > 5.6.0-0.rc3.1.elrdy
> > > sp.C.x_008_01  - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > >   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
> > >   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
> > >   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
> > >   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
> > >   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
> > >   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
> > >   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
> > >   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> > >
> >
> > A critical difference with the series is that large imbalances shouldn't
> > happen but prior to the series the NUMA balancing would keep trying to
> > move tasks to a node with load balancing moving them back. That should
> > not happen any more but there are cases where it's actually faster to
> > have the fight between NUMA balancing and load balancing. Ideally a
> > degree of imbalance would be allowed but I haven't found a way of doing
> > that without side effects.
> >
>
> So this is what's happening -- at low utilisation, tasks are staying local
> to their memory. For a lot of cases, this is a good thing -- communicating
> tasks stay local for example and tasks that are not completely memory
> bound benefit. Machines that have sufficient local memory bandwidth also
> appear to benefit.
>
> sp.C appears to be a significant corner case when the degree of
> parallelisation is lower than the number of NUMA nodes in the system
> and of the NAS workloads, bt is also mildly affected.  In each cases,
> memory was almost completely local and there was low NUMA activity but
> performance suffered. This is the BT case;
>
>                             5.6.0-rc3              5.6.0-rc3
>                               vanilla     schedcore-20200227
> Min       bt.C      176.05 (   0.00%)      185.03 (  -5.10%)
> Amean     bt.C      178.62 (   0.00%)      185.54 *  -3.88%*
> Stddev    bt.C        4.26 (   0.00%)        0.60 (  85.95%)
> CoeffVar  bt.C        2.38 (   0.00%)        0.32 (  86.47%)
> Max       bt.C      186.09 (   0.00%)      186.48 (  -0.21%)
> BAmean-50 bt.C      176.18 (   0.00%)      185.08 (  -5.06%)
> BAmean-95 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
> BAmean-99 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
>
> Note the spread in performance. tip/sched/core looks worse than average but
> its coefficient of variance was just 0.32% versus 2.38% with the vanilla
> kernel. The vanilla kernel is a lot less stable in terms of performance
> due to the fighting between CPU Load and NUMA Balancing.
>
> A heatmap of the CPU usage per LLC showed 4 tasks running on 2 nodes
> with two nodes idle -- there was almost no other system activity that
> would allow the load balancer to balance on tasks that are unconcerned
> with locality. The vanilla case was interesting -- of the 5 iterations,
> 4 spread with one task on 4 nodes but one iteration stacked 4 tasks on
> 2 nodes so it's not even consistent.  The NUMA activity looked like this
> for the overall workload.
>
> Ops NUMA alloc hit                   3450166.00     2406738.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                 1047975.00       41131.00
> Ops NUMA base-page range updates    15864254.00    16283456.00
> Ops NUMA PTE updates                15148478.00    15563584.00
> Ops NUMA PMD updates                    1398.00        1406.00
> Ops NUMA hint faults                15128332.00    15535357.00
> Ops NUMA hint local faults %        12253847.00    14471269.00
> Ops NUMA hint local percent               81.00          93.15
> Ops NUMA pages migrated               993033.00           4.00
> Ops AutoNUMA cost                      75771.58       77790.77
>
> PTE hinting was more or less the same but look at the locality. 81%
> local for the baseline vanilla kernel and 93.15% for what's in
> tip/sched/core. The baseline kernel migrates almost 1 million pages over
> 15 minutes (5 iterations) and tip/sched/core migrates ... 4 pages.
>
> Looking at the faults over time, the baseline kernel initially faults
> with pages local, drops to 80% shortly after starting and then starts
> climbing back up again as pages get migrated. Initially the number of
> hints the baseline kernel traps is extremely high and drops as pages
> migrate
>
> Most others were almost neutral with the impact of the series more
> obvious in some than others. is.C is really short-lived for example but
> locality of faults went from 43% to 95% local for example.
>
> sp.C was by far the most obvious impact
>
>                             5.6.0-rc3              5.6.0-rc3
>                               vanilla     schedcore-20200227
> Min       sp.C      141.52 (   0.00%)      173.61 ( -22.68%)
> Amean     sp.C      141.87 (   0.00%)      174.00 * -22.65%*
> Stddev    sp.C        0.26 (   0.00%)        0.25 (   5.06%)
> CoeffVar  sp.C        0.18 (   0.00%)        0.14 (  22.59%)
> Max       sp.C      142.10 (   0.00%)      174.25 ( -22.62%)
> BAmean-50 sp.C      141.59 (   0.00%)      173.79 ( -22.74%)
> BAmean-95 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
> BAmean-99 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
>
> That's a big hit in terms of performance and it looks less
> variable. Looking at the NUMA stats
>
> Ops NUMA alloc hit                   3100836.00     2161667.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                  915700.00       98531.00
> Ops NUMA base-page range updates    12178032.00    13483382.00
> Ops NUMA PTE updates                11809904.00    12792182.00
> Ops NUMA PMD updates                     719.00        1350.00
> Ops NUMA hint faults                11791912.00    12782512.00
> Ops NUMA hint local faults %         9345987.00    11467427.00
> Ops NUMA hint local percent               79.26          89.71
> Ops NUMA pages migrated               871805.00       21505.00
> Ops AutoNUMA cost                      59061.37       64007.35
>
> Note the locality -- 79.26% to 89.71% but the vanilla kernel migrated 871K
> pages and the new kernel migrates 21K. Looking at migrations over time,
> I can see that the vanilla kernel migrates 180K pages in the first 10
> seconds of each iteration while tip/sched/core migrated few enough that
> it's not even clear on the graph. The workload was long-lived enough that
> the initial disruption was less visible when running for long enough.
>
> The problem is that there is nothing unique that the kernel measures that
> I can think of that uniquely identifies that SP should spread wide and
> migrate early to move its shared pages from other processes that are less
> memory bound or communicating heavily. The state is simply not maintained
> and it cannot be inferred from the runqueue or task state. From both a
> locality point of view and available CPUs, leaving SP alone makes sense
> but we do not detect that memory bandwidth is an issue. In other cases, the
> cost of migrations alone would damage performance and SP is an exception as
> it's long-lived enough to benefit once the first few seconds have passed.
>
> I experimented with a few different approaches but without being able to
> detect the bandwidth, it was a case that SP can be improved but almost
> everything else suffers. For example, SP on 2-socket degrades when spread
> too quickly on machines with enough memory bandwidth so with tip/sched/core
> SP either benefits or suffers depending on the machine. Basic communicating
> tasks degrade 4-8% depending on the machine and exact workload when moving
> back to the vanilla kernel and that is fairly universal AFAIS.
>
> So I think that the new behaviour generally is more sane -- do not
> excessively fight between memory and CPU balancing but if there are
> suggestions on how to distinguish between tasks that should spread wide
> and evenly regardless of initial memory locality then I'm all ears.
> I do not think migrating like crazy hoping it happens to work out and
> having CPU Load and NUMA Balancing using very different criteria for
> evaluation is a better approach.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]       ` <CAE4VaGA4q4_qfC5qe3zaLRfiJhvMaSb2WADgOcQeTwmPvNat+A@mail.gmail.com>
@ 2020-03-12 15:56         ` Mel Gorman
  2020-03-12 17:06           ` Jirka Hladky
       [not found]           ` <CAE4VaGD8DUEi6JnKd8vrqUL_8HZXnNyHMoK2D+1-F5wo+5Z53Q@mail.gmail.com>
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-03-12 15:56 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

On Thu, Mar 12, 2020 at 01:10:36PM +0100, Jirka Hladky wrote:
> Hi Mel,
> 
> thanks a lot for analyzing it!
> 
> My big concern is that the performance drop for low threads counts (roughly
> up to 2x number of NUMA nodes) is not just a rare corner case, but it might
> be more common.

That is hard to tell. In my case I was seeing the problem running an HPC
workload, dedicated to that machine and using only 6% of available CPUs. I
find it unlikely that is common because who acquires such a large machine
and then uses a tiny percentage of it? Remember also the other points I made
such as 1M migrations happening in the first few seconds just trying to
deal with the load balancer and NUMA balancer fighting each other. While
that might happen to be good for SP, it's not universally good behaviour
and I've dealt with issues in the past whereby the NUMA balancer would
get confused and just ramp up the frequency it samples and migrates trying
to override the load balancer.

> We see the drop for the following benchmarks/tests,
> especially on 8 NUMA nodes servers. However, four and even 2 NUMA node
> servers are affected as well.
> 
> Numbers show average performance drop (median of runtime collected from 5
> subsequential runs) compared to vanilla kernel.
> 
> 2x AMD 7351 (EPYC Naples), 8 NUMA nodes
> ===================================
> NAS: sp_C test: -50%, peak perf. drop with 8 threads

I hadn't tested 8 threads specifically; I think that works out as
using 12.5% of the available machine. The allowed imbalance between
nodes means that some SP instances will stack on the same node but not
the same CPU.

> NAS: mg_D: -10% with 16 threads

While I do not have the most up-to-date figures available, I found the
opposite trend at 21 threads (the test cases I used were 1/3rd of the
available CPUs and as close to the maximum CPUs as possible). There I
found it was 10% faster on an 8-node machine.

For 4 nodes, using a single JVM was performance neutral *on average* but
much less variable. With one JVM per node, there was a mix of small <2%
regressions for some thread counts and up to 9% faster on others.

> SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
> SPECjvm2008: compress and cr_signverify tests: -10%(peak drop with 8
> threads)

I didn't run SPECjvm for multiple thread sizes so I don't have data yet
and may not have for another day at least.

> SPECjbb2005: -10% for 16 threads
> 

I found this to depend on the number of JVMs used and the thread count.
Slower at low thread counts, faster at higher thread counts but with
more stable results with the series applied and less NUMA balancer
activity.

This is somewhat of a dilemma. Without the series, the load balancer and
NUMA balancer use very different criteria on what should happen and
results are not stable. In some cases acting randomly happens to work
out and in others it does not. It's very highly dependent on both the
workload and the machine, and it's a question of whether we want to
continue dealing with two parts of the scheduler disagreeing on what
criteria to use or try to improve the reconciled load and memory
balancer sharing similar logic.

In *general*, I found that the series won a lot more than it lost across
a spread of workloads and machines but unfortunately it's also an area
where counter-examples can be found.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-12 15:56         ` Mel Gorman
@ 2020-03-12 17:06           ` Jirka Hladky
       [not found]           ` <CAE4VaGD8DUEi6JnKd8vrqUL_8HZXnNyHMoK2D+1-F5wo+5Z53Q@mail.gmail.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-12 17:06 UTC (permalink / raw)
  To: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

> find it unlikely that is common because who acquires such a large machine
> and then uses a tiny percentage of it.

I generally agree, but I also want to make a point that AMD made these
large systems much more affordable with their EPYC CPUs. The 8 NUMA
node server we are using costs under $8k.

>
> This is somewhat of a dilemma. Without the series, the load balancer and
> NUMA balancer use very different criteria on what should happen and
> results are not stable.

Unfortunately, I also see instabilities with the series. This is again
the sp_C test with 8 threads, executed on a dual-socket AMD 7351
(EPYC Naples) server with 8 NUMA nodes. With the series applied, the
runtime varies from 86 to 165 seconds! Could we do something about it?
A runtime of 86 seconds would be acceptable. If we could stabilize this
case and get a consistent runtime of around 80 seconds, the problem
would be gone.

Do you see a similar instability of results on your HW for sp_C with
low thread counts?

Runtime with this series applied:
 $ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =       125.73
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =        87.54
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =        86.93
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =       165.98
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =       114.78

For comparison, here are vanilla kernel results:
$ grep "Time in seconds" *log
sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =        59.83
sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =        67.72
sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =        63.62
sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =        55.01
sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =        65.20

> In *general*, I found that the series won a lot more than it lost across
> a spread of workloads and machines but unfortunately it's also an area
> where counter-examples can be found.

OK, fair enough. I understand that there will always be trade-offs
when making changes to the scheduler like this. And I agree that cases
with higher system load (where the series is helpful) outweigh the
performance drops for low thread counts. I was hoping that it would
be possible to improve the low-thread-count results while keeping the
gains for other scenarios :-)  But let's be realistic - I would be
happy to fix the extreme case mentioned above. The other issues, where
the performance drop is about 20%, are OK with me and are outweighed by
the gains in different scenarios.

Thanks again for looking into it. I know that covering all cases is
hard. I very much appreciate what you do!

Jirka


On Thu, Mar 12, 2020 at 4:56 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Mar 12, 2020 at 01:10:36PM +0100, Jirka Hladky wrote:
> > Hi Mel,
> >
> > thanks a lot for analyzing it!
> >
> > My big concern is that the performance drop for low threads counts (roughly
> > up to 2x number of NUMA nodes) is not just a rare corner case, but it might
> > be more common.
>
> That is hard to tell. In my case I was seeing the problem running a HPC
> workload, dedicated to that machine and using only 6% of available CPUs. I
> find it unlikely that is common because who acquires such a large machine
> and then uses a tiny percentage of it. Rember also the other points I made
> such as 1M migrations happening in the first few seconds just trying to
> deal with the load balancer and NUMA balancer fighting each other. While
> that might happen to be good for SP, it's not universally good behaviour
> and I've dealt with issues in the past whereby the NUMA balancer would
> get confused and just ramp up the frequency it samples and migrates trying
> to override the load balancer.
>
> > We see the drop for the following benchmarks/tests,
> > especially on 8 NUMA nodes servers. However, four and even 2 NUMA node
> > servers are affected as well.
> >
> > Numbers show average performance drop (median of runtime collected from 5
> > subsequential runs) compared to vanilla kernel.
> >
> > 2x AMD 7351 (EPYC Naples), 8 NUMA nodes
> > ===================================
> > NAS: sp_C test: -50%, peak perf. drop with 8 threads
>
> I hadn't tested 8 threads specifically I think that works out as
> using 12.5% of the available machine. The allowed imbalance between
> nodes means that some SP instances will stack on the same node but not
> the same CPU.
>
> > NAS: mg_D: -10% with 16 threads
>
> While I do not have the most up to date figures available, I found the
> opposite trend at 21 threads (the test case I used were 1/3rd available
> CPUs and as close to maximum CPUs as possible). There I found it was 10%
> faster for an 8 node machine.
>
> For 4 nodes, using a single JVM was performance neutral *on average* but
> much less variable. With one JVM per node, there was a mix of small <2%
> regressions for some thread counts and up to 9% faster on others.
>
> > SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
> > SPECjvm2008: compress and cr_signverify tests: -10%(peak drop with 8
> > threads)
>
> I didn't run SPECjvm for multiple thread sizes so I don't have data yet
> and may not have for another day at least.
>
> > SPECjbb2005: -10% for 16 threads
> >
>
> I found this to depend in the number of JVMs used and the thread count.
> Slower at low thread counts, faster at higher thread counts but with
> more stable results with the series applied and less NUMA balancer
> activity.
>
> This is somewhat of a dilemma. Without the series, the load balancer and
> NUMA balancer use very different criteria on what should happen and
> results are not stable. In some cases acting randomly happens to work
> out and in others it does not and overall it depends on the workload and
> machine. It's very highly dependent on both the workload and the machine
> and it's a question if we want to continue dealing with two parts of the
> scheduler disagreeing on what criteria to use or try to improve the
> reconciled load and memory balancer sharing similar logic.
>
> In *general*, I found that the series won a lot more than it lost across
> a spread of workloads and machines but unfortunately it's also an area
> where counter-examples can be found.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]           ` <CAE4VaGD8DUEi6JnKd8vrqUL_8HZXnNyHMoK2D+1-F5wo+5Z53Q@mail.gmail.com>
@ 2020-03-12 21:47             ` Mel Gorman
  2020-03-12 22:24               ` Jirka Hladky
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-03-12 21:47 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

On Thu, Mar 12, 2020 at 05:54:29PM +0100, Jirka Hladky wrote:
> >
> > find it unlikely that is common because who acquires such a large machine
> > and then uses a tiny percentage of it.
> 
> 
> I generally agree, but I also want to make a point that AMD made these
> large systems much more affordable with their EPYC CPUs. The 8 NUMA node
> server we are using costs under $8k.
> 
> 
> 
> > This is somewhat of a dilemma. Without the series, the load balancer and
> > NUMA balancer use very different criteria on what should happen and
> > results are not stable.
> 
> 
> Unfortunately, I see instabilities also for the series. This is again for
> the sp_C test with 8 threads executed on dual-socket AMD 7351 (EPYC Naples)
> server with 8 NUMA nodes. With the series applied, the runtime varies from
> 86 to 165 seconds! Could we do something about it? The runtime of 86
> seconds would be acceptable. If we could stabilize this case and get
> consistent runtime around 80 seconds, the problem would be gone.
> 
> Do you experience the similar instability of results on your HW for sp_C
> with low thread counts?
> 

I saw something similar, but observed that it depended on whether the
worker tasks got spread wide or not, which partially came down to luck.
The question is whether it's possible to pick a point where we spread
wide and can still recover quickly enough when tasks need to remain
close, without knowledge of the future. Putting a balancing limit on
tasks that recently woke would be one option, but that could also cause
persistently improper balancing for tasks that wake frequently.
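
To make the idea concrete, a minimal sketch is below. It is purely
illustrative and untested: "last_wakeup_ns" is an assumed new field that
would have to be stamped in the wakeup path (it does not exist today),
and the 1ms window is an arbitrary placeholder rather than a tuned value.

/* Illustrative only: skip load-balance migration of recently-woken tasks */
static inline bool recently_woken(struct rq *src_rq, struct task_struct *p)
{
	return rq_clock_task(src_rq) - p->last_wakeup_ns < NSEC_PER_MSEC;
}

which can_migrate_task() could then consult along these lines:

	/* Hypothetical check, see the caveats above */
	if (recently_woken(env->src_rq, p))
		return 0;

The second failure mode mentioned above would then show up as tasks that
wake more often than the window expires never becoming eligible for
migration.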

> Runtime with this series applied:
>  $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
>   125.73
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
>    87.54
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
>    86.93
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
>   165.98
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
>   114.78
> 
> For comparison, here are vanilla kernel results:
> $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
>    59.83
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
>    67.72
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
>    63.62
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
>    55.01
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
>    65.20
> 
> 
> 
> > In *general*, I found that the series won a lot more than it lost across
> > a spread of workloads and machines but unfortunately it's also an area
> > where counter-examples can be found.
> 
> 
> OK, fair enough. I understand that there will always be trade-offs when
> making changes to scheduler like this. And I agree that cases with higher
> system load (where is series is helpful) outweigh the performance drops for
> low threads counts. I was hoping that it would be possible to improve the
> small threads results while keeping the gains for other scenarios:-)  But
> let's be realistic - I would be happy to fix the extreme case mentioned
> above. The other issues where performance drop is about 20% are OK with me
> and are outweighed by the gains for different scenarios.
> 

I'll continue thinking about it but whatever chance there is of
improving it while keeping CPU balancing, NUMA balancing and wake affine
consistent with each other, I think there is no chance with the
inconsistent logic used in the vanilla code :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-12 21:47             ` Mel Gorman
@ 2020-03-12 22:24               ` Jirka Hladky
  2020-03-20 15:08                 ` Jirka Hladky
       [not found]                 ` <CAE4VaGC09OfU2zXeq2yp_N0zXMbTku5ETz0KEocGi-RSiKXv-w@mail.gmail.com>
  0 siblings, 2 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-12 22:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

> I'll continue thinking about it but whatever chance there is of
> improving it while keeping CPU balancing, NUMA balancing and wake affine
> consistent with each other, I think there is no chance with the
> inconsistent logic used in the vanilla code :(

Thank you, Mel!


On Thu, Mar 12, 2020 at 10:47 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Mar 12, 2020 at 05:54:29PM +0100, Jirka Hladky wrote:
> > >
> > > find it unlikely that is common because who acquires such a large machine
> > > and then uses a tiny percentage of it.
> >
> >
> > I generally agree, but I also want to make a point that AMD made these
> > large systems much more affordable with their EPYC CPUs. The 8 NUMA node
> > server we are using costs under $8k.
> >
> >
> >
> > > This is somewhat of a dilemma. Without the series, the load balancer and
> > > NUMA balancer use very different criteria on what should happen and
> > > results are not stable.
> >
> >
> > Unfortunately, I see instabilities also for the series. This is again for
> > the sp_C test with 8 threads executed on dual-socket AMD 7351 (EPYC Naples)
> > server with 8 NUMA nodes. With the series applied, the runtime varies from
> > 86 to 165 seconds! Could we do something about it? The runtime of 86
> > seconds would be acceptable. If we could stabilize this case and get
> > consistent runtime around 80 seconds, the problem would be gone.
> >
> > Do you experience the similar instability of results on your HW for sp_C
> > with low thread counts?
> >
>
> I saw something similar but observed that it depended on whether the
> worker tasks got spread wide or not which partially came down to luck.
> The question is if it's possible to pick a point where we spread wide
> and can recover quickly enough when tasks need to remain close without
> knowledge of the future. Putting a balancing limit on tasks that
> recently woke would be one option but that could also cause persistent
> improper balancing for tasks that wake frequently.
>
> > Runtime with this series applied:
> >  $ grep "Time in seconds" *log
> > sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
> >   125.73
> > sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
> >    87.54
> > sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
> >    86.93
> > sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
> >   165.98
> > sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
> >   114.78
> >
> > For comparison, here are vanilla kernel results:
> > $ grep "Time in seconds" *log
> > sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
> >    59.83
> > sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
> >    67.72
> > sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
> >    63.62
> > sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
> >    55.01
> > sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
> >    65.20
> >
> >
> >
> > > In *general*, I found that the series won a lot more than it lost across
> > > a spread of workloads and machines but unfortunately it's also an area
> > > where counter-examples can be found.
> >
> >
> > OK, fair enough. I understand that there will always be trade-offs when
> > making changes to scheduler like this. And I agree that cases with higher
> > system load (where is series is helpful) outweigh the performance drops for
> > low threads counts. I was hoping that it would be possible to improve the
> > small threads results while keeping the gains for other scenarios:-)  But
> > let's be realistic - I would be happy to fix the extreme case mentioned
> > above. The other issues where performance drop is about 20% are OK with me
> > and are outweighed by the gains for different scenarios.
> >
>
> I'll continue thinking about it but whatever chance there is of
> improving it while keeping CPU balancing, NUMA balancing and wake affine
> consistent with each other, I think there is no chance with the
> inconsistent logic used in the vanilla code :(
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-12 22:24               ` Jirka Hladky
@ 2020-03-20 15:08                 ` Jirka Hladky
       [not found]                 ` <CAE4VaGC09OfU2zXeq2yp_N0zXMbTku5ETz0KEocGi-RSiKXv-w@mail.gmail.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-20 15:08 UTC (permalink / raw)
  To: linux-kernel

Hi Mel,

just a quick update. I have increased the testing coverage, and other
tests from NAS show a big performance drop for low thread counts as
well:

sp_C_x - still shows the biggest drop, up to 50%
bt_C_x - performance drop of up to 40%
ua_C_x - performance drop of up to 30%

My point is that the performance drop for low thread counts is more
common than we initially thought.

Let me know if you need more data.

Thanks!
Jirka


On Thu, Mar 12, 2020 at 11:24 PM Jirka Hladky <jhladky@redhat.com> wrote:
>
> > I'll continue thinking about it but whatever chance there is of
> > improving it while keeping CPU balancing, NUMA balancing and wake affine
> > consistent with each other, I think there is no chance with the
> > inconsistent logic used in the vanilla code :(
>
> Thank you, Mel!
>
>
> On Thu, Mar 12, 2020 at 10:47 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Mar 12, 2020 at 05:54:29PM +0100, Jirka Hladky wrote:
> > > >
> > > > find it unlikely that is common because who acquires such a large machine
> > > > and then uses a tiny percentage of it.
> > >
> > >
> > > I generally agree, but I also want to make a point that AMD made these
> > > large systems much more affordable with their EPYC CPUs. The 8 NUMA node
> > > server we are using costs under $8k.
> > >
> > >
> > >
> > > > This is somewhat of a dilemma. Without the series, the load balancer and
> > > > NUMA balancer use very different criteria on what should happen and
> > > > results are not stable.
> > >
> > >
> > > Unfortunately, I see instabilities also for the series. This is again for
> > > the sp_C test with 8 threads executed on dual-socket AMD 7351 (EPYC Naples)
> > > server with 8 NUMA nodes. With the series applied, the runtime varies from
> > > 86 to 165 seconds! Could we do something about it? The runtime of 86
> > > seconds would be acceptable. If we could stabilize this case and get
> > > consistent runtime around 80 seconds, the problem would be gone.
> > >
> > > Do you experience the similar instability of results on your HW for sp_C
> > > with low thread counts?
> > >
> >
> > I saw something similar but observed that it depended on whether the
> > worker tasks got spread wide or not which partially came down to luck.
> > The question is if it's possible to pick a point where we spread wide
> > and can recover quickly enough when tasks need to remain close without
> > knowledge of the future. Putting a balancing limit on tasks that
> > recently woke would be one option but that could also cause persistent
> > improper balancing for tasks that wake frequently.
> >
> > > Runtime with this series applied:
> > >  $ grep "Time in seconds" *log
> > > sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
> > >   125.73
> > > sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
> > >    87.54
> > > sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
> > >    86.93
> > > sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
> > >   165.98
> > > sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
> > >   114.78
> > >
> > > For comparison, here are vanilla kernel results:
> > > $ grep "Time in seconds" *log
> > > sp.C.x.defaultRun.008threads.loop01.log: Time in seconds =
> > >    59.83
> > > sp.C.x.defaultRun.008threads.loop02.log: Time in seconds =
> > >    67.72
> > > sp.C.x.defaultRun.008threads.loop03.log: Time in seconds =
> > >    63.62
> > > sp.C.x.defaultRun.008threads.loop04.log: Time in seconds =
> > >    55.01
> > > sp.C.x.defaultRun.008threads.loop05.log: Time in seconds =
> > >    65.20
> > >
> > >
> > >
> > > > In *general*, I found that the series won a lot more than it lost across
> > > > a spread of workloads and machines but unfortunately it's also an area
> > > > where counter-examples can be found.
> > >
> > >
> > > OK, fair enough. I understand that there will always be trade-offs when
> > > making changes to scheduler like this. And I agree that cases with higher
> > > system load (where is series is helpful) outweigh the performance drops for
> > > low threads counts. I was hoping that it would be possible to improve the
> > > small threads results while keeping the gains for other scenarios:-)  But
> > > let's be realistic - I would be happy to fix the extreme case mentioned
> > > above. The other issues where performance drop is about 20% are OK with me
> > > and are outweighed by the gains for different scenarios.
> > >
> >
> > I'll continue thinking about it but whatever chance there is of
> > improving it while keeping CPU balancing, NUMA balancing and wake affine
> > consistent with each other, I think there is no chance with the
> > inconsistent logic used in the vanilla code :(
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
>
>
> --
> -Jirka



-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                 ` <CAE4VaGC09OfU2zXeq2yp_N0zXMbTku5ETz0KEocGi-RSiKXv-w@mail.gmail.com>
@ 2020-03-20 15:22                   ` Mel Gorman
  2020-03-20 15:33                     ` Jirka Hladky
       [not found]                     ` <CAE4VaGBGbTT8dqNyLWAwuiqL8E+3p1_SqP6XTTV71wNZMjc9Zg@mail.gmail.com>
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-03-20 15:22 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

On Fri, Mar 20, 2020 at 03:37:44PM +0100, Jirka Hladky wrote:
> Hi Mel,
> 
> just a quick update. I have increased the testing coverage and other tests
> from the NAS shows a big performance drop for the low number of threads as
> well:
> 
> sp_C_x - show still the biggest drop upto 50%
> bt_C_x - performance drop upto 40%
> ua_C_x - performance drop upto 30%
> 

MPI or OMP, and what is a low thread count? For MPI at least, I saw a 0.4%
gain on a 4-node machine for bt_C and a 3.88% regression on 8 nodes. I
think it must be OMP you are using, because I found I had to disable UA
for MPI at some point in the past for reasons I no longer remember.

> My point is that the performance drop for the low number of threads is more
> common than we have initially thought.
> 
> Let me know what you need more data.
> 

I just wanted a clarification on the thread count and a confirmation that
it's OMP. For MPI, I did note that some of the other NAS kernels showed a
slight dip, but it was nowhere near as severe as SP and the problem was
the same as before -- two or more tasks stayed on the same node without
spreading out because there was no pressure to do so. There was enough
CPU and memory capacity, and no obvious pattern that could be used to
spread the load wide early.

One possibility would be to always spread wide at clone time and assume
wake_affine will pull related tasks back together, but it's fragile
because it breaks if the cloned task execs and then allocates memory on a
remote node, only to migrate to a local node immediately.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-20 15:22                   ` Mel Gorman
@ 2020-03-20 15:33                     ` Jirka Hladky
       [not found]                     ` <CAE4VaGBGbTT8dqNyLWAwuiqL8E+3p1_SqP6XTTV71wNZMjc9Zg@mail.gmail.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-20 15:33 UTC (permalink / raw)
  To: linux-kernel

> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.

Yes, it's indeed OMP.  By a low thread count, I mean up to 2x the number
of NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA
node servers).

> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.

I think the only way to find out how it performs is to test it. If you
could prepare a patch like that, I'm more than happy to give it a try!

Jirka


On Fri, Mar 20, 2020 at 4:22 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Mar 20, 2020 at 03:37:44PM +0100, Jirka Hladky wrote:
> > Hi Mel,
> >
> > just a quick update. I have increased the testing coverage and other tests
> > from the NAS shows a big performance drop for the low number of threads as
> > well:
> >
> > sp_C_x - show still the biggest drop upto 50%
> > bt_C_x - performance drop upto 40%
> > ua_C_x - performance drop upto 30%
> >
>
> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.
>
> > My point is that the performance drop for the low number of threads is more
> > common than we have initially thought.
> >
> > Let me know what you need more data.
> >
>
> I just a clarification on the thread count and a confirmation it's OMP. For
> MPI, I did note that some of the other NAS kernels shows a slight dip but
> it was nowhere near as severe as SP and the problem was the same as more --
> two or more tasks stayed on the same node without spreading out because
> there was no pressure to do so. There was enough CPU and memory capacity
> with no obvious pattern that could be used to spread the load wide early.
>
> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                     ` <CAE4VaGBGbTT8dqNyLWAwuiqL8E+3p1_SqP6XTTV71wNZMjc9Zg@mail.gmail.com>
@ 2020-03-20 16:38                       ` Mel Gorman
  2020-03-20 17:21                         ` Jirka Hladky
  2020-05-07 15:24                         ` Jirka Hladky
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-03-20 16:38 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> >
> > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> > think it must be OMP you are using because I found I had to disable UA
> > for MPI at some point in the past for reasons I no longer remember.
> 
> 
> Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> servers).
> 

Ok, so we know it's within the imbalance threshold where a NUMA node can
be left idle.
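
For context, the check in question looked roughly like the below around
the time of this thread (kernel/sched/fair.c); this is from memory, so
treat it as a sketch rather than an exact quote. A source group running
only a pair of tasks is not considered imbalanced, which is why, at
thread counts up to roughly 2x the number of nodes, a whole node can end
up left idle.

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	unsigned int imbalance_min;

	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	imbalance_min = 2;
	if (src_nr_running <= imbalance_min)
		return 0;

	return imbalance;
}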

> One possibility would be to spread wide always at clone time and assume
> > wake_affine will pull related tasks but it's fragile because it breaks
> > if the cloned task execs and then allocates memory from a remote node
> > only to migrate to a local node immediately.
> 
> 
> I think the only way to find out how it performs is to test it. If you
> could prepare a patch like that, I'm more than happy to give it a try!
> 

When the initial spreading was prevented, it was for pipelines mainly --
even basic shell scripts. In that case it was observed that a shell would
fork/exec two tasks connected via a pipe that started on separate nodes
and had allocated remote data before being pulled close. The processes
were typically too short lived for NUMA balancing to fix it up, and by
exec time the information on where the fork happened was lost.  See
2c83362734da ("sched/fair: Consider SD_NUMA when selecting the most idle
group to schedule on"). The logic has probably been partially broken
since then because of how SD_NUMA is now treated, but the concern about
spreading wide prematurely remains.
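
For anyone unfamiliar with the case, the shape of that workload is
essentially the toy program below -- the C equivalent of
"cat /proc/self/maps | wc -l". It is only an illustration of the
pattern, not a benchmark: if the two children start on different nodes,
their pipe traffic and any data they allocate stays remote for their
whole (short) lifetime.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Toy reproduction of the pipeline pattern: a parent fork/execs two
 * short-lived tasks connected by a pipe, much as a shell does. Both
 * children typically exit before NUMA balancing could correct a poor
 * initial placement.
 */
int main(void)
{
	int fds[2];

	if (pipe(fds) == -1) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {			/* producer */
		dup2(fds[1], STDOUT_FILENO);
		close(fds[0]);
		close(fds[1]);
		execlp("cat", "cat", "/proc/self/maps", (char *)NULL);
		_exit(127);
	}

	if (fork() == 0) {			/* consumer */
		dup2(fds[0], STDIN_FILENO);
		close(fds[0]);
		close(fds[1]);
		execlp("wc", "wc", "-l", (char *)NULL);
		_exit(127);
	}

	close(fds[0]);
	close(fds[1]);
	wait(NULL);
	wait(NULL);
	return 0;
}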

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-20 16:38                       ` Mel Gorman
@ 2020-03-20 17:21                         ` Jirka Hladky
  2020-05-07 15:24                         ` Jirka Hladky
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-03-20 17:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML

> When the initial spreading was prevented, it was for pipelines mainly --
> even basic shell scripts. In that case it was observed that a shell would
> fork/exec two tasks connected via pipe that started on separate nodes and
> had allocated remote data before being pulled close. The processes were
> typically too short lived for NUMA balancing to fix it up by exec time
> the information on where the fork happened was lost.  See 2c83362734da
> ("sched/fair: Consider SD_NUMA when selecting the most idle group to
> schedule on"). Now the logic has probably been partially broken since
> because of how SD_NUMA is now treated but the concern about spreading
> wide prematurely remains.

I understand. It's a hard one - let's keep an eye on it. We will
continue to test the upstream kernels and have some discussions
internally. Let's see if anybody has an idea of how to treat these
special cases.

Enjoy the weekend!
Jirka


On Fri, Mar 20, 2020 at 5:38 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> > >
> > > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > > gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> > > think it must be OMP you are using because I found I had to disable UA
> > > for MPI at some point in the past for reasons I no longer remember.
> >
> >
> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
> >
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.
>
> > One possibility would be to spread wide always at clone time and assume
> > > wake_affine will pull related tasks but it's fragile because it breaks
> > > if the cloned task execs and then allocates memory from a remote node
> > > only to migrate to a local node immediately.
> >
> >
> > I think the only way to find out how it performs is to test it. If you
> > could prepare a patch like that, I'm more than happy to give it a try!
> >
>
> When the initial spreading was prevented, it was for pipelines mainly --
> even basic shell scripts. In that case it was observed that a shell would
> fork/exec two tasks connected via pipe that started on separate nodes and
> had allocated remote data before being pulled close. The processes were
> typically too short lived for NUMA balancing to fix it up by exec time
> the information on where the fork happened was lost.  See 2c83362734da
> ("sched/fair: Consider SD_NUMA when selecting the most idle group to
> schedule on"). Now the logic has probably been partially broken since
> because of how SD_NUMA is now treated but the concern about spreading
> wide prematurely remains.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-03-20 16:38                       ` Mel Gorman
  2020-03-20 17:21                         ` Jirka Hladky
@ 2020-05-07 15:24                         ` Jirka Hladky
  2020-05-07 15:54                           ` Mel Gorman
  1 sibling, 1 reply; 86+ messages in thread
From: Jirka Hladky @ 2020-05-07 15:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Hi Mel,

> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.

my colleagues and I discussed today the performance drop for some
workloads at low thread counts (roughly up to 2x the number of NUMA
nodes). We are worried that it can be a severe issue for some use cases
which require full memory bandwidth even when only part of the CPUs is
used.

We understand that the scheduler cannot distinguish this type of
workload from others automatically. However, there was an idea for a
*new kernel tunable to control the imbalance threshold*. Users could
set this tunable based on the purpose of the server. See the tuned
project, which allows creating performance profiles [1].

What do you think about this approach?

Thanks a lot!
Jirka

[1] https://tuned-project.org


On Fri, Mar 20, 2020 at 5:38 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> > >
> > > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > > gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> > > think it must be OMP you are using because I found I had to disable UA
> > > for MPI at some point in the past for reasons I no longer remember.
> >
> >
> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
> >
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.
>
> > One possibility would be to spread wide always at clone time and assume
> > > wake_affine will pull related tasks but it's fragile because it breaks
> > > if the cloned task execs and then allocates memory from a remote node
> > > only to migrate to a local node immediately.
> >
> >
> > I think the only way to find out how it performs is to test it. If you
> > could prepare a patch like that, I'm more than happy to give it a try!
> >
>
> When the initial spreading was prevented, it was for pipelines mainly --
> even basic shell scripts. In that case it was observed that a shell would
> fork/exec two tasks connected via pipe that started on separate nodes and
> had allocated remote data before being pulled close. The processes were
> typically too short lived for NUMA balancing to fix it up by exec time
> the information on where the fork happened was lost.  See 2c83362734da
> ("sched/fair: Consider SD_NUMA when selecting the most idle group to
> schedule on"). Now the logic has probably been partially broken since
> because of how SD_NUMA is now treated but the concern about spreading
> wide prematurely remains.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-07 15:24                         ` Jirka Hladky
@ 2020-05-07 15:54                           ` Mel Gorman
  2020-05-07 16:29                             ` Jirka Hladky
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-07 15:54 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > servers).
> >
> > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > be left idle.
> 
> we have discussed today with my colleagues the performance drop for
> some workloads for low threads counts (roughly up to 2x number of NUMA
> nodes). We are worried that it can be a severe issue for some use
> cases, which require a full memory bandwidth even when only part of
> CPUs is used.
> 
> We understand that scheduler cannot distinguish this type of workload
> from others automatically. However, there was an idea for a * new
> kernel tunable to control the imbalance threshold *. Based on the
> purpose of the server, users could set this tunable. See the tuned
> project, which allows creating performance profiles [1].
> 

I'm not completely opposed to it, but given that the setting is global,
I imagine it could have other consequences if two applications that run
at different times have different requirements. Given that it's OMP,
I would have imagined that an application that really cared about this
would specify what was needed using OMP_PLACES. Why would someone prefer
kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
specific knowledge of the application even to know that a particular
tuned profile is needed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-07 15:54                           ` Mel Gorman
@ 2020-05-07 16:29                             ` Jirka Hladky
  2020-05-07 17:49                               ` Phil Auld
  2020-05-08  9:22                               ` Mel Gorman
  0 siblings, 2 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-07 16:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Hi Mel,

we are not targeting just OMP applications. We also see the performance
degradation for other workloads, like SPECjbb2005 and SPECjvm2008. Even
worse, it also affects higher thread counts. For example, comparing
5.7.0-0.rc2 against the 5.6 kernel on a 4 NUMA node server with 2x AMD
7351 CPUs, we see a performance degradation of 22% for 32 threads (the
system has 64 CPUs in total). We observe this degradation only when we
run a single SPECjbb binary. When running 4 SPECjbb binaries in
parallel, there is no change in performance between 5.6 and 5.7.

That's why we are asking for the kernel tunable, which we would add to
the tuned profile. We don't expect users to change this frequently, but
rather to set the performance profile once based on the purpose of the
server.

If you could prepare a patch for us, we would be more than happy to
test it extensively. Based on the results, we can then evaluate if
it's the way to go. Thoughts?

Thanks a lot!
Jirka

On Thu, May 7, 2020 at 5:54 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > servers).
> > >
> > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > be left idle.
> >
> > we have discussed today with my colleagues the performance drop for
> > some workloads for low threads counts (roughly up to 2x number of NUMA
> > nodes). We are worried that it can be a severe issue for some use
> > cases, which require a full memory bandwidth even when only part of
> > CPUs is used.
> >
> > We understand that scheduler cannot distinguish this type of workload
> > from others automatically. However, there was an idea for a * new
> > kernel tunable to control the imbalance threshold *. Based on the
> > purpose of the server, users could set this tunable. See the tuned
> > project, which allows creating performance profiles [1].
> >
>
> I'm not completely opposed to it but given that the setting is global,
> I imagine it could have other consequences if two applications ran
> at different times have different requirements. Given that it's OMP,
> I would have imagined that an application that really cared about this
> would specify what was needed using OMP_PLACES. Why would someone prefer
> kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> specific knowledge of the application even to know that a particular
> tuned profile is needed.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-07 16:29                             ` Jirka Hladky
@ 2020-05-07 17:49                               ` Phil Auld
       [not found]                                 ` <20200508034741.13036-1-hdanton@sina.com>
  2020-05-08  9:22                               ` Mel Gorman
  1 sibling, 1 reply; 86+ messages in thread
From: Phil Auld @ 2020-05-07 17:49 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> Hi Mel,
> 
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
> 
> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
> 
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
>

I'm happy to spin up a patch once I'm sure exactly what the tuning would
affect. At an initial glance, I'm thinking it would be the imbalance_min
value which is currently hardcoded to 2. But there may be something else...
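
For the sake of discussion, a minimal sketch of that direction is below,
assuming the 5.7-era adjust_numa_imbalance() signature. The knob name
(sysctl_sched_numa_imbalance_min) and how it would be wired into the
sched sysctl table are assumptions, not an actual proposal; the default
keeps today's hardcoded 2, and setting it to 0 would disable the special
case entirely.

/* Sketch only: an assumed sysctl knob replacing the hardcoded value */
unsigned int sysctl_sched_numa_imbalance_min __read_mostly = 2;

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	if (src_nr_running <= sysctl_sched_numa_imbalance_min)
		return 0;

	return imbalance;
}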


Cheers,
Phil


> Thanks a lot!
> Jirka
> 
> On Thu, May 7, 2020 at 5:54 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > > servers).
> > > >
> > > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > > be left idle.
> > >
> > > we have discussed today with my colleagues the performance drop for
> > > some workloads for low threads counts (roughly up to 2x number of NUMA
> > > nodes). We are worried that it can be a severe issue for some use
> > > cases, which require a full memory bandwidth even when only part of
> > > CPUs is used.
> > >
> > > We understand that scheduler cannot distinguish this type of workload
> > > from others automatically. However, there was an idea for a * new
> > > kernel tunable to control the imbalance threshold *. Based on the
> > > purpose of the server, users could set this tunable. See the tuned
> > > project, which allows creating performance profiles [1].
> > >
> >
> > I'm not completely opposed to it but given that the setting is global,
> > I imagine it could have other consequences if two applications ran
> > at different times have different requirements. Given that it's OMP,
> > I would have imagined that an application that really cared about this
> > would specify what was needed using OMP_PLACES. Why would someone prefer
> > kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> > specific knowledge of the application even to know that a particular
> > tuned profile is needed.
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
> 
> 
> -- 
> -Jirka
> 

-- 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-07 16:29                             ` Jirka Hladky
  2020-05-07 17:49                               ` Phil Auld
@ 2020-05-08  9:22                               ` Mel Gorman
  2020-05-08 11:05                                 ` Jirka Hladky
       [not found]                                 ` <CAE4VaGC_v6On-YvqdTwAWu3Mq4ofiV0pLov-QpV+QHr_SJr+Rw@mail.gmail.com>
  1 sibling, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-08  9:22 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
> 

Minimally, I suggest confirming that it's really due to
adjust_numa_imbalance() by making the function a no-op and retesting.
I have found odd artifacts with it, but I'm unsure how to proceed without
causing problems elsewhere.

For example, netperf on localhost in some cases reported a regression
when the client and server were running on the same node. The problem
appears to be that netserver completes its work faster when running
locally and goes idle more regularly. The cost of going idle and waking
up builds up and a lower throughput is reported, but I'm not sure if
gaming an artifact like that is a good idea.

> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
> 
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
> 

I would suggest simply disabling that function first to ensure that is
really what is causing problems for you.
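
Concretely, that would be something like the below test hack (assuming
the current signature) -- a diagnostic aid only, not a proposed fix:

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	/* Test hack: pass the reported imbalance through unchanged */
	return imbalance;
}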

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-08  9:22                               ` Mel Gorman
@ 2020-05-08 11:05                                 ` Jirka Hladky
       [not found]                                 ` <CAE4VaGC_v6On-YvqdTwAWu3Mq4ofiV0pLov-QpV+QHr_SJr+Rw@mail.gmail.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-08 11:05 UTC (permalink / raw)
  To: linux-kernel

Hi Mel,

thanks for the hints! We will try it.

@Phil - could you please prepare a kernel build for me to test?

Thank you!
Jirka

On Fri, May 8, 2020 at 11:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we are not targeting just OMP applications. We see the performance
> > degradation also for other workloads, like SPECjbb2005 and
> > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > threads (the system has 64 CPUs in total). We observe this degradation
> > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > binaries in parallel, there is no change in performance between 5.6
> > and 5.7.
> >
>
> Minimally I suggest confirming that it's really due to
> adjust_numa_imbalance() by making the function a no-op and retesting.
> I have found odd artifacts with it but I'm unsure how to proceed without
> causing problems elsehwere.
>
> For example, netperf on localhost in some cases reported a regression
> when the client and server were running on the same node. The problem
> appears to be that netserver completes its work faster when running
> local and goes idle more regularly. The cost of going idle and waking up
> builds up and a lower throughput is reported but I'm not sure if gaming
> an artifact like that is a good idea.
>
> > That's why we are asking for the kernel tunable, which we would add to
> > the tuned profile. We don't expect users to change this frequently but
> > rather to set the performance profile once based on the purpose of the
> > server.
> >
> > If you could prepare a patch for us, we would be more than happy to
> > test it extensively. Based on the results, we can then evaluate if
> > it's the way to go. Thoughts?
> >
>
> I would suggest simply disabling that function first to ensure that is
> really what is causing problems for you.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                 ` <CAE4VaGC_v6On-YvqdTwAWu3Mq4ofiV0pLov-QpV+QHr_SJr+Rw@mail.gmail.com>
@ 2020-05-13 14:57                                   ` Jirka Hladky
  2020-05-13 15:30                                     ` Mel Gorman
  0 siblings, 1 reply; 86+ messages in thread
From: Jirka Hladky @ 2020-05-13 14:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Hi Mel,

we have tried a kernel with adjust_numa_imbalance() crippled to just
return the imbalance it's given.

It has solved all the performance problems I have reported.
Performance is the same as with the 5.6 kernel (before the patch was
applied).

* solved the performance drop of up to 20% with the single-instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
thread counts (10% for 16 threads and 20% for 32 threads)
* solved the performance drop for low-load scenarios (SPECjvm2008 and NAS)

Any suggestions on how to proceed? One approach is to turn
"imbalance_min" into a kernel tunable. Any other ideas?

https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914

Thanks a lot!
Jirka






On Fri, May 8, 2020 at 12:40 PM Jirka Hladky <jhladky@redhat.com> wrote:
>
> Hi Mel,
>
> thanks for hints! We will try it.
>
> @Phil - could you please prepare a kernel build for me to test?
>
> Thank you!
> Jirka
>
> On Fri, May 8, 2020 at 11:22 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>>
>> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
>> > Hi Mel,
>> >
>> > we are not targeting just OMP applications. We see the performance
>> > degradation also for other workloads, like SPECjbb2005 and
>> > SPECjvm2008. Even worse, it also affects a higher number of threads.
>> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
>> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
>> > threads (the system has 64 CPUs in total). We observe this degradation
>> > only when we run a single SPECjbb binary. When running 4 SPECjbb
>> > binaries in parallel, there is no change in performance between 5.6
>> > and 5.7.
>> >
>>
>> Minimally I suggest confirming that it's really due to
>> adjust_numa_imbalance() by making the function a no-op and retesting.
>> I have found odd artifacts with it but I'm unsure how to proceed without
>> causing problems elsehwere.
>>
>> For example, netperf on localhost in some cases reported a regression
>> when the client and server were running on the same node. The problem
>> appears to be that netserver completes its work faster when running
>> local and goes idle more regularly. The cost of going idle and waking up
>> builds up and a lower throughput is reported but I'm not sure if gaming
>> an artifact like that is a good idea.
>>
>> > That's why we are asking for the kernel tunable, which we would add to
>> > the tuned profile. We don't expect users to change this frequently but
>> > rather to set the performance profile once based on the purpose of the
>> > server.
>> >
>> > If you could prepare a patch for us, we would be more than happy to
>> > test it extensively. Based on the results, we can then evaluate if
>> > it's the way to go. Thoughts?
>> >
>>
>> I would suggest simply disabling that function first to ensure that is
>> really what is causing problems for you.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>
>
> --
> -Jirka



-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-13 14:57                                   ` Jirka Hladky
@ 2020-05-13 15:30                                     ` Mel Gorman
  2020-05-13 16:20                                       ` Jirka Hladky
                                                         ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-13 15:30 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have tried the kernel with adjust_numa_imbalance() crippled to just
> return the imbalance it's given.
> 
> It has solved all the performance problems I have reported.
> Performance is the same as with 5.6 kernel (before the patch was
> applied).
> 
> * solved the performance drop upto 20%  with single instance
> SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> Rome systems) => this performance drop was INCREASING with higher
> threads counts (10% for 16 threads and 20 % for 32 threads)
> * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> 
> Any suggestions on how to proceed? One approach is to turn
> "imbalance_min" into the kernel tunable. Any other ideas?
> 
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> 

Complete shot in the dark but restore adjust_numa_imbalance() and try
this

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..0b31f4468d5b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+	if (sched_feat(TTWU_QUEUE)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		ttwu_queue_remote(p, cpu, wake_flags);
 		return;

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-13 15:30                                     ` Mel Gorman
@ 2020-05-13 16:20                                       ` Jirka Hladky
  2020-05-14  9:50                                         ` Mel Gorman
  2020-05-14 15:31                                       ` Peter Zijlstra
  2020-05-15 14:43                                       ` Jirka Hladky
  2 siblings, 1 reply; 86+ messages in thread
From: Jirka Hladky @ 2020-05-13 16:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Thank you, Mel!

I think I have to make sure we cover the scenario you have targeted
when developing adjust_numa_imbalance:

=======================================================================
https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910

/*
* Allow a small imbalance based on a simple pair of communicating
* tasks that remain local when the source domain is almost idle.
*/
=======================================================================

Could you point me to a benchmark for this scenario? I have checked
https://github.com/gormanm/mmtests
and we use lots of the same benchmarks but I'm not sure if we cover
this particular scenario.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with the single-instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > thread counts (10% for 16 threads and 20% for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>         struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +       if (sched_feat(TTWU_QUEUE)) {
>                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>                 ttwu_queue_remote(p, cpu, wake_flags);
>                 return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-13 16:20                                       ` Jirka Hladky
@ 2020-05-14  9:50                                         ` Mel Gorman
       [not found]                                           ` <CAE4VaGCGUFOAZ+YHDnmeJ95o4W0j04Yb7EWnf8a43caUQs_WuQ@mail.gmail.com>
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-14  9:50 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Wed, May 13, 2020 at 06:20:53PM +0200, Jirka Hladky wrote:
> Thank you, Mel!
> 
> I think I have to make sure we cover the scenario you have targeted
> when developing adjust_numa_imbalance:
> 
> =======================================================================
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910
> 
> /*
> * Allow a small imbalance based on a simple pair of communicating
> * tasks that remain local when the source domain is almost idle.
> */
> =======================================================================
> 
> Could you point me to a benchmark for this scenario? I have checked
> https://github.com/gormanm/mmtests
> and we use lots of the same benchmarks but I'm not sure if we cover
> this particular scenario.
> 

The NUMA imbalance part showed up as part of the general effort to
reconcile NUMA balancing with load balancing. It's been known for years
that the two balancers disagreed to the extent that NUMA balancing retries
migrations multiple times just to keep things local, leading to excessive
migrations. The full battery of tests that was used when I was trying
to reconcile the balancers and later working on Vincent's version is
as follows:

scheduler-unbound
scheduler-forkintensive
scheduler-perfpipe
scheduler-perfpipe-cpufreq
scheduler-schbench
db-pgbench-timed-ro-small-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-mpi-half-xfs
hpc-nas-c-class-omp-full
hpc-nas-c-class-omp-half
hpc-nas-d-class-mpi-full-xfs
hpc-nas-d-class-mpi-half-xfs
hpc-nas-d-class-omp-full
hpc-nas-d-class-omp-half
io-dbench4-async-ext4
io-dbench4-async-xfs
jvm-specjbb2005-multi
jvm-specjbb2005-single
network-netperf-cstate
network-netperf-rr-cstate
network-netperf-rr-unbound
network-netperf-unbound
network-tbench
numa-autonumabench
workload-kerndevel-xfs
workload-shellscripts-xfs

Where there is -ext4 or -xfs, just remove the filesystem suffix to get the
base configuration, i.e. the base configuration for io-dbench4-async-ext4 is
io-dbench4-async. Both filesystems are sometimes tested because they
interact differently with the scheduler due to ext4 using a journal
thread and xfs using workqueues.

The imbalance one is most obvious with network-netperf-unbound running
on localhost. If the client/server are on separate nodes, it's obvious
from mpstat that two nodes are busy and it's migrating quite a bit. The
second effect is that NUMA balancing is active, trapping hinting faults
and migrating pages.

The biggest problem I have right now is that the wakeup path between tasks
that are local is slower than doing a remote wakeup via wake_list that
potentially sends an IPI, which is ridiculous. The slower wakeup manifests
as a loss of throughput for netperf even though all the accesses are
local. At least that's what I'm looking at whenever I get the chance.
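
For reference, the "local" versus "remote" choice above is made in
ttwu_queue(). A minimal sketch of the pre-patch logic, pieced together from
the diff context quoted in this thread (plus the trailing unlock), so not a
verbatim copy of kernel/sched/core.c:

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq_flags rf;

#if defined(CONFIG_SMP)
	/*
	 * "Remote" path: the waker and wakee CPUs do not share a cache, so
	 * queue the task on the target CPU's wake_list and let that CPU do
	 * the enqueue itself, possibly after an IPI.
	 */
	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
		ttwu_queue_remote(p, cpu, wake_flags);
		return;
	}
#endif

	/*
	 * "Local" path: take the runqueue lock and activate the task
	 * directly from the waking CPU.
	 */
	rq_lock(rq, &rf);
	update_rq_clock(rq);
	ttwu_do_activate(rq, p, wake_flags, &rf);
	rq_unlock(rq, &rf);
}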

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                           ` <CAE4VaGCGUFOAZ+YHDnmeJ95o4W0j04Yb7EWnf8a43caUQs_WuQ@mail.gmail.com>
@ 2020-05-14 10:08                                             ` Mel Gorman
  2020-05-14 10:22                                               ` Jirka Hladky
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-14 10:08 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 14, 2020 at 11:58:36AM +0200, Jirka Hladky wrote:
> Thank you, Mel!
> 
> We are using netperf as well, but AFAIK it's running on two different
> hosts. I will check with colleagues if they can
> add a network-netperf-unbound run on localhost.
> 
> Is this the right config?
> https://github.com/gormanm/mmtests/blob/345f82bee77cbf09ba57f470a1cfc1ae413c97df/bin/generate-generic-configs
> sed -e 's/NETPERF_BUFFER_SIZES=.*/NETPERF_BUFFER_SIZES=64/'
> config-network-netperf-unbound > config-network-netperf-unbound-small
> 

That's the one I was using at the time for a quick test after
the reconciliation series was completed. It has since changed to
config-network-netperf-cstate-small-cross-socket to limit cstates, bind
the client and server to two local CPUs and use one buffer size. It
was necessary to get an ftrace function graph of the wakeup path that
was readable and not too noisy due to migrations, cpuidle exit costs etc.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-14 10:08                                             ` Mel Gorman
@ 2020-05-14 10:22                                               ` Jirka Hladky
  2020-05-14 11:50                                                 ` Mel Gorman
  0 siblings, 1 reply; 86+ messages in thread
From: Jirka Hladky @ 2020-05-14 10:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Thanks!

Do you have a link? I cannot find it on github
(https://github.com/gormanm/mmtests, searched for
config-network-netperf-cstate-small-cross-socket)


On Thu, May 14, 2020 at 12:08 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 14, 2020 at 11:58:36AM +0200, Jirka Hladky wrote:
> > Thank you, Mel!
> >
> > We are using netperf as well, but AFAIK it's running on two different
> > hosts. I will check with colleagues if they can
> > add a network-netperf-unbound run on localhost.
> >
> > Is this the right config?
> > https://github.com/gormanm/mmtests/blob/345f82bee77cbf09ba57f470a1cfc1ae413c97df/bin/generate-generic-configs
> > sed -e 's/NETPERF_BUFFER_SIZES=.*/NETPERF_BUFFER_SIZES=64/'
> > config-network-netperf-unbound > config-network-netperf-unbound-small
> >
>
> That's the one I was using at the time for a quick test after
> the reconciliation series was completed. It has since changed to
> config-network-netperf-cstate-small-cross-socket to limit cstates, bind
> the client and server to two local CPUs and use one buffer size. It
> was necessary to get an ftrace function graph of the wakeup path that
> was readable and not too noisy due to migrations, cpuidle exit costs etc.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-14 10:22                                               ` Jirka Hladky
@ 2020-05-14 11:50                                                 ` Mel Gorman
  2020-05-14 13:34                                                   ` Jirka Hladky
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-14 11:50 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 14, 2020 at 12:22:05PM +0200, Jirka Hladky wrote:
> Thanks!
> 
> Do you have a link? I cannot find it on github
> (https://github.com/gormanm/mmtests, searched for
> config-network-netperf-cstate-small-cross-socket)
> 

https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-14 11:50                                                 ` Mel Gorman
@ 2020-05-14 13:34                                                   ` Jirka Hladky
  0 siblings, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-14 13:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

THANK YOU!


On Thu, May 14, 2020 at 1:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 14, 2020 at 12:22:05PM +0200, Jirka Hladky wrote:
> > Thanks!
> >
> > Do you have a link? I cannot find it on github
> > (https://github.com/gormanm/mmtests, searched for
> > config-network-netperf-cstate-small-cross-socket)
> >
>
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-13 15:30                                     ` Mel Gorman
  2020-05-13 16:20                                       ` Jirka Hladky
@ 2020-05-14 15:31                                       ` Peter Zijlstra
  2020-05-15  8:47                                         ` Mel Gorman
  2020-05-15 14:43                                       ` Jirka Hladky
  2 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-14 15:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>  	struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +	if (sched_feat(TTWU_QUEUE)) {

just saying that this has the risk of regressing other workloads, see:

  518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")

>  		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>  		ttwu_queue_remote(p, cpu, wake_flags);
>  		return;
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-14 15:31                                       ` Peter Zijlstra
@ 2020-05-15  8:47                                         ` Mel Gorman
  2020-05-15 11:17                                           ` Peter Zijlstra
  2020-05-15 11:28                                           ` Peter Zijlstra
  0 siblings, 2 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-15  8:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Thu, May 14, 2020 at 05:31:22PM +0200, Peter Zijlstra wrote:
> On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> > Complete shot in the dark but restore adjust_numa_imbalance() and try
> > this
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..0b31f4468d5b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> >  	struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +	if (sched_feat(TTWU_QUEUE)) {
> 
> just saying that this has the risk of regressing other workloads, see:
> 
>   518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
> 

Sure, I didn't really think that this was appropriate but it was the best I
had at the time. This can be a lot more targeted but it took me a while
to put together a picture as it was difficult for me to analyse. I was
going to post this as a separate RFC but given this thread, what do you
think of this?

---8<---
sched: Wake cache-local tasks via wake_list if wakee CPU is polling

There are two separate methods for waking a task from idle, one for
tasks running on CPUs that share a cache and one for when the caches are
separate. The methods can loosely be called local and remote even though
this is not directly related to NUMA and instead is due to the expected
cost of accessing data that is cache-hot on another CPU. The "local" costs
are expected to be relatively cheap as they share the LLC in comparison to
a remote IPI that is potentially required when using the "remote" wakeup.
The problem is that the local wakeup is not always cheaper and it appears
to have degraded even further around the 4.19 mark.

There appears to be a couple of reasons why it can be slower.

The first is specific to pairs of tasks where one or both rapidly enters
idle. For example, on netperf UDP_STREAM, the client sends a bunch of
packets, wakes the server, the server processes some packets and goes
back to sleep. There is a relationship between the tasks but it's not
strictly synchronous. The timing is different if the client/server are on
separate NUMA nodes and netserver is more likely to enter idle (measured
as server entering idle 10% more times when tasks are local vs remote
but machine-specific). However, the wakeups are so rapid that the wakeup
happens while the server is descheduling. That forces the waker to spin
on smp_cond_load_acquire for longer. In this case, it can be cheaper to
add the task to the rq->wake_list even if that potentially requires an IPI.

The second is that the local wakeup path is simply not always
that fast. Using ftrace, the cost of the locks, update_rq_clock and
ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
a single instance, the cost of the "remote" wakeup for try_to_wake_up
was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
wakeup. When there are tens of thousands of wakeups per second, these costs
accumulate and manifest as a throughput regression in netperf UDP_STREAM.

The essential difference in costs comes down to whether the CPU is fully
idle, a task is descheduling or polling in poll_idle(). This patch
special cases ttwu_queue() to use the "remote" method if the CPU's
task is polling as it's generally cheaper to use the wake_list in that
case than the local method because an IPI should not be required. As it is
race-prone, a reschedule IPI may still be sent but in that case the local
wakeup would also have to send a reschedule IPI so it should be harmless.

The benefit is not universal as it requires the specific pattern of wakeups
that happen when the wakee is likely to be descheduling. netperf UDP_STREAM
is one of the more obvious examples and it is more obvious since the
load balancer was reconciled and the client and server are more likely to
be running on the same NUMA node which is why it was missed for so long.

netperf-udp
                                  5.7.0-rc5              5.7.0-rc5
                                    vanilla          wakelist-v1r1
Hmean     send-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     send-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     send-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     send-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     send-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     send-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     send-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     send-8192     17943.12 (   0.00%)    19886.07 *  10.83%*
Hmean     send-16384    29732.94 (   0.00%)    32297.44 *   8.63%*
Hmean     recv-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     recv-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     recv-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     recv-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     recv-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     recv-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     recv-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     recv-8192     17943.12 (   0.00%)    19886.06 *  10.83%*
Hmean     recv-16384    29731.75 (   0.00%)    32296.08 *   8.62%*

It's less obvious on something like TCP_STREAM as there is a stricter
relationship between the client and server but with the patch, it's
much less variable.

netperf-tcp
                             5.7.0-rc5              5.7.0-rc5
                               vanilla          wakelist-v1r1
Hmean     64         769.28 (   0.00%)      779.96 *   1.39%*
Hmean     128       1497.54 (   0.00%)     1509.33 *   0.79%*
Hmean     256       2806.00 (   0.00%)     2787.59 (  -0.66%)
Hmean     1024      9700.80 (   0.00%)     9720.02 (   0.20%)
Hmean     2048     17071.28 (   0.00%)    16968.92 (  -0.60%)
Hmean     3312     22976.20 (   0.00%)    23012.80 (   0.16%)
Hmean     4096     25885.74 (   0.00%)    26152.49 *   1.03%*
Hmean     8192     34014.83 (   0.00%)    33878.67 (  -0.40%)
Hmean     16384    39482.14 (   0.00%)    40307.22 (   2.09%)
Stddev    64           2.60 (   0.00%)        0.17 (  93.31%)
Stddev    128          3.69 (   0.00%)        1.52 (  58.97%)
Stddev    256         31.06 (   0.00%)       14.76 (  52.47%)
Stddev    1024        96.50 (   0.00%)       46.43 (  51.89%)
Stddev    2048       115.98 (   0.00%)       62.59 (  46.03%)
Stddev    3312       293.28 (   0.00%)       41.92 (  85.71%)
Stddev    4096       173.45 (   0.00%)      123.19 (  28.98%)
Stddev    8192       783.59 (   0.00%)      129.62 (  83.46%)
Stddev    16384     1012.37 (   0.00%)      252.08 (  75.10%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..59077c7c6660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+	if (sched_feat(TTWU_QUEUE)) {
+		/*
+		 * A remote wakeup is often expensive as can require
+		 * an IPI and the wakeup path is slow. However, in
+		 * the specific case where the target CPU is idle
+		 * and polling, the CPU is active and rapidly checking
+		 * if a reschedule is needed. In this case, the idle
+		 * task just needs to be marked for resched and p
+		 * will rapidly be requeued which is less expensive
+		 * than the direct wakeup path.
+		 */
+		if (cpus_share_cache(smp_processor_id(), cpu)) {
+			struct thread_info *ti = task_thread_info(p);
+			typeof(ti->flags) val = READ_ONCE(ti->flags);
+
+			if (val & _TIF_POLLING_NRFLAG)
+				goto activate;
+		}
+
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		ttwu_queue_remote(p, cpu, wake_flags);
 		return;
 	}
 #endif
 
+activate:
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	ttwu_do_activate(rq, p, wake_flags, &rf);

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15  8:47                                         ` Mel Gorman
@ 2020-05-15 11:17                                           ` Peter Zijlstra
  2020-05-15 13:03                                             ` Mel Gorman
  2020-05-15 14:24                                             ` Peter Zijlstra
  2020-05-15 11:28                                           ` Peter Zijlstra
  1 sibling, 2 replies; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 11:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:

> However, the wakeups are so rapid that the wakeup
> happens while the server is descheduling. That forces the waker to spin
> on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> add the task to the rq->wake_list even if that potentially requires an IPI.

Right, I think Rik ran into that as well at some point. He wanted to
make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
cpu (which is currently in the middle of schedule()) should be an easier
proposition.

Maybe something like this untested thing... could explode most mighty,
didn't think too hard.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa6c19d38e82..c07b92a0ee5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
+	if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
 
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
@@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15  8:47                                         ` Mel Gorman
  2020-05-15 11:17                                           ` Peter Zijlstra
@ 2020-05-15 11:28                                           ` Peter Zijlstra
  2020-05-15 12:22                                             ` Mel Gorman
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 11:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> sched: Wake cache-local tasks via wake_list if wakee CPU is polling
> 
> > There are two separate methods for waking a task from idle, one for
> tasks running on CPUs that share a cache and one for when the caches are
> separate. The methods can loosely be called local and remote even though
> this is not directly related to NUMA and instead is due to the expected
> cost of accessing data that is cache-hot on another CPU. The "local" costs
> are expected to be relatively cheap as they share the LLC in comparison to
> a remote IPI that is potentially required when using the "remote" wakeup.
> The problem is that the local wakeup is not always cheaper and it appears
> to have degraded even further around the 4.19 mark.
> 
> There appears to be a couple of reasons why it can be slower.
> 
> The first is specific to pairs of tasks where one or both rapidly enters
> idle. For example, on netperf UDP_STREAM, the client sends a bunch of
> packets, wakes the server, the server processes some packets and goes
> back to sleep. There is a relationship between the tasks but it's not
> strictly synchronous. The timing is different if the client/server are on
> separate NUMA nodes and netserver is more likely to enter idle (measured
> as server entering idle 10% more times when tasks are local vs remote
> but machine-specific). However, the wakeups are so rapid that the wakeup
> happens while the server is descheduling. That forces the waker to spin
> on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> The second is that the local wakeup path is simply not always
> that fast. Using ftrace, the cost of the locks, update_rq_clock and
> ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
> a single instance, the cost of the "remote" wakeup for try_to_wake_up
> was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
> wakeup. When there are tens of thousands of wakeups per second, these costs
> accumulate and manifest as a throughput regression in netperf UDP_STREAM.
> 
> The essential difference in costs comes down to whether the CPU is fully
> idle, a task is descheduling or polling in poll_idle(). This patch
> > special cases ttwu_queue() to use the "remote" method if the CPU's
> task is polling as it's generally cheaper to use the wake_list in that
> case than the local method because an IPI should not be required. As it is
> race-prone, a reschedule IPI may still be sent but in that case the local
> wakeup would also have to send a reschedule IPI so it should be harmless.

We don't in fact send a wakeup IPI when polling. So this might end up
with an extra IPI.


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..59077c7c6660 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>  	struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +	if (sched_feat(TTWU_QUEUE)) {
> +		/*
> +		 * A remote wakeup is often expensive as can require
> +		 * an IPI and the wakeup path is slow. However, in
> +		 * the specific case where the target CPU is idle
> +		 * and polling, the CPU is active and rapidly checking
> +		 * if a reschedule is needed.

Not strictly true; MWAIT can be very deep idle, it's just that with
POLLING we indicate we do not have to send an IPI to wake up. Just
setting the TIF_NEED_RESCHED flag is sufficient to wake up -- the
monitor part of monitor-wait.
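
Roughly, the handshake looks like this (a sketch only; the helper name is
made up and this is not the exact set_nr_if_polling() implementation):

/*
 * Sketch: if the target CPU advertises _TIF_POLLING_NRFLAG, it is watching
 * its thread flags (poll_idle() or the monitor half of monitor/mwait), so
 * atomically setting TIF_NEED_RESCHED is enough to make it reschedule and
 * no IPI is needed.
 */
static bool try_set_resched_if_polling(struct task_struct *idle)
{
	struct thread_info *ti = task_thread_info(idle);
	typeof(ti->flags) old, val = READ_ONCE(ti->flags);

	for (;;) {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;	/* not polling: caller must send an IPI */
		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
		if (old == val)
			return true;	/* flag set, the monitor will notice */
		val = old;
	}
}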

>                                             In this case, the idle
> +		 * task just needs to be marked for resched and p
> +		 * will rapidly be requeued which is less expensive
> +		 * than the direct wakeup path.
> +		 */
> +		if (cpus_share_cache(smp_processor_id(), cpu)) {
> +			struct thread_info *ti = task_thread_info(p);
> +			typeof(ti->flags) val = READ_ONCE(ti->flags);
> +
> +			if (val & _TIF_POLLING_NRFLAG)
> +				goto activate;

I'm completely confused... the result here is that if you're polling you
do _NOT_ queue on the wake_list, but instead immediately enqueue.

(which kinda makes sense, since if the remote CPU is idle, it doesn't
have these lines in its cache anyway)

But the subject and comments all seem to suggest the opposite !?

Also, this will fail compilation when !defined(TIF_POLLING_NRFLAG).

> +		}
> +
>  		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>  		ttwu_queue_remote(p, cpu, wake_flags);
>  		return;
>  	}
>  #endif
>  
> +activate:

The label wants to go inside the ifdef, otherwise GCC will complain
about unused labels etc.

>  	rq_lock(rq, &rf);
>  	update_rq_clock(rq);
>  	ttwu_do_activate(rq, p, wake_flags, &rf);

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 11:28                                           ` Peter Zijlstra
@ 2020-05-15 12:22                                             ` Mel Gorman
  2020-05-15 12:51                                               ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-15 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 01:28:15PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > sched: Wake cache-local tasks via wake_list if wakee CPU is polling
> > 
> > > There are two separate methods for waking a task from idle, one for
> > tasks running on CPUs that share a cache and one for when the caches are
> > separate. The methods can loosely be called local and remote even though
> > this is not directly related to NUMA and instead is due to the expected
> > cost of accessing data that is cache-hot on another CPU. The "local" costs
> > are expected to be relatively cheap as they share the LLC in comparison to
> > a remote IPI that is potentially required when using the "remote" wakeup.
> > The problem is that the local wakeup is not always cheaper and it appears
> > to have degraded even further around the 4.19 mark.
> > 
> > There appears to be a couple of reasons why it can be slower.
> > 
> > The first is specific to pairs of tasks where one or both rapidly enters
> > idle. For example, on netperf UDP_STREAM, the client sends a bunch of
> > packets, wakes the server, the server processes some packets and goes
> > back to sleep. There is a relationship between the tasks but it's not
> > strictly synchronous. The timing is different if the client/server are on
> > separate NUMA nodes and netserver is more likely to enter idle (measured
> > as server entering idle 10% more times when tasks are local vs remote
> > but machine-specific). However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> > 
> > The second is that the local wakeup path is simply not always
> > that fast. Using ftrace, the cost of the locks, update_rq_clock and
> > ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
> > a single instance, the cost of the "remote" wakeup for try_to_wake_up
> > was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
> > wakeup. When there are tens of thousands of wakeups per second, these costs
> > accumulate and manifest as a throughput regression in netperf UDP_STREAM.
> > 
> > The essential difference in costs comes down to whether the CPU is fully
> > idle, a task is descheduling or polling in poll_idle(). This patch
> > > special cases ttwu_queue() to use the "remote" method if the CPU's
> > task is polling as it's generally cheaper to use the wake_list in that
> > case than the local method because an IPI should not be required. As it is
> > race-prone, a reschedule IPI may still be sent but in that case the local
> > wakeup would also have to send a reschedule IPI so it should be harmless.
> 
> We don't in fact send a wakeup IPI when polling.

I know when it's polling that an IPI can be skipped as set_nr_if_polling
will succeed.

> So this might end up
> with an extra IPI.
> 

Yes, it might. Between the check for polling and set_nr_and_not_polling,
the polling may have stopped and the CPU entered a c-state, in which case
an IPI is sent, but then either path is likely to result in some sort
of IPI.

> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..59077c7c6660 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> >  	struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +	if (sched_feat(TTWU_QUEUE)) {
> > +		/*
> > +		 * A remote wakeup is often expensive as can require
> > +		 * an IPI and the wakeup path is slow. However, in
> > +		 * the specific case where the target CPU is idle
> > +		 * and polling, the CPU is active and rapidly checking
> > +		 * if a reschedule is needed.
> 
> Not strictly true; MWAIT can be very deep idle, it's just that with
> POLLING we indicate we do not have to send an IPI to wake up. Just
> setting the TIF_NEED_RESCHED flag is sufficient to wake up -- the
> monitor part of monitor-wait.
> 

The terminology is unhelpful. In this case when I said "idle and polling",
I was thinking of the idle task running poll_idle() and the CPU should
not be in any c-state below C0 but it doesn't really matter due to your
own comments.

> >                                             In this case, the idle
> > +		 * task just needs to be marked for resched and p
> > +		 * will rapidly be requeued which is less expensive
> > +		 * than the direct wakeup path.
> > +		 */
> > +		if (cpus_share_cache(smp_processor_id(), cpu)) {
> > +			struct thread_info *ti = task_thread_info(p);
> > +			typeof(ti->flags) val = READ_ONCE(ti->flags);
> > +
> > +			if (val & _TIF_POLLING_NRFLAG)
> > +				goto activate;
> 
> I'm completely confused... the result here is that if you're polling you
> do _NOT_ queue on the wake_list, but instead immediately enqueue.
> 
> (which kinda makes sense, since if the remote CPU is idle, it doesn't
> have these lines in its cache anyway)
> 

Crap, I rushed this and severely confused myself about what is going
on. It is definitely the case that flipping this check does not give
any benefit. The patch shows a benefit but I'm failing to understand
exactly why. How I ended up here was perf indicating a lot of time spent
on smp_cond_load_acquire() which made me look closely at ttwu_remote()
and looking at function graphs to compare the different types of wakeups
and their timings.

With this patch, when netperf is waking the server, netperf is all that
is active, with the wakeup completing in 3.489 microseconds. If this check
is flipped, the idle thread is running at the same time and clearly
interleaving with netperf in the ftrace function_graph and the wakeup
takes 9.449 microseconds. The exact timings vary obviously but it's
what led to the patch -- brute force starting at function_graphs until
something fell out.

> But the subject and comments all seem to suggest the opposite !?
> 

Yeah, I need to think more about why exactly this patch makes a difference.

> Also, this will fail compilation when !defined(TIF_POLLING_NRFLAG).
> 

I can fix that.

> > +		}
> > +
> >  		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> >  		ttwu_queue_remote(p, cpu, wake_flags);
> >  		return;
> >  	}
> >  #endif
> >  
> > +activate:
> 
> The label wants to go inside the ifdef, otherwise GCC will complain
> about unused labels etc.
> 

And that too.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 12:22                                             ` Mel Gorman
@ 2020-05-15 12:51                                               ` Peter Zijlstra
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 12:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 01:22:31PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2020 at 01:28:15PM +0200, Peter Zijlstra wrote:

> > > +			if (val & _TIF_POLLING_NRFLAG)
> > > +				goto activate;
> > 
> > I'm completely confused... the result here is that if you're polling you
> > do _NOT_ queue on the wake_list, but instead immediately enqueue.
> > 
> > (which kinda makes sense, since if the remote CPU is idle, it doesn't
> > have these lines in its cache anyway)
> > 
> 
> Crap, I rushed this and severely confused myself about what is going

Hehe, and here I though I was confused :-)

> on. It is definitely the case that flipping this check does not give
> any benefit. The patch shows a benefit but I'm failing to understand
> exactly why. How I ended up here was perf indicating a lot of time spent
> on smp_cond_load_acquire() which made me look closely at ttwu_remote()
> and looking at function graphs to compare the different types of wakeups
> and their timings.

So the raisin we did this remote wakeup thing in the first place was
that Oracle was having very heavy rq->lock cache-line contention. By
farming off the enqueue to the CPU that was going to run the task
anyway, the rq->lock (and the other runqueue structure lines) could stay
in the CPU that was using them (hard). Less cacheline ping-pong, more
win.

The observation here is that if a CPU is idle, its rq will not be
contended.
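
As an illustration of that farming-off, the remote path boils down to a
lockless list push plus, at most, one IPI. A sketch, not verbatim kernel
code; the helper name is made up and the wake_entry/wake_list naming follows
the discussion above:

/*
 * Sketch: instead of the waker taking the remote rq->lock itself (bouncing
 * the runqueue cachelines over to the waking CPU), push the task onto the
 * target CPU's wake_list and let that CPU enqueue it under its own rq->lock,
 * so the hot lines stay where they are actually used.
 */
static void queue_wakeup_on_target(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (llist_add(&p->wake_entry, &rq->wake_list)) {
		/* The list was empty: the target CPU needs a kick... */
		if (!set_nr_if_polling(rq->idle))
			/* ...and a real IPI only if it is not polling. */
			smp_send_reschedule(cpu);
	}
}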

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 11:17                                           ` Peter Zijlstra
@ 2020-05-15 13:03                                             ` Mel Gorman
  2020-05-15 13:12                                               ` Peter Zijlstra
  2020-05-15 14:24                                             ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-15 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> 
> > However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> Right, I think Rik ran into that as well at some point. He wanted to
> make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> cpu (which is currently in the middle of schedule()) should be an easier
> proposition.
> 
> Maybe something like this untested thing... could explode most mighty,
> didn't think too hard.
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fa6c19d38e82..c07b92a0ee5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
>  	sched_ttwu_pending();
>  }
>  
> -static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  
> @@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
>  {
>  	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
>  }
> +
> +static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +{
> +	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> +		__ttwu_queue_remote(p, cpu, wake_flags);
> +		return true;
> +	}
> +
> +	return false;
> +}
>  #endif /* CONFIG_SMP */
>  
>  static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> @@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>  	struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> -		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> -		ttwu_queue_remote(p, cpu, wake_flags);
> +	if (ttwu_queue_remote(p, cpu, wake_flags))
>  		return;
> -	}
>  #endif
>  
>  	rq_lock(rq, &rf);
> @@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto unlock;
>  
> +	if (p->in_iowait) {
> +		delayacct_blkio_end(p);
> +		atomic_dec(&task_rq(p)->nr_iowait);
> +	}
> +
>  #ifdef CONFIG_SMP
> +	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> +	p->state = TASK_WAKING;
> +
>  	/*
>  	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
>  	 * possible to, falsely, observe p->on_cpu == 0.
> @@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 * This ensures that tasks getting woken will be fully ordered against
>  	 * their previous state and preserve Program Order.
>  	 */
> -	smp_cond_load_acquire(&p->on_cpu, !VAL);
> -
> -	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> -	p->state = TASK_WAKING;
> +	if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
> +		goto unlock;
>  
> -	if (p->in_iowait) {
> -		delayacct_blkio_end(p);
> -		atomic_dec(&task_rq(p)->nr_iowait);
> -	}
> +	smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
>  	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>  	if (task_cpu(p) != cpu) {

I don't see a problem with moving the updating of p->state to the other
side of the barrier but I'm relying on the comment that the barrier is
only related to on_rq and on_cpu.

However, I'm less sure about what exactly you intended to do.
__ttwu_queue_remote is void so maybe you meant to use ttwu_queue_remote.
In that case, we potentially avoid spinning on on_rq for wakeups between
tasks that do not share CPU but it's not clear why it would be specific to
remote tasks. If you meant to call __ttwu_queue_remote unconditionally,
it's not clear why that's now safe when smp_cond_load_acquire() 
cared about on_rq being 0 before queueing a task for wakup or directly
waking it up.

Also because __ttwu_queue_remote() now happens before select_task_rq(), is
there not a risk that in some cases we end up stacking tasks unnecessarily?

> @@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		psi_ttwu_dequeue(p);
>  		set_task_cpu(p, cpu);
>  	}
> -
> -#else /* CONFIG_SMP */
> -
> -	if (p->in_iowait) {
> -		delayacct_blkio_end(p);
> -		atomic_dec(&task_rq(p)->nr_iowait);
> -	}
> -
>  #endif /* CONFIG_SMP */
>  
>  	ttwu_queue(p, cpu, wake_flags);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 13:03                                             ` Mel Gorman
@ 2020-05-15 13:12                                               ` Peter Zijlstra
  2020-05-15 13:28                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 13:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

On Fri, May 15, 2020 at 02:03:46PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> > On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > +static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> > +{
> > +	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> > +		__ttwu_queue_remote(p, cpu, wake_flags);
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}

> > +	if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
> > +		goto unlock;

> I don't see a problem with moving the updating of p->state to the other
> side of the barrier but I'm relying on the comment that the barrier is
> only related to on_rq and on_cpu.

Yeah, I went with that too, like I said, didn't think too hard.

> However, I'm less sure about what exactly you intended to do.
> __ttwu_queue_remote is void so maybe you meant to use ttwu_queue_remote.

That!

> In that case, we potentially avoid spinning on on_rq for wakeups between
> tasks that do not share CPU but it's not clear why it would be specific to
> remote tasks.

The thinking was that we can avoid spinning on ->on_cpu, and let the CPU
get on with things. Rik had a workload where that spinning was
significant, and I thought I understood you saw the same.

By sticking the task on the wake_list of the CPU that's in charge of
clearing ->on_cpu we ensure ->on_cpu is 0 by the time we get to doing
the actual enqueue.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 13:12                                               ` Peter Zijlstra
@ 2020-05-15 13:28                                                 ` Peter Zijlstra
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 13:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel


> > In that case, we potentially avoid spinning on on_rq for wakeups between
> > tasks that do not share CPU but it's not clear why it would be specific to
> > remote tasks.
> 
> The thinking was that we can avoid spinning on ->on_cpu, and let the CPU
> get on with things. Rik had a workload where that spinning was
> significant, and I thought to have understood you saw the same.
> 
> By sticking the task on the wake_list of the CPU that's in charge of
> clearing ->on_cpu we ensure ->on_cpu is 0 by the time we get to doing
> the actual enqueue.
> 

Of course, aside from sending an obviously broken patch, I also forgot
to Cc Rik.

This one compiled, booted and built a kernel, so it must be perfect :-)

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b64ffd6c728..df588ac75bf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,7 +2330,7 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2372,6 +2372,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2380,11 +2391,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2568,7 +2576,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2599,15 +2615,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
 
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
@@ -2615,14 +2626,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 11:17                                           ` Peter Zijlstra
  2020-05-15 13:03                                             ` Mel Gorman
@ 2020-05-15 14:24                                             ` Peter Zijlstra
  2020-05-21 10:38                                               ` Mel Gorman
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-15 14:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel

On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> 
> > However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> Right, I think Rik ran into that as well at some point. He wanted to
> make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> cpu (which is currently in the middle of schedule()) should be an easier
> proposition.
> 
> Maybe something like this untested thing... could explode most mighty,
> didn't think too hard.
> 

Mel pointed out that that patch got mutilated somewhere (my own .Sent
copy was fine), let me try again.


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b64ffd6c728..df588ac75bf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,7 +2330,7 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2372,6 +2372,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2380,11 +2391,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2568,7 +2576,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2599,15 +2615,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
 
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
@@ -2615,14 +2626,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-13 15:30                                     ` Mel Gorman
  2020-05-13 16:20                                       ` Jirka Hladky
  2020-05-14 15:31                                       ` Peter Zijlstra
@ 2020-05-15 14:43                                       ` Jirka Hladky
  2 siblings, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-15 14:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>         struct rq_flags rf;
>  #if defined(CONFIG_SMP)
> -       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +       if (sched_feat(TTWU_QUEUE)) {
>                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>                 ttwu_queue_remote(p, cpu, wake_flags);
>                 return;

Hi Mel,

we have performance results for the proposed patch above ^^^.
Unfortunately, it hasn't helped the performance.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with single instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > thread counts (10% for 16 threads and 20% for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>         struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +       if (sched_feat(TTWU_QUEUE)) {
>                 sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>                 ttwu_queue_remote(p, cpu, wake_flags);
>                 return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                 ` <20200508034741.13036-1-hdanton@sina.com>
@ 2020-05-18 14:52                                   ` Jirka Hladky
       [not found]                                     ` <20200519043154.10876-1-hdanton@sina.com>
  0 siblings, 1 reply; 86+ messages in thread
From: Jirka Hladky @ 2020-05-18 14:52 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Phil Auld, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray

Hi Hillf,

thanks a lot for your patch!

Compared to 5.7 rc4 vanilla, we observe the following:
  * Single-tenant jobs show improvement up to 15% for SPECjbb2005 and
up to 100% for NAS in low thread mode. In other words, it fixes all
the problems we have reported in this thread.
  * Multitenancy jobs show performance degradation up to 30% for SPECjbb2005

While it fixes problems with single-tenant jobs and with performance
at low system load, it breaks multi-tenant tasks.

We have compared it against kernel with adjust_numa_imbalance disabled
[1], and both kernels perform at the same level for the single-tenant
jobs, but the proposed patch is bad for the multitenancy mode. The
kernel with adjust_numa_imbalance disabled is a clear winner here.

We would be very interested in what others think about disabling
the adjust_numa_imbalance function. The patch is below. It would be great
to collect performance results for different scenarios to make sure
the results are objective.

Thanks a lot!
Jirka

[1] Patch to test kernel with adjust_numa_imbalance disabled:
===============================================
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..8c43d29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8907,14 +8907,6 @@ static inline long adjust_numa_imbalance(int
imbalance, int src_nr_running)
{
       unsigned int imbalance_min;

-       /*
-        * Allow a small imbalance based on a simple pair of communicating
-        * tasks that remain local when the source domain is almost idle.
-        */
-       imbalance_min = 2;
-       if (src_nr_running <= imbalance_min)
-               return 0;
-
       return imbalance;
}
===============================================





On Fri, May 8, 2020 at 5:47 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Thu, 7 May 2020 13:49:34 Phil Auld wrote:
> >
> > On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > we are not targeting just OMP applications. We see the performance
> > > degradation also for other workloads, like SPECjbb2005 and
> > > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > > threads (the system has 64 CPUs in total). We observe this degradation
> > > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > > binaries in parallel, there is no change in performance between 5.6
> > > and 5.7.
> > >
> > > That's why we are asking for the kernel tunable, which we would add to
> > > the tuned profile. We don't expect users to change this frequently but
> > > rather to set the performance profile once based on the purpose of the
> > > server.
> > >
> > > If you could prepare a patch for us, we would be more than happy to
> > > test it extensively. Based on the results, we can then evaluate if
> > > it's the way to go. Thoughts?
> > >
> >
> > I'm happy to spin up a patch once I'm sure what exactly the tuning would
> > effect. At an initial glance I'm thinking it would be the imbalance_min
> > which is currently hardcoded to 2. But there may be something else...
>
> hrm... try to restore the old behavior by skipping task migrate in favor
> of the local node if there is no imbalance.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1928,18 +1928,16 @@ static void task_numa_find_cpu(struct ta
>                 src_running = env->src_stats.nr_running - 1;
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
> -               imbalance = adjust_numa_imbalance(imbalance, src_running);
> +               imbalance = adjust_numa_imbalance(imbalance, src_running +1);
>
> -               /* Use idle CPU if there is no imbalance */
> +               /* No task migrate without imbalance */
>                 if (!imbalance) {
> -                       maymove = true;
> -                       if (env->dst_stats.idle_cpu >= 0) {
> -                               env->dst_cpu = env->dst_stats.idle_cpu;
> -                               task_numa_assign(env, NULL, 0);
> -                               return;
> -                       }
> +                       env->best_cpu = -1;
> +                       return;
>                 }
> -       } else {
> +       }
> +
> +       do {
>                 long src_load, dst_load, load;
>                 /*
>                  * If the improvement from just moving env->p direction is better
> @@ -1949,7 +1947,7 @@ static void task_numa_find_cpu(struct ta
>                 dst_load = env->dst_stats.load + load;
>                 src_load = env->src_stats.load - load;
>                 maymove = !load_too_imbalanced(src_load, dst_load, env);
> -       }
> +       } while (0);
>
>         for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
>                 /* Skip this CPU if the source task cannot migrate */
>
>


--
-Jirka


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                     ` <20200519043154.10876-1-hdanton@sina.com>
@ 2020-05-20 13:58                                       ` Jirka Hladky
  2020-05-20 16:01                                         ` Jirka Hladky
                                                           ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-20 13:58 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Phil Auld, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, aokuliar, kkolakow

Hi Hillf, Mel and all,

thanks for the patch! It has produced really GOOD results.

1) It has fixed performance problems with 5.7 vanilla kernel for
single-tenant workload and low system load scenarios, without
performance degradation for the multi-tenant tasks. It's producing the
same results as the previous proof-of-concept patch where
adjust_numa_imbalance function was modified to be a no-op (returning
the same value of imbalance as it gets on the input).

2) We have also added Mel's netperf-cstate-small-cross-socket test to
our test battery:
https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket

Mel told me that he had seen significant performance improvements with
5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.

Out of 6 different patches we have tested, your patch has performed
the best for this scenario. Compared to vanilla, we see minimal
performance degradation of 2.5% for the udp stream throughput and 0.6%
for the tcp throughput. The testing was done on a dual-socket system
with Gold 6132 CPU.

@Mel - could you please test Hillf's patch with your full testing
suite? So far, it looks very promising, but I would like to check the
patch thoroughly to make sure it does not hurt performance in other
areas.

Thanks a lot!
Jirka












On Tue, May 19, 2020 at 6:32 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> Hi Jirka
>
> On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> >
> > We have compared it against kernel with adjust_numa_imbalance disabled
> > [1], and both kernels perform at the same level for the single-tenant
> > jobs, but the proposed patch is bad for the multitenancy mode. The
> > kernel with adjust_numa_imbalance disabled is a clear winner here.
>
> Double thanks to you for the tests!
>
> > We would be very interested in what others think about disabling
> > adjust_numa_imbalance function. The patch is bellow. It would be great
>
> A minute...
>
> > to collect performance results for different scenarios to make sure
> > the results are objective.
>
> I don't have another test case but a diff trying to confine the tool
> in question back to the hard-coded 2's field.
>
> It's used in the first hunk below to detect imbalance before migrating
> a task, and a small churn of code is added at another call site when
> balancing idle CPUs.
>
> Thanks
> Hillf
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
>          * imbalance that would be overruled by the load balancer.
>          */
>         if (env->dst_stats.node_type == node_has_spare) {
> -               unsigned int imbalance;
> -               int src_running, dst_running;
> +               unsigned int imbalance = 2;
>
> -               /*
> -                * Would movement cause an imbalance? Note that if src has
> -                * more running tasks that the imbalance is ignored as the
> -                * move improves the imbalance from the perspective of the
> -                * CPU load balancer.
> -                * */
> -               src_running = env->src_stats.nr_running - 1;
> -               dst_running = env->dst_stats.nr_running + 1;
> -               imbalance = max(0, dst_running - src_running);
> -               imbalance = adjust_numa_imbalance(imbalance, src_running);
> +               //No imbalance computed without spare capacity
> +               if (env->dst_stats.node_type != env->src_stats.node_type)
> +                       goto check_imb;
> +
> +               imbalance = adjust_numa_imbalance(imbalance,
> +                                               env->src_stats.nr_running);
> +
> +               //Do nothing without imbalance
> +               if (!imbalance) {
> +                       imbalance = 2;
> +                       goto check_imb;
> +               }
> +
> +               //Migrate task if it's likely to grow balance
> +               if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> +                       imbalance = 0;
>
> +check_imb:
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
>                         maymove = true;
> @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
>                         env->migration_type = migrate_task;
>                         env->imbalance = max_t(long, 0, (local->idle_cpus -
>                                                  busiest->idle_cpus) >> 1);
> -               }
>
> -               /* Consider allowing a small imbalance between NUMA groups */
> -               if (env->sd->flags & SD_NUMA)
> -                       env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                                               busiest->sum_nr_running);
> +                       /* Consider allowing a small imbalance between NUMA groups */
> +                       if (env->sd->flags & SD_NUMA &&
> +                           local->group_type == busiest->group_type)
> +                               env->imbalance = adjust_numa_imbalance(env->imbalance,
> +                                                       busiest->sum_nr_running);
> +               }
>
>                 return;
>         }
> --
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-20 13:58                                       ` Jirka Hladky
@ 2020-05-20 16:01                                         ` Jirka Hladky
  2020-05-21 11:06                                         ` Mel Gorman
       [not found]                                         ` <20200521140931.15232-1-hdanton@sina.com>
  2 siblings, 0 replies; 86+ messages in thread
From: Jirka Hladky @ 2020-05-20 16:01 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Phil Auld, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, aokuliar, kkolakow

I have an update on netperf-cstate-small-cross-socket results.

Reported performance degradation of 2.5% for the UDP stream throughput
and 0.6% for the TCP throughput is for message size of 16kB. For
smaller message sizes, the performance drop is higher - up to 5% for
UDP throughput for a message size of 64B. See the numbers below [1]

We still think that it's acceptable given the gains in other
situations (this is again compared to 5.7 vanilla) :

* solved the performance drop of up to 20% with single instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
thread counts (10% for 16 threads and 20% for 32 threads)
* solved the performance drop of up to 50% for low load scenarios
(SPECjvm2008 and NAS)

[1]
Hillf's patch compared to 5.7 (rc4) vanilla:

TCP throughput
Message size (B)
64          -2.6%
128        -2.3%
256        -2.6%
1024      -2.7%
2048      -2.2%
3312      -2.4%
4096      -1.1%
8192      -0.4%
16384    -0.6%

UDP throughput
64          -5.0%
128        -3.0%
256        -3.0%
1024      -3.1%
2048      -3.3%
3312      -3.5%
4096      -4.0%
8192      -3.3%
16384    -2.6%


On Wed, May 20, 2020 at 3:58 PM Jirka Hladky <jhladky@redhat.com> wrote:
>
> Hi Hillf, Mel and all,
>
> thanks for the patch! It has produced really GOOD results.
>
> 1) It has fixed performance problems with 5.7 vanilla kernel for
> single-tenant workload and low system load scenarios, without
> performance degradation for the multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
>
> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
>
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput. The testing was done on a dual-socket system
> with Gold 6132 CPU.
>
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
>
> Thanks a lot!
> Jirka
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, May 19, 2020 at 6:32 AM Hillf Danton <hdanton@sina.com> wrote:
> >
> >
> > Hi Jirka
> >
> > On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> > >
> > > We have compared it against kernel with adjust_numa_imbalance disabled
> > > [1], and both kernels perform at the same level for the single-tenant
> > > jobs, but the proposed patch is bad for the multitenancy mode. The
> > > kernel with adjust_numa_imbalance disabled is a clear winner here.
> >
> > Double thanks to you for the tests!
> >
> > > We would be very interested in what others think about disabling
> > > adjust_numa_imbalance function. The patch is bellow. It would be great
> >
> > A minute...
> >
> > > to collect performance results for different scenarios to make sure
> > > the results are objective.
> >
> > I don't have another test case but a diff trying to confine the tool
> > in question back to the hard-coded 2's field.
> >
> > It's used in the first hunk below to detect imbalance before migrating
> > a task, and a small churn of code is added at another call site when
> > balancing idle CPUs.
> >
> > Thanks
> > Hillf
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
> >          * imbalance that would be overruled by the load balancer.
> >          */
> >         if (env->dst_stats.node_type == node_has_spare) {
> > -               unsigned int imbalance;
> > -               int src_running, dst_running;
> > +               unsigned int imbalance = 2;
> >
> > -               /*
> > -                * Would movement cause an imbalance? Note that if src has
> > -                * more running tasks that the imbalance is ignored as the
> > -                * move improves the imbalance from the perspective of the
> > -                * CPU load balancer.
> > -                * */
> > -               src_running = env->src_stats.nr_running - 1;
> > -               dst_running = env->dst_stats.nr_running + 1;
> > -               imbalance = max(0, dst_running - src_running);
> > -               imbalance = adjust_numa_imbalance(imbalance, src_running);
> > +               //No imbalance computed without spare capacity
> > +               if (env->dst_stats.node_type != env->src_stats.node_type)
> > +                       goto check_imb;
> > +
> > +               imbalance = adjust_numa_imbalance(imbalance,
> > +                                               env->src_stats.nr_running);
> > +
> > +               //Do nothing without imbalance
> > +               if (!imbalance) {
> > +                       imbalance = 2;
> > +                       goto check_imb;
> > +               }
> > +
> > +               //Migrate task if it's likely to grow balance
> > +               if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> > +                       imbalance = 0;
> >
> > +check_imb:
> >                 /* Use idle CPU if there is no imbalance */
> >                 if (!imbalance) {
> >                         maymove = true;
> > @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
> >                         env->migration_type = migrate_task;
> >                         env->imbalance = max_t(long, 0, (local->idle_cpus -
> >                                                  busiest->idle_cpus) >> 1);
> > -               }
> >
> > -               /* Consider allowing a small imbalance between NUMA groups */
> > -               if (env->sd->flags & SD_NUMA)
> > -                       env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                                               busiest->sum_nr_running);
> > +                       /* Consider allowing a small imbalance between NUMA groups */
> > +                       if (env->sd->flags & SD_NUMA &&
> > +                           local->group_type == busiest->group_type)
> > +                               env->imbalance = adjust_numa_imbalance(env->imbalance,
> > +                                                       busiest->sum_nr_running);
> > +               }
> >
> >                 return;
> >         }
> > --
> >
>
>
> --
> -Jirka



-- 
-Jirka


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-15 14:24                                             ` Peter Zijlstra
@ 2020-05-21 10:38                                               ` Mel Gorman
  2020-05-21 11:41                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-21 10:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel

On Fri, May 15, 2020 at 04:24:44PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> > On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > 
> > > However, the wakeups are so rapid that the wakeup
> > > happens while the server is descheduling. That forces the waker to spin
> > > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > > add the task to the rq->wake_list even if that potentially requires an IPI.
> > 
> > Right, I think Rik ran into that as well at some point. He wanted to
> > make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> > cpu (which is currently in the middle of schedule()) should be an easier
> > proposition.
> > 
> > Maybe something like this untested thing... could explode most mighty,
> > didn't thing too hard.
> > 
> 
> Mel pointed out that that patch got mutilated somewhere (my own .Sent
> copy was fine), let me try again.
> 

Sorry for the slow response. My primary work machine suffered a
catastrophic failure on Sunday night, which is a fantastic way to start
a Monday morning, so I'm playing catch-up.

IIUC, this patch front-loads as much work as possible before checking if
the task is still on_cpu and then, if the waker/wakee do not share a
cache, queues the task on the wake_list and otherwise does a direct
wakeup.

The advantage is that spinning on p->on_cpu is avoided when p does not
share a cache. The disadvantage is that it may result in tasks being
stacked but this should only happen when the domain is overloaded and
select_task_rq() is unlikely to find an idle CPU. The load balancer would
soon correct the situation anyway.
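
To make that concrete, the try_to_wake_up() flow with the patch applied is
roughly the following (a condensed sketch of the patch being discussed, not
the diff itself; at the point of the early queue, 'cpu' is still the CPU
the task is descheduling on):

	if (p->on_rq && ttwu_remote(p, wake_flags))
		goto unlock;			/* task is still runnable */

	/* iowait accounting and TASK_WAKING are now handled up front */

	/*
	 * If the wakee is still descheduling and the waker does not share a
	 * cache with it, queue the wakeup on the wakee's old CPU instead of
	 * spinning on p->on_cpu.
	 */
	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
		goto unlock;

	smp_cond_load_acquire(&p->on_cpu, !VAL);

	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
	ttwu_queue(p, cpu, wake_flags);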

In terms of netperf for my testing, the benefit is marginal because the
wakeups are primarily between tasks that share cache. The new path does
trigger, as perf indicates that some time is spent in ttwu_queue_remote()
with this patch; it's just that the overall time spent spinning on
p->on_cpu is very similar. I'm still waiting on other workloads to
complete to see
what the impact is.

However, intuitively at least, it makes sense to avoid spinning on p->on_cpu
when it's unnecessary and the other changes appear to be safe.  Even if
wake_list should be used in some cases for local wakeups, it would make
sense to put that on top of this patch. Do you want to slap a changelog
around this and update the comments or do you want me to do it? I should
have more results in a few hours even if they are limited to one machine
but ideally Rik would test his workload too.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-20 13:58                                       ` Jirka Hladky
  2020-05-20 16:01                                         ` Jirka Hladky
@ 2020-05-21 11:06                                         ` Mel Gorman
       [not found]                                         ` <20200521140931.15232-1-hdanton@sina.com>
  2 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-21 11:06 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Hillf Danton, Phil Auld, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, aokuliar, kkolakow

On Wed, May 20, 2020 at 03:58:01PM +0200, Jirka Hladky wrote:
> Hi Hillf, Mel and all,
> 
> thanks for the patch! It has produced really GOOD results.
> 
> 1) It has fixed performance problems with 5.7 vanilla kernel for
> single-tenant workload and low system load scenarios, without
> performance degradation for the multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
> 

I worry that what it's doing is sort of reverting the patch, but in a
relatively obscure way. We already know that a side-effect of having a
pair of tasks sharing cache is that the wakeup paths can be more expensive
and the difference in the wakeup path exceeds the advantage of using
local memory. However, I don't think the right approach long term is to
keep tasks artificially apart so that a different wakeup path is used as
a side-effect.

The patch also needs a changelog and better comments to follow exactly
what the rationale is. Take this part


+               //No imbalance computed without spare capacity
+               if (env->dst_stats.node_type != env->src_stats.node_type)
+                       goto check_imb;

I'm ignoring the coding style of the C++ comments but minimally that should
be fixed. More importantly, node_type can be one of node_overloaded,
node_has_spare or node_fully_busy and this is checking if there is a
mismatch. However, it's not taking into account whether the dst_node
is more or less loaded than the src and does not appear to be actually
checking spare capacity like the comment suggests.
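
For reference, the classification in question (introduced earlier in this
series; comments paraphrased) is:

	enum numa_type {
		/* The node has spare capacity to run more tasks */
		node_has_spare = 0,
		/* CPUs are fully used, but tasks do not compete for them */
		node_fully_busy,
		/* The node cannot give all tasks the CPU time they expect */
		node_overloaded
	};

so the quoted check only compares these coarse categories.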

Then there is this part

+               imbalance = adjust_numa_imbalance(imbalance,
+                                               env->src_stats.nr_running);
+
+               //Do nothing without imbalance
+               if (!imbalance) {
+                       imbalance = 2;
+                       goto check_imb;
+               }

So... if there is no imbalance, assume there is an imbalance of 2, jump to
a branch that will always be false and fall through to code that ignores
the value of imbalance ...... it's hard to see exactly why that code flow
is ideal.
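
Spelling the quoted hunk out as a condensed trace (illustration only, with
the env-> prefixes dropped):

	imbalance = 2;
	if (dst_stats.node_type != src_stats.node_type)
		goto check_imb;			/* imbalance stays 2 */

	imbalance = adjust_numa_imbalance(2, src_stats.nr_running);
	if (!imbalance) {
		imbalance = 2;			/* "no imbalance" becomes 2 */
		goto check_imb;			/* so the test below is false */
	}

	if (dst_stats.nr_running + 1 < src_stats.nr_running)
		imbalance = 0;

check_imb:
	if (!imbalance) {
		/* use an idle CPU */
	}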

> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
> 
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
> 
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput.

Which implies that we are once again using remote memory for netperf and
possibly getting some interference from NUMA balancing hinting faults.

> The testing was done on a dual-socket system
> with Gold 6132 CPU.
> 
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
> 

I can't at the moment due to a backlog on my test grid which isn't helped
by the fact that I lost two days of development time due to a thrashed work
machine. That aside, I'm finding it hard to see exactly why the patch
is suitable. What I've seen so far is that there are two outstanding
problems after the rewritten load balancer and the reconcilation between
load balancer and NUMA balancer.

The first is that 57abff067a08 ("sched/fair: Rework find_idlest_group()")
has shown up some problems with LKP reporting it here
https://lore.kernel.org/lkml/20200514141526.GA30976@xsang-OptiPlex-9020/
but I've also seen numerous workloads internally bisect to the same
commit. This patch is meant to ensure that finding the busiest group uses
similar logic to finding the idlest group but something is hiding in there.

The second is that waking two tasks sharing cache can be more expensive
than waking two remote tasks but it's preferable to fix the wakeup logic
than keep related tasks on separate caches just because it happens to
generate good numbers in some cases.

I feel that it makes sense to try and get both of those issues resolved
before making adjust_numa_imbalance a tunable or reverting it but I
really think it makes sense for communicating tasks to be sharing cache
if possible unless a node is overloaded or limited by memory bandwidth.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-21 10:38                                               ` Mel Gorman
@ 2020-05-21 11:41                                                 ` Peter Zijlstra
  2020-05-22 13:28                                                   ` Mel Gorman
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-21 11:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel

On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> IIUC, this patch front-loads as much work as possible before checking if
> the task is on_rq and then the waker/wakee shares a cache, queue task on
> the wake_list and otherwise do a direct wakeup.
> 
> The advantage is that spinning is avoided on p->on_rq when p does not
> share a cache. The disadvantage is that it may result in tasks being
> stacked but this should only happen when the domain is overloaded and
> select_task_eq() is unlikely to find an idle CPU. The load balancer would
> soon correct the situation anyway.
> 
> In terms of netperf for my testing, the benefit is marginal because the
> wakeups are primarily between tasks that share cache. It does trigger as
> perf indicates that some time is spent in ttwu_queue_remote with this
> patch, it's just that the overall time spent spinning on p->on_rq is
> very similar. I'm still waiting on other workloads to complete to see
> what the impact is.

So it might make sense to play with the exact conditions under which
we'll attempt this remote queue, if we see a large 'local' p->on_cpu
spin time, it might make sense to attempt the queue even in this case.

We could for example change it to:

	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
		goto unlock;

and then use that in ttwu_queue_remote() to differentiate between these
two cases.

Anyway, if it's a wash (atomic op vs spinning) then it's probably not
worth it.

Another optimization might be to forgo the IPI entirely in this case and
instead stick a sched_ttwu_pending() at the end of __schedule() or
something like that.
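
Very roughly, that second idea might look like the following untested sketch
(placement and locking in __schedule() unverified, purely to illustrate the
suggestion):

	static void __sched notrace __schedule(bool preempt)
	{
		...
		if (likely(prev != next)) {
			...
			rq = context_switch(rq, prev, next, &rf);

			/*
			 * Hypothetical: drain wakeups queued at this CPU here
			 * rather than waiting for the IPI.
			 */
			sched_ttwu_pending();
		}
		...
	}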

> However, intuitively at least, it makes sense to avoid spinning on p->on_rq
> when it's unnecessary and the other changes appear to be safe.  Even if
> wake_list should be used in some cases for local wakeups, it would make
> sense to put that on top of this patch. Do you want to slap a changelog
> around this and update the comments or do you want me to do it? I should
> have more results in a few hours even if they are limited to one machine
> but ideally Rik would test his workload too.

I've written you a Changelog, but please carry it in your set to
evaluate if it's actually worth it.

---
Subject: sched: Optimize ttwu() spinning on p->on_cpu
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 15 May 2020 16:24:44 +0200

Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared; this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int
 {
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+		__ttwu_queue_remote(p, cpu, wake_flags);
+		return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struc
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		ttwu_queue_remote(p, cpu, wake_flags);
+	if (ttwu_queue_remote(p, cpu, wake_flags))
 		return;
-	}
 #endif
 
 	rq_lock(rq, &rf);
@@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, un
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto unlock;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #ifdef CONFIG_SMP
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, un
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+		goto unlock;
 
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
@@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, un
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
-
-#else /* CONFIG_SMP */
-
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                         ` <20200521140931.15232-1-hdanton@sina.com>
@ 2020-05-21 16:04                                           ` Mel Gorman
       [not found]                                           ` <20200522010950.3336-1-hdanton@sina.com>
  1 sibling, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-21 16:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jirka Hladky, Phil Auld, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, aokuliar, kkolakow

On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > I'm ignoring the coding style of c++ comments but minimally that should
> > be fixed. More importantly, node_type can be one of node_overloaded,
> > node_has_spare or node_fully busy and this is checking if there is a
> > mismatch. However, it's not taking into account whether the dst_node
> > is more or less loaded than the src and does not appear to be actually
> > checking spare capacity like the comment suggests.
> > 
>
> Node types other than node_has_spare are not considered because a node of
> any other type cannot be idle enough that two communicating tasks would be
> better off "staying local."
> 

You hardcode an imbalance of 2 at the start without computing any
imbalance. Then if the source is fully_busy or overloaded while the
dst is idle, a task can move but that is based on an imaginary hard-coded
imbalance of 2. Finally, it appears that the load balancer and
NUMA balancer are once again using separate logic when one part of the
objective of the series is that the load balancer and NUMA balancer would
not override each other. As the imbalance is never computed, the patch
can create one which the load balancer then overrides. What am I missing?

That said, it does make a certain amount of sense that if the dst has
spare capacity and the src is fully busy or overloaded then allow the
migration regardless of imbalance but that would be a different patch --
something like this but I didn't think too deeply or test it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..97c0e090e161 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1916,19 +1916,27 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	 * imbalance that would be overruled by the load balancer.
 	 */
 	if (env->dst_stats.node_type == node_has_spare) {
-		unsigned int imbalance;
+		unsigned int imbalance = 0;
 		int src_running, dst_running;
 
 		/*
-		 * Would movement cause an imbalance? Note that if src has
-		 * more running tasks that the imbalance is ignored as the
-		 * move improves the imbalance from the perspective of the
-		 * CPU load balancer.
-		 * */
-		src_running = env->src_stats.nr_running - 1;
-		dst_running = env->dst_stats.nr_running + 1;
-		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, src_running);
+		 * Check the imbalance if both src and dst have spare
+		 * capacity. If src is fully_busy or overloaded then
+		 * allow the task to move as it's both improving locality
+		 * and reducing an imbalance.
+		 */
+		if (env->src_stats.node_type == node_has_spare) {
+			/*
+			 * Would movement cause an imbalance? Note that if src
+			 * has more running tasks that the imbalance is
+			 * ignored as the move improves the imbalance from the
+			 * perspective of the CPU load balancer.
+			 */
+			src_running = env->src_stats.nr_running - 1;
+			dst_running = env->dst_stats.nr_running + 1;
+			imbalance = max(0, dst_running - src_running);
+			imbalance = adjust_numa_imbalance(imbalance, src_running);
+		}
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {

> > Then there is this part
> > 
> > +               imbalance = adjust_numa_imbalance(imbalance,
> > +                                               env->src_stats.nr_running);
> > +
> > +               //Do nothing without imbalance
> > +               if (!imbalance) {
> > +                       imbalance = 2;
> > +                       goto check_imb;
> > +               }
> > 
> > So... if there is no imbalance, assume there is an imbalance of 2, jump to
> > a branch that will always be false and fall through to code that ignores
> > the value of imbalance ...... it's hard to see exactly why that code flow
> > is ideal.
> > 
> Once spare capacity is certain, then let's see how idle the node is.

But you no longer check how idle it is or how the idleness of the
destination compares to the source. adjust_numa_imbalance does
not calculate an imbalance, it only decides whether it should be
ignored.
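
For reference, the function as it stood before the disabling patch quoted
earlier in the thread is essentially:

	static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
	{
		unsigned int imbalance_min;

		/*
		 * Allow a small imbalance based on a simple pair of
		 * communicating tasks that remain local when the source
		 * domain is almost idle.
		 */
		imbalance_min = 2;
		if (src_nr_running <= imbalance_min)
			return 0;

		return imbalance;
	}

i.e. it either passes the computed imbalance through or zeroes it; it never
produces one.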

> And I
> cannot do that without the tool you created, bless you. Although I'm not
> sure they're two tasks talking to each other. This is another topic for the
> coming days.
> 

There is insufficient context in this path to determine if two tasks are
communicating. In some cases it may be inferred from last_wakee but that
only works for two tasks, it doesn't work for 1:n waker:wakees such as
might happen with a producer/consumer pattern. I guess you could record
the time that tasks migrated cross-node due to a wakeup and avoid migrating
those tasks for a period of time by either NUMA or load balancer but that
could cause transient overloads of a domain if there are a lot of wakeups
(e.g. hackbench). It's a change that would need to be done on both the
NUMA and load balancer to force the migration even if there are wakeup
relationships if the domain is fully busy or overloaded.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
       [not found]                                           ` <20200522010950.3336-1-hdanton@sina.com>
@ 2020-05-22 11:05                                             ` Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-05-22 11:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Jirka Hladky, Phil Auld, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Valentin Schneider, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, aokuliar, kkolakow

On Fri, May 22, 2020 at 09:09:50AM +0800, Hillf Danton wrote:
> 
> On Thu, 21 May 2020 17:04:08 +0100 Mel Gorman wrote:
> > On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > > > I'm ignoring the coding style of c++ comments but minimally that should
> > > > be fixed. More importantly, node_type can be one of node_overloaded,
> > > > node_has_spare or node_fully busy and this is checking if there is a
> > > > mismatch. However, it's not taking into account whether the dst_node
> > > > is more or less loaded than the src and does not appear to be actually
> > > > checking spare capacity like the comment suggests.
> > > > 
> > >
> > > Type other than node_has_spare is not considered because node is not
> > > possible to be so idle that two communicating tasks would better
> > > "stay local."
> > > 
> > 
> > You hardcode an imbalance of 2 at the start without computing any
> > imbalance.
> 
> Same result comes up if it's a bool.
> 

Then the magic number is simply confusing. The patch needs to be
a lot clearer about what the intent is if my patch that adds a "if
(env->src_stats.node_type == node_has_spare)" check is not what you were
aiming for.

> > Then if the source is fully_busy or overloaded while the
> > dst is idle, a task can move but that is based on an imaginary hard-coded
> > imbalance of 2.
> 
> This is the pitfall I walked around by checking spare capacity first. As
> for overloaded, I see it as a signal that a glitch is not idle somewhere
> else, and I prefer to push it in to ICU before it's too late.
> 

The domain could be overloaded simply due to CPU bindings. I cannot
parse the ICU comment (Intensive Care Unit?!?) :(

> > Finally, it appears that that the load balancer and
> > NUMA balancer are again using separate logic again when one part of the
> > objective of the series is that the load balancer and NUMA balancer would
> > not override each other.
> 
> This explains 80% of why it is a choppy road ahead.
> 

And I don't think we should go back to the load balancer and NUMA balancer
taking different actions. It ends up doing useless CPU migrations and
can lead to higher NUMA scanning activity. It's partially why I changed
task_numa_compare to prefer swapping with tasks that move to their
preferred node when using an idle CPU would cause an imbalance.

> > As the imbalance is never computed, the patch
> > can create one which the load balancer then overrides. What am I missing?
> 
> LB would give a green light if things move in the direction in favor of
> cutting imb.
> 

Load balancing primarily cares about balancing the number of idle CPUs
between domains when there is spare capacity. While it tries to avoid
balancing by moving tasks from their preferred node, it will do so if
there are no other candidates.
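
For reference, the spare-capacity path of calculate_imbalance() that the
hunks above touch boils down to roughly this (context trimmed):

	env->migration_type = migrate_task;
	env->imbalance = max_t(long, 0,
			       (local->idle_cpus - busiest->idle_cpus) >> 1);

	/* Consider allowing a small imbalance between NUMA groups */
	if (env->sd->flags & SD_NUMA)
		env->imbalance = adjust_numa_imbalance(env->imbalance,
						busiest->sum_nr_running);

so it balances idle CPUs first and only then forgives a small NUMA
imbalance.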

> > > > <SNIP>
> > > >
> > > > Then there is this part
> > > > 
> > > > +               imbalance = adjust_numa_imbalance(imbalance,
> > > > +                                               env->src_stats.nr_running);
> > > > +
> > > > +               //Do nothing without imbalance
> > > > +               if (!imbalance) {
> > > > +                       imbalance = 2;
> > > > +                       goto check_imb;
> > > > +               }
> > > > 
> > > > So... if there is no imbalance, assume there is an imbalance of 2, jump to
> > > > a branch that will always be false and fall through to code that ignores
> > > > the value of imbalance ...... it's hard to see exactly why that code flow
> > > > is ideal.
> > > > 
> > > With spare capacity be certain, then lets see how idle the node is.
> > 
> > But you no longer check how idle it is or what it's relative idleness of
> > the destination is relative to the source. adjust_numa_imbalance does
> > not calculate an imbalance, it only decides whether it should be
> > ignored.
> 
> Then the idle CPU is no longer so tempting.
> 

True. While it's perfectly possible to ignore imbalance and use an idle
CPU if it exists, the load balancer will simply override it later and we
go back to the NUMA balancer and load balancer fighting each other with
the NUMA balancer retrying migrations based on p->numa_migrate_retry.
Whatever logic is used to allow imbalances (be they based on communicating
tasks or preferred locality), it needs to be the same in both balancers.
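
For context, the retry mentioned above works roughly like this in the NUMA
fault path (from memory of kernels around this time, so treat it as
approximate):

	/*
	 * task_numa_fault(): periodically retry migrating the task to its
	 * preferred node in case a previous attempt failed or the scheduler
	 * moved it elsewhere.
	 */
	if (time_after(jiffies, p->numa_migrate_retry)) {
		task_numa_placement(p);
		numa_migrate_preferred(p);
	}

so a placement that the load balancer undoes will simply be attempted again
once the retry window passes.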

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-21 11:41                                                 ` Peter Zijlstra
@ 2020-05-22 13:28                                                   ` Mel Gorman
  2020-05-22 14:38                                                     ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Mel Gorman @ 2020-05-22 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel

On Thu, May 21, 2020 at 01:41:32PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> > IIUC, this patch front-loads as much work as possible before checking if
> > the task is on_rq and then the waker/wakee shares a cache, queue task on
> > the wake_list and otherwise do a direct wakeup.
> > 
> > The advantage is that spinning is avoided on p->on_rq when p does not
> > share a cache. The disadvantage is that it may result in tasks being
> > stacked but this should only happen when the domain is overloaded and
> > select_task_eq() is unlikely to find an idle CPU. The load balancer would
> > soon correct the situation anyway.
> > 
> > In terms of netperf for my testing, the benefit is marginal because the
> > wakeups are primarily between tasks that share cache. It does trigger as
> > perf indicates that some time is spent in ttwu_queue_remote with this
> > patch, it's just that the overall time spent spinning on p->on_rq is
> > very similar. I'm still waiting on other workloads to complete to see
> > what the impact is.
> 
> So it might make sense to play with the exact conditions under which
> we'll attempt this remote queue, if we see a large 'local' p->on_cpu
> spin time, it might make sense to attempt the queue even in this case.
> 
> We could for example change it to:
> 
> 	if (REAC_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
> 		goto unlock;
> 
> and then use that in ttwu_queue_remote() to differentiate between these
> two cases.
> 

>  #endif /* CONFIG_SMP */
>  
>  	ttwu_queue(p, cpu, wake_flags);

Is something like this on top of your patch what you had in mind?

---8<---

---
 kernel/sched/core.c  | 35 ++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  3 ++-
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 987b8ecf2ee9..435ecf5820ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,13 +2330,19 @@ void scheduler_ipi(void)
 	irq_exit();
 }
 
-static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+/*
+ * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
+ * necessary. The wakee CPU on receipt of the IPI will queue the task
+ * via sched_ttwu_wakeup() for activation instead of the waking task
+ * activating and queueing the wakee.
+ */
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_send_reschedule(cpu);
 		else
@@ -2373,12 +2379,23 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_remote(p, cpu, wake_flags);
-		return true;
+	if (sched_feat(TTWU_QUEUE)) {
+		/*
+		 * If CPU does not share cache then queue the task on the remote
+		 * rqs wakelist to avoid accessing remote data. Alternatively,
+		 * if the task is descheduling and the only running task on the
+		 * CPU then use the wakelist to offload the task activation to
+		 * the CPU that will soon be idle so the waker can continue.
+		 * nr_running is checked to avoid unnecessary task stacking.
+		 */
+		if (!cpus_share_cache(smp_processor_id(), cpu) ||
+		    ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)) {
+			sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+			__ttwu_queue_wakelist(p, cpu, wake_flags);
+			return true;
+		}
 	}
 
 	return false;
@@ -2391,7 +2408,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (ttwu_queue_remote(p, cpu, wake_flags))
+	if (ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 #endif
 
@@ -2611,7 +2628,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * let the waker make forward progress. This is safe because IRQs are
 	 * disabled and the IPI will deliver after on_cpu is cleared.
 	 */
-	if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+	if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
 		goto unlock;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..06297d1142a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_ON_RQ		0x08		/* Wakee is on_rq */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
  2020-05-22 13:28                                                   ` Mel Gorman
@ 2020-05-22 14:38                                                     ` Peter Zijlstra
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Zijlstra @ 2020-05-22 14:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jirka Hladky, Phil Auld, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Valentin Schneider, Hillf Danton, LKML, Douglas Shakshober,
	Waiman Long, Joe Mario, Bill Gray, riel

On Fri, May 22, 2020 at 02:28:54PM +0100, Mel Gorman wrote:

> Is something like this on top of your patch what you had in mind?

All under the assumption that it makes it go faster of course ;-)

> ---8<---

static inline bool ttwu_queue_cond(int cpu, int wake_flags)
{
	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rqs wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

	/*
	 * If the task is descheduling and the only running task on the
	 * CPU, ....
	 */
	if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
		return true;

	return false;
}

> -static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
> -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> -		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> -		__ttwu_queue_remote(p, cpu, wake_flags);
> -		return true;
> +	if (sched_feat(TTWU_QUEUE)) {
> +		/*
> +		 * If CPU does not share cache then queue the task on the remote
> +		 * rqs wakelist to avoid accessing remote data. Alternatively,
> +		 * if the task is descheduling and the only running task on the
> +		 * CPU then use the wakelist to offload the task activation to
> +		 * the CPU that will soon be idle so the waker can continue.
> +		 * nr_running is checked to avoid unnecessary task stacking.
> +		 */
> +		if (!cpus_share_cache(smp_processor_id(), cpu) ||
> +		    ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)) {
> +			sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> +			__ttwu_queue_wakelist(p, cpu, wake_flags);
> +			return true;
> +		}

	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
		__ttwu_queue_wakelist(p, cpu, wake_flags);
		return true;

>  	}
>  
>  	return false;


might be easier to read...
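
Putting the helper and the caller together, a minimal sketch of the
refactored code might look like the following (assuming ttwu_queue_cond()
takes the target cpu and the wake_flags, and that the queueing function
keeps the __ttwu_queue_wakelist() name from the patch being discussed):

static inline bool ttwu_queue_cond(int cpu, int wake_flags)
{
	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rq's wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

	/*
	 * If the task is descheduling and the only running task on the
	 * CPU, use the wakelist to offload the activation to the CPU
	 * that is about to go idle so the waker can continue.
	 * nr_running is checked to avoid unnecessary task stacking.
	 */
	if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
		return true;

	return false;
}

static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
		__ttwu_queue_wakelist(p, cpu, wake_flags);
		return true;
	}

	return false;
}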

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found
  2020-02-19 14:07 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v5 Mel Gorman
@ 2020-02-19 14:07 ` Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-19 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

When domains are imbalanced or overloaded, all CPUs on the target
domain are searched and compared with task_numa_compare(). In some
circumstances, a candidate is found that is an obvious win:

o A move to an idle CPU is permitted and an idle CPU is found
o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.
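
A condensed sketch of the resulting flow (simplified from the diff
below; the real functions take additional parameters and hold
rcu_read_lock() across the comparison):

	/* task_numa_compare() now reports whether the search can stop. */
	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
		stopsearch = true;	/* move to an idle CPU was found */

	if (!maymove && env->best_task &&
	    env->best_task->numa_preferred_nid == env->src_nid)
		stopsearch = true;	/* best swap candidate moves to its preferred node */

	...

	/* task_numa_find_cpu() stops scanning once an obvious win is seen. */
	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
		...
		env->dst_cpu = cpu;
		if (task_numa_compare(env, taskimp, groupimp, maymove))
			break;
	}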

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d790fac0072c..3060ba94e813 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	long moveimp = imp;
 	long load;
+	bool stopsearch = false;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.
 	 */
-	if (cur == env->p)
+	if (cur == env->p) {
+		stopsearch = true;
 		goto unlock;
+	}
 
 	if (!cur) {
 		if (maymove && moveimp >= env->best_imp)
@@ -1860,8 +1863,27 @@ static void task_numa_compare(struct task_numa_env *env,
 	}
 
 	task_numa_assign(env, cur, imp);
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
+		stopsearch = true;
+
+	/*
+	 * If a swap candidate must be identified and the current best task
+	 * moves its preferred node then stop the search.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		stopsearch = true;
+	}
 unlock:
 	rcu_read_unlock();
+
+	return stopsearch;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1916,7 +1938,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found
  2020-02-19 13:54 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v4 Mel Gorman
@ 2020-02-19 13:54 ` Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-19 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

When domains are imbalanced or overloaded, all CPUs on the target
domain are searched and compared with task_numa_compare(). In some
circumstances, a candidate is found that is an obvious win:

o A move to an idle CPU is permitted and an idle CPU is found
o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 935baf529f10..a1d5760f7985 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	long moveimp = imp;
 	long load;
+	bool stopsearch = false;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.
 	 */
-	if (cur == env->p)
+	if (cur == env->p) {
+		stopsearch = true;
 		goto unlock;
+	}
 
 	if (!cur) {
 		if (maymove && moveimp >= env->best_imp)
@@ -1860,8 +1863,27 @@ static void task_numa_compare(struct task_numa_env *env,
 	}
 
 	task_numa_assign(env, cur, imp);
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
+		stopsearch = true;
+
+	/*
+	 * If a swap candidate must be identified and the current best task
+	 * moves its preferred node then stop the search.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		stopsearch = true;
+	}
 unlock:
 	rcu_read_unlock();
+
+	return stopsearch;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1911,7 +1933,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found
  2020-02-17 10:43 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v3 Mel Gorman
@ 2020-02-17 10:44 ` Mel Gorman
  0 siblings, 0 replies; 86+ messages in thread
From: Mel Gorman @ 2020-02-17 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld,
	Hillf Danton, LKML, Mel Gorman

When domains are imbalanced or overloaded, all CPUs on the target
domain are searched and compared with task_numa_compare(). In some
circumstances, a candidate is found that is an obvious win:

o A move to an idle CPU is permitted and an idle CPU is found
o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 935baf529f10..a1d5760f7985 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	long moveimp = imp;
 	long load;
+	bool stopsearch = false;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.
 	 */
-	if (cur == env->p)
+	if (cur == env->p) {
+		stopsearch = true;
 		goto unlock;
+	}
 
 	if (!cur) {
 		if (maymove && moveimp >= env->best_imp)
@@ -1860,8 +1863,27 @@ static void task_numa_compare(struct task_numa_env *env,
 	}
 
 	task_numa_assign(env, cur, imp);
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu))
+		stopsearch = true;
+
+	/*
+	 * If a swap candidate must be identified and the current best task
+	 * moves its preferred node then stop the search.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		stopsearch = true;
+	}
 unlock:
 	rcu_read_unlock();
+
+	return stopsearch;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1911,7 +1933,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2020-05-22 14:39 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-24  9:52 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Mel Gorman
2020-02-24  9:52 ` [PATCH 01/13] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
2020-02-24  9:52 ` [PATCH 02/13] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 03/13] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] sched/numa: Distinguish between the different task_numa_migrate() " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 04/13] sched/fair: Reorder enqueue/dequeue_task_fair path Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2020-02-24  9:52 ` [PATCH 05/13] sched/numa: Replace runnable_load_avg by load_avg Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2020-02-24  9:52 ` [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 07/13] sched/pelt: Remove unused runnable load average Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2020-02-24  9:52 ` [PATCH 08/13] sched/pelt: Add a new runnable average signal Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2020-02-24 16:01     ` Valentin Schneider
2020-02-24 16:34       ` Mel Gorman
2020-02-25  8:23       ` Vincent Guittot
2020-02-24  9:52 ` [PATCH 09/13] sched/fair: Take into account runnable_avg to classify group Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2020-02-24  9:52 ` [PATCH 10/13] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] sched/numa: Prefer using an idle CPU " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 11/13] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 12/13] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-02-24  9:52 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
2020-02-24 15:20   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-02-24 15:16 ` [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6 Ingo Molnar
2020-02-25 11:59   ` Mel Gorman
2020-02-25 13:28     ` Vincent Guittot
2020-02-25 14:24       ` Mel Gorman
2020-02-25 14:53         ` Vincent Guittot
2020-02-27  9:09         ` Ingo Molnar
2020-03-09 19:12 ` Phil Auld
2020-03-09 20:36   ` Mel Gorman
2020-03-12  9:54     ` Mel Gorman
2020-03-12 12:17       ` Jirka Hladky
     [not found]       ` <CAE4VaGA4q4_qfC5qe3zaLRfiJhvMaSb2WADgOcQeTwmPvNat+A@mail.gmail.com>
2020-03-12 15:56         ` Mel Gorman
2020-03-12 17:06           ` Jirka Hladky
     [not found]           ` <CAE4VaGD8DUEi6JnKd8vrqUL_8HZXnNyHMoK2D+1-F5wo+5Z53Q@mail.gmail.com>
2020-03-12 21:47             ` Mel Gorman
2020-03-12 22:24               ` Jirka Hladky
2020-03-20 15:08                 ` Jirka Hladky
     [not found]                 ` <CAE4VaGC09OfU2zXeq2yp_N0zXMbTku5ETz0KEocGi-RSiKXv-w@mail.gmail.com>
2020-03-20 15:22                   ` Mel Gorman
2020-03-20 15:33                     ` Jirka Hladky
     [not found]                     ` <CAE4VaGBGbTT8dqNyLWAwuiqL8E+3p1_SqP6XTTV71wNZMjc9Zg@mail.gmail.com>
2020-03-20 16:38                       ` Mel Gorman
2020-03-20 17:21                         ` Jirka Hladky
2020-05-07 15:24                         ` Jirka Hladky
2020-05-07 15:54                           ` Mel Gorman
2020-05-07 16:29                             ` Jirka Hladky
2020-05-07 17:49                               ` Phil Auld
     [not found]                                 ` <20200508034741.13036-1-hdanton@sina.com>
2020-05-18 14:52                                   ` Jirka Hladky
     [not found]                                     ` <20200519043154.10876-1-hdanton@sina.com>
2020-05-20 13:58                                       ` Jirka Hladky
2020-05-20 16:01                                         ` Jirka Hladky
2020-05-21 11:06                                         ` Mel Gorman
     [not found]                                         ` <20200521140931.15232-1-hdanton@sina.com>
2020-05-21 16:04                                           ` Mel Gorman
     [not found]                                           ` <20200522010950.3336-1-hdanton@sina.com>
2020-05-22 11:05                                             ` Mel Gorman
2020-05-08  9:22                               ` Mel Gorman
2020-05-08 11:05                                 ` Jirka Hladky
     [not found]                                 ` <CAE4VaGC_v6On-YvqdTwAWu3Mq4ofiV0pLov-QpV+QHr_SJr+Rw@mail.gmail.com>
2020-05-13 14:57                                   ` Jirka Hladky
2020-05-13 15:30                                     ` Mel Gorman
2020-05-13 16:20                                       ` Jirka Hladky
2020-05-14  9:50                                         ` Mel Gorman
     [not found]                                           ` <CAE4VaGCGUFOAZ+YHDnmeJ95o4W0j04Yb7EWnf8a43caUQs_WuQ@mail.gmail.com>
2020-05-14 10:08                                             ` Mel Gorman
2020-05-14 10:22                                               ` Jirka Hladky
2020-05-14 11:50                                                 ` Mel Gorman
2020-05-14 13:34                                                   ` Jirka Hladky
2020-05-14 15:31                                       ` Peter Zijlstra
2020-05-15  8:47                                         ` Mel Gorman
2020-05-15 11:17                                           ` Peter Zijlstra
2020-05-15 13:03                                             ` Mel Gorman
2020-05-15 13:12                                               ` Peter Zijlstra
2020-05-15 13:28                                                 ` Peter Zijlstra
2020-05-15 14:24                                             ` Peter Zijlstra
2020-05-21 10:38                                               ` Mel Gorman
2020-05-21 11:41                                                 ` Peter Zijlstra
2020-05-22 13:28                                                   ` Mel Gorman
2020-05-22 14:38                                                     ` Peter Zijlstra
2020-05-15 11:28                                           ` Peter Zijlstra
2020-05-15 12:22                                             ` Mel Gorman
2020-05-15 12:51                                               ` Peter Zijlstra
2020-05-15 14:43                                       ` Jirka Hladky
  -- strict thread matches above, loose matches on Subject: below --
2020-02-19 14:07 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v5 Mel Gorman
2020-02-19 14:07 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
2020-02-19 13:54 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v4 Mel Gorman
2020-02-19 13:54 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman
2020-02-17 10:43 [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v3 Mel Gorman
2020-02-17 10:44 ` [PATCH 13/13] sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).