linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer
@ 2020-02-12  9:36 Mel Gorman
  2020-02-12  9:36 ` [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains Mel Gorman
                   ` (12 more replies)
  0 siblings, 13 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

The NUMA balancer makes placement decisions that only partially take
the load balancer into account, and vice versa, so the two are
inconsistent. This can result in placement decisions that override
each other, leading to unnecessary migrations -- both task placement
and page placement. This is a prototype series that attempts to
reconcile the decisions. It's a bit premature and it would also need
to be reconciled with Vincent's series "[PATCH 0/4] remove
runnable_load_avg and improve group_classify"

The first three patches are unrelated and are either pending in tip or
should be but they were part of the testing of this series so I have to
mention them.

The fourth and fifth patches are tracing only and were needed to get
sensible data out of ftrace with respect to task placement for NUMA
balancing. Patches 6-8 reduce overhead and reduce the chances of NUMA
balancing overriding itself. Patches 9-11 try to bring the CPU placement
decisions of NUMA balancing in line with the load balancer.

In terms of Vincent's patches, I have not checked but I expect conflicts
to be with patches 10 and 11.

Note that this is not necessarily a universal performance win although
performance results are generally ok (small gains/losses depending on
the machine and workload). However, task migrations, page migrations,
variability and overall overhead are generally reduced.

Tests are still running and take quite a long time so I do not have a
full picture. The main reference workload I used was specjbb running one
JVM per node, which would typically be expected to split evenly. It's
an interesting workload because the number of "warehouses" does not
relate linearly to the number of running tasks due to the creation of
GC threads and other interfering activity. The mmtests configuration used
is jvm-specjbb2005-multi with two runs -- one with ftrace enabling
relevant scheduler tracepoints.

The baseline is taken from late in the 5.6 merge window plus patches 1-4
to take into account patches that are already in flight and the tracing
patch I relied on for analysis.

The headline performance of the series looks like

			     baseline-v1          lboverload-v1
Hmean     tput-1     37842.47 (   0.00%)    42391.63 *  12.02%*
Hmean     tput-2     94225.00 (   0.00%)    91937.32 (  -2.43%)
Hmean     tput-3    141855.04 (   0.00%)   142100.59 (   0.17%)
Hmean     tput-4    186799.96 (   0.00%)   184338.10 (  -1.32%)
Hmean     tput-5    229918.54 (   0.00%)   230894.68 (   0.42%)
Hmean     tput-6    271006.38 (   0.00%)   271367.35 (   0.13%)
Hmean     tput-7    312279.37 (   0.00%)   314141.97 (   0.60%)
Hmean     tput-8    354916.09 (   0.00%)   357029.57 (   0.60%)
Hmean     tput-9    397299.92 (   0.00%)   399832.32 (   0.64%)
Hmean     tput-10   438169.79 (   0.00%)   442954.02 (   1.09%)
Hmean     tput-11   476864.31 (   0.00%)   484322.15 (   1.56%)
Hmean     tput-12   512327.04 (   0.00%)   519117.29 (   1.33%)
Hmean     tput-13   528983.50 (   0.00%)   530772.34 (   0.34%)
Hmean     tput-14   537757.24 (   0.00%)   538390.58 (   0.12%)
Hmean     tput-15   535328.60 (   0.00%)   539402.88 (   0.76%)
Hmean     tput-16   539356.59 (   0.00%)   545617.63 (   1.16%)
Hmean     tput-17   535370.94 (   0.00%)   547217.95 (   2.21%)
Hmean     tput-18   540510.94 (   0.00%)   548145.71 (   1.41%)
Hmean     tput-19   536737.76 (   0.00%)   545281.39 (   1.59%)
Hmean     tput-20   537509.85 (   0.00%)   543759.71 (   1.16%)
Hmean     tput-21   534632.44 (   0.00%)   544848.03 (   1.91%)
Hmean     tput-22   531538.29 (   0.00%)   540987.41 (   1.78%)
Hmean     tput-23   523364.37 (   0.00%)   536640.28 (   2.54%)
Hmean     tput-24   530613.55 (   0.00%)   531431.12 (   0.15%)
Stddev    tput-1      1569.78 (   0.00%)      674.58 (  57.03%)
Stddev    tput-2         8.49 (   0.00%)     1368.25 (-16025.00%)
Stddev    tput-3      4125.26 (   0.00%)     1120.06 (  72.85%)
Stddev    tput-4      4677.51 (   0.00%)      717.71 (  84.66%)
Stddev    tput-5      3387.75 (   0.00%)     1774.13 (  47.63%)
Stddev    tput-6      1400.07 (   0.00%)     1079.75 (  22.88%)
Stddev    tput-7      4374.16 (   0.00%)     2571.75 (  41.21%)
Stddev    tput-8      2370.22 (   0.00%)     2918.23 ( -23.12%)
Stddev    tput-9      3893.33 (   0.00%)     2708.93 (  30.42%)
Stddev    tput-10     6260.02 (   0.00%)     3935.05 (  37.14%)
Stddev    tput-11     3989.50 (   0.00%)     6443.16 ( -61.50%)
Stddev    tput-12      685.19 (   0.00%)    12999.45 (-1797.21%)
Stddev    tput-13     3251.98 (   0.00%)     9311.18 (-186.32%)
Stddev    tput-14     2793.78 (   0.00%)     6175.87 (-121.06%)
Stddev    tput-15     6777.62 (   0.00%)    25942.33 (-282.76%)
Stddev    tput-16    25057.04 (   0.00%)     4227.08 (  83.13%)
Stddev    tput-17    22336.80 (   0.00%)    16890.66 (  24.38%)
Stddev    tput-18     6662.36 (   0.00%)     3015.10 (  54.74%)
Stddev    tput-19    20395.79 (   0.00%)     1098.14 (  94.62%)
Stddev    tput-20    17140.27 (   0.00%)     9019.15 (  47.38%)
Stddev    tput-21     5176.73 (   0.00%)     4300.62 (  16.92%)
Stddev    tput-22    28279.32 (   0.00%)     6544.98 (  76.86%)
Stddev    tput-23    25368.87 (   0.00%)     3621.09 (  85.73%)
Stddev    tput-24     3082.28 (   0.00%)     2500.33 (  18.88%)

Generally, this is showing a small gain in performance but it's
borderline noise. However, in most cases, variability between
the JVMs' performance is much reduced except at the point where
a node is almost fully utilised.
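
The "Hmean" figures in the table above are harmonic means of the
per-run throughputs, which is how mmtests summarises them. A minimal
sketch of that summary (illustrative helper, not part of mmtests):

```c
#include <assert.h>
#include <math.h>

/*
 * Harmonic mean of n samples: n / sum(1/x_i). Compared with an
 * arithmetic mean, it penalises low outliers more heavily, which
 * suits throughput figures where a single slow run matters.
 */
static double hmean(const double *x, int n)
{
	double recip_sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		recip_sum += 1.0 / x[i];

	return (double)n / recip_sum;
}
```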

The high-level NUMA stats from /proc/vmstat look like this

NUMA base-page range updates     1710927.00     2199691.00
NUMA PTE updates                  871759.00     1060491.00
NUMA PMD updates                    1639.00        2225.00
NUMA hint faults                  772179.00      967165.00
NUMA hint local faults %          647558.00      845357.00
NUMA hint local percent               83.86          87.41
NUMA pages migrated                64920.00       45254.00
AutoNUMA cost                       3874.10        4852.08

The percentage of local hits is higher (87.41% vs 83.86%) and the
number of pages migrated is reduced by 30%. The downside is
that there are spikes where scanning is higher because in some
cases NUMA balancing will not move a task to a local node if
the CPU load balancer would immediately override it. It is not
straightforward to fix this in a universal way and it should
be a separate series.
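
The derived "local percent" line is simply the local hint faults as a
fraction of all hint faults. A throwaway sketch (illustrative helper,
not kernel code) reproduces the figures above from the raw counters:

```c
#include <assert.h>

/*
 * NUMA hint local percent = 100 * local hint faults / total hint
 * faults, matching how the percentage is derived from the raw
 * /proc/vmstat counters shown above.
 */
static double numa_local_percent(unsigned long local, unsigned long total)
{
	return 100.0 * (double)local / (double)total;
}
```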

A separate run gathered information from ftrace and analysed it
offline.

                                             5.5.0           5.5.0
                                         baseline    lboverload-v1
Migrate failed no CPU                      1934.00         4999.00
Migrate failed move to   idle                 0.00            0.00
Migrate failed swap task fail               981.00         2810.00
Task Migrated swapped                      6765.00        12609.00
Task Migrated swapped local NID               0.00            0.00
Task Migrated swapped within group          644.00         1105.00
Task Migrated idle CPU                    14776.00          750.00
Task Migrated idle CPU local NID              0.00            0.00
Task Migrate retry                         2521.00         7564.00
Task Migrate retry success                    0.00            0.00
Task Migrate retry failed                  2521.00         7564.00
Load Balance cross NUMA                 1222195.00      1223454.00

"Migrate failed no CPU" counts the times when NUMA balancing did not
find a suitable CPU on the preferred node. This is increased because
the series avoids making decisions that the LB would override.

"Migrate failed swap task fail" is when migrate_swap() fails; it can
fail for many reasons.

"Task Migrated swapped" is also higher but this is somewhat positive.
It counts when two tasks are swapped to keep load neutral or improved
from the perspective of the load balancer. The series attempts, for
example, to swap tasks so that both move to their preferred node.

"Task Migrated idle CPU" is also reduced. Again, this reflects that
the series is trying to avoid the NUMA balancer and the LB fighting
each other.

"Task Migrate retry failed" happens when NUMA balancing makes multiple
attempts to place a task on its preferred node and fails.

So broadly speaking, similar or better performance with fewer page
migrations and less conflict between the two balancers for at least one
workload and one machine. There is room for improvement and I need data
on more workloads and machines but an early review would be nice.

 include/trace/events/sched.h |  51 +++--
 kernel/sched/core.c          |  11 --
 kernel/sched/fair.c          | 430 ++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h         |  13 ++
 4 files changed, 379 insertions(+), 126 deletions(-)

-- 
2.16.4



* [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 02/11] sched/fair: Optimize select_idle_core() Mel Gorman
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

The CPU load balancer balances between different domains to spread load
and strives for equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the task's
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it too simply conflicts with the load balancer.

This patch allows a fixed degree of imbalance of two tasks to exist between
NUMA domains when the source domain is almost idle. In some cases, this
prevents communicating tasks being pulled apart. It was evaluated whether
the imbalance should be scaled to the domain size. However, no additional
benefit was measured across a range of workloads and machines and scaling
adds the risk that lower domains have to be rebalanced. While this could
change again in the future, such a change should specify the use case
and benefit.
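
The decision can be restated in isolation (a userspace sketch of the
patch's logic, not the kernel code itself): even out the number of
idle CPUs as before, but when the domain is a NUMA domain and the
busiest group runs at most two tasks, discard the computed imbalance:

```c
#include <assert.h>

#define MY_SD_NUMA 0x1	/* stand-in for the kernel's SD_NUMA flag */

/*
 * Sketch of the "allow a small imbalance" rule: balance idle CPUs as
 * usual, but tolerate an imbalance of up to two tasks across NUMA
 * domains when the busiest group is almost idle, so a pair of
 * communicating tasks is not pulled apart.
 */
static long numa_imbalance(int sd_flags, long local_idle_cpus,
			   long busiest_idle_cpus,
			   unsigned int busiest_nr_running)
{
	long imbalance = (local_idle_cpus - busiest_idle_cpus) >> 1;

	if (imbalance < 0)
		imbalance = 0;

	/* Allow a small imbalance for a pair of communicating tasks */
	if ((sd_flags & MY_SD_NUMA) && busiest_nr_running <= 2)
		imbalance = 0;

	return imbalance;
}
```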

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

 2-socket Haswell machine 48 core, HT enabled
 netperf-tcp -- mmtests config config-network-netperf-unbound
			      baseline              lbnuma-v3
 Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
 Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
 Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
 Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
 Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
 Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
 Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
 Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
 Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
 Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
 Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
 Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
 Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
 Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
 Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
 Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
 Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
 Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)

In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not universal.
For most machines, the impact was roughly a 3% performance gain.

 Ops NUMA base-page range updates       19796.00         292.00
 Ops NUMA PTE updates                   19796.00         292.00
 Ops NUMA PMD updates                       0.00           0.00
 Ops NUMA hint faults                   16113.00         143.00
 Ops NUMA hint local faults %            8407.00         142.00
 Ops NUMA hint local percent               52.18          99.30
 Ops NUMA pages migrated                 4244.00           1.00

Without the patch, only 52.18% of sampled accesses are local.  In an
earlier changelog, 100% of sampled accesses were local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rate was 99.3% but note the "base-page range updates"
and "PTE updates".  The activity with the patch is negligible, as were
the number of faults. The small number of pages migrated were related to
shared libraries.  A 2-socket Broadwell showed better results on average
but they are not presented for brevity as the performance was similar
except that 100% of the sampled NUMA hints were local. The patch holds
up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.

 Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
 Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
 Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
 Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
 Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
 Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
 Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
 Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
 Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
 Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
 Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
 Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
 Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)

For higher worker thread counts, the differences become negligible but
it's interesting to note the difference in wakeup latency at low
utilisation, and mpstat confirms that activity was almost all on one
node until the number of worker threads increased.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible, with small gains/losses within the noise. This is
because the number of threads far exceeds the small imbalance the patch
cares about. Similarly, there were reports of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small.  For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
---
 kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d775375..199d1476bb90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,10 +8658,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8701,16 +8697,31 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_min;
+
+			/*
+			 * Allow a small imbalance based on a simple pair of
+			 * communicating tasks that remain local when the
+			 * source domain is almost idle.
+			 */
+			imbalance_min = 2;
+			if (busiest->sum_nr_running <= imbalance_min)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
 
-- 
2.16.4



* [PATCH 02/11] sched/fair: Optimize select_idle_core()
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
  2020-02-12  9:36 ` [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 03/11] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Currently we loop through all threads of a core to evaluate if the core
is idle or not. This is unnecessary. If a thread of a core is not idle,
skip evaluating the other threads of that core. Also, while clearing the
cpumask, the bits of all CPUs of a core can be cleared in one shot.
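
The scan change can be illustrated outside the kernel with a plain
bitmask standing in for cpu_smt_mask() (a toy model with fixed CPU and
core counts, not the scheduler code): break out on the first busy
sibling instead of testing every thread, then clear the whole core's
bits in one operation, as the patch does with cpumask_andnot():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: 8 CPUs, 2 SMT siblings per core. "busy" is a bitmask of
 * non-idle CPUs; "cpus" tracks which CPUs are still worth scanning. */
#define SMT_MASK(core) (0x3u << ((core) * 2))

static int find_idle_core(uint8_t busy, uint8_t *cpus)
{
	int core;

	for (core = 0; core < 4; core++) {
		bool idle = true;
		int cpu;

		if (!(*cpus & SMT_MASK(core)))
			continue;

		for (cpu = core * 2; cpu < core * 2 + 2; cpu++) {
			if (busy & (1u << cpu)) {
				idle = false;
				break;	/* skip the remaining siblings */
			}
		}

		/* Clear all of this core's siblings in one shot */
		*cpus &= ~(uint8_t)SMT_MASK(core);

		if (idle)
			return core;
	}
	return -1;
}
```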

Collecting ticks on a Power 9 SMT 8 system around select_idle_core
while running schbench shows us

(units are in ticks, hence lower is better)
Without patch
    N        Min     Max     Median         Avg      Stddev
x 130        151    1083        284   322.72308   144.41494

With patch
    N        Min     Max     Median         Avg      Stddev   Improvement
x 164         88     610        201   225.79268   106.78943        30.03%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20191206172422.6578-1-srikar@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 199d1476bb90..b058a9ceba7f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5787,10 +5787,12 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			__cpumask_clear_cpu(cpu, cpus);
-			if (!available_idle_cpu(cpu))
+			if (!available_idle_cpu(cpu)) {
 				idle = false;
+				break;
+			}
 		}
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 
 		if (idle)
 			return core;
-- 
2.16.4



* [PATCH 03/11] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
  2020-02-12  9:36 ` [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains Mel Gorman
  2020-02-12  9:36 ` [PATCH 02/11] sched/fair: Optimize select_idle_core() Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 04/11] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

The following XFS commit:

  8ab39f11d974 ("xfs: prevent CIL push holdoff in log recovery")

changed the logic from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were increased.

The current pattern of the task migration is straightforward. With XFS,
an IO issuer delegates work to xlog_cil_push_work() on an unbound kworker.
This runs on a nearby CPU and on completion, dbench wakes up on its old CPU
as it is still idle and no migration occurs. dbench then queues the real
IO on the blk_mq_requeue_work() work item which runs on a bound kworker
which is forced to run on the same CPU as dbench. When IO completes,
the bound kworker wakes dbench but as the kworker is a bound,
real task, the CPU is not considered idle and dbench gets migrated by
select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs
for a while but ultimately it starts a round-robin of all CPUs sharing
the same LLC. High-frequency migration on each IO completion has poor
performance overall. It has negative implications both in communication
costs and power management. mpstat confirmed that at low thread counts
all CPUs sharing an LLC have low levels of activity.

Note that even if the CIL patch was reverted, there still would
be migrations but the impact is less noticeable. It turns out that
individually the scheduler, XFS, blk-mq and workqueues all made sensible
decisions but in combination, the overall effect was sub-optimal.

This patch special cases the IO issue/completion pattern and allows
a bound kworker waker and a task wakee to stack on the same CPU if
there is a strong chance they are directly related. The expectation
is that the kworker is likely going back to sleep shortly. This is not
guaranteed as the IO could be queued asynchronously but there is a very
strong relationship between the task and kworker in this case that would
justify stacking on the same CPU instead of migrating. There should be
few concerns about kworker starvation given that the special casing is
only when the kworker is the waker.
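
The stacking condition itself is compact and can be restated outside
the scheduler (a hypothetical predicate mirroring the hunk applied to
select_idle_sibling() below, not the kernel function itself):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Restatement of the wakeup special case: let the wakee stack on its
 * previous CPU when the waker is a CPU-bound kthread running there and
 * the runqueue will be (nearly) empty once the kthread sleeps again.
 */
static bool allow_kworker_stack(bool waker_is_per_cpu_kthread,
				int prev_cpu, int this_cpu,
				unsigned int this_rq_nr_running)
{
	return waker_is_per_cpu_kthread &&
	       prev_cpu == this_cpu &&
	       this_rq_nr_running <= 1;
}
```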

DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem

UMA machine with 8 cores sharing LLC
                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124           kworkerstack
Amean     1        22.63 (   0.00%)       20.54 *   9.23%*
Amean     2        25.56 (   0.00%)       23.40 *   8.44%*
Amean     4        28.63 (   0.00%)       27.85 *   2.70%*
Amean     8        37.66 (   0.00%)       37.68 (  -0.05%)
Amean     64      469.47 (   0.00%)      468.26 (   0.26%)
Stddev    1         1.00 (   0.00%)        0.72 (  28.12%)
Stddev    2         1.62 (   0.00%)        1.97 ( -21.54%)
Stddev    4         2.53 (   0.00%)        3.58 ( -41.19%)
Stddev    8         5.30 (   0.00%)        5.20 (   1.92%)
Stddev    64       86.36 (   0.00%)       94.53 (  -9.46%)

NUMA machine, 48 CPUs total, 24 CPUs share cache
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Amean     1         58.69 (   0.00%)       30.21 *  48.53%*
Amean     2         60.90 (   0.00%)       35.29 *  42.05%*
Amean     4         66.77 (   0.00%)       46.55 *  30.28%*
Amean     8         81.41 (   0.00%)       68.46 *  15.91%*
Amean     16       113.29 (   0.00%)      107.79 *   4.85%*
Amean     32       199.10 (   0.00%)      198.22 *   0.44%*
Amean     64       478.99 (   0.00%)      477.06 *   0.40%*
Amean     128     1345.26 (   0.00%)     1372.64 *  -2.04%*
Stddev    1          2.64 (   0.00%)        4.17 ( -58.08%)
Stddev    2          4.35 (   0.00%)        5.38 ( -23.73%)
Stddev    4          6.77 (   0.00%)        6.56 (   3.00%)
Stddev    8         11.61 (   0.00%)       10.91 (   6.04%)
Stddev    16        18.63 (   0.00%)       19.19 (  -3.01%)
Stddev    32        38.71 (   0.00%)       38.30 (   1.06%)
Stddev    64       100.28 (   0.00%)       91.24 (   9.02%)
Stddev    128      186.87 (   0.00%)      160.34 (  14.20%)

Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench than a throughput
metric as the benchmark makes many different system calls that are not
throughput-related.

The patch shows a 9.23% and 48.53% reduction in the time to process a
load file, with the difference partially explained by the number of CPUs
sharing an LLC. In a separate run, task migrations were almost eliminated
by the patch for low client counts. In case people take issue with the
metric used for the benchmark, this is a comparison of the throughputs
as reported by dbench on the NUMA machine.

dbench4 Throughput (misleading but traditional)
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Hmean     1        321.41 (   0.00%)      617.82 *  92.22%*
Hmean     2        622.87 (   0.00%)     1066.80 *  71.27%*
Hmean     4       1134.56 (   0.00%)     1623.74 *  43.12%*
Hmean     8       1869.96 (   0.00%)     2212.67 *  18.33%*
Hmean     16      2673.11 (   0.00%)     2806.13 *   4.98%*
Hmean     32      3032.74 (   0.00%)     3039.54 (   0.22%)
Hmean     64      2514.25 (   0.00%)     2498.96 *  -0.61%*
Hmean     128     1778.49 (   0.00%)     1746.05 *  -1.82%*

Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 11 -----------
 kernel/sched/fair.c  | 14 ++++++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc007604..1f615a223791 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,17 +1442,6 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_SMP
 
-static inline bool is_per_cpu_kthread(struct task_struct *p)
-{
-	if (!(p->flags & PF_KTHREAD))
-		return false;
-
-	if (p->nr_cpus_allowed != 1)
-		return false;
-
-	return true;
-}
-
 /*
  * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b058a9ceba7f..d6b1d70e0b83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5914,6 +5914,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
+	/*
+	 * Allow a per-cpu kthread to stack with the wakee if the
+	 * kworker thread and the tasks previous CPUs are the same.
+	 * The assumption is that the wakee queued work for the
+	 * per-cpu kthread that is now complete and the wakeup is
+	 * essentially a sync wakeup. An obvious example of this
+	 * pattern is IO completions.
+	 */
+	if (is_per_cpu_kthread(current) &&
+	    prev == smp_processor_id() &&
+	    this_rq()->nr_running <= 1) {
+		return prev;
+	}
+
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	if (recent_used_cpu != prev &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..5876e6ba5903 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+#ifdef CONFIG_SMP
+static inline bool is_per_cpu_kthread(struct task_struct *p)
+{
+	if (!(p->flags & PF_KTHREAD))
+		return false;
+
+	if (p->nr_cpus_allowed != 1)
+		return false;
+
+	return true;
+}
+#endif
-- 
2.16.4



* [PATCH 04/11] sched/numa: Trace when no candidate CPU was found on the preferred node
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (2 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 03/11] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node. The case where no candidate CPU could be found is
not traced, which is an important gap. The tracepoint is not fired when
the task is not allowed to run on any CPU on the preferred node or when
the task is already running on the target CPU, but neither is an
interesting corner case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6b1d70e0b83..4ab18fba5b82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1848,8 +1848,10 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/* No better CPU than the current one was found. */
-	if (env.best_cpu == -1)
+	if (env.best_cpu == -1) {
+		trace_sched_stick_numa(p, env.src_cpu, -1);
 		return -EAGAIN;
+	}
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (3 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 04/11] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12 14:43   ` Steven Rostedt
  2020-02-12  9:36 ` [PATCH 06/11] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node but, from the trace, it is not possible to tell the
difference between "no CPU found", "migration to idle CPU failed" and
"tasks could not be swapped". Extend the tracepoint accordingly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/trace/events/sched.h | 51 +++++++++++++++++++++++++++++++++-----------
 kernel/sched/fair.c          |  6 +++---
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..3d07c0af4ab8 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -487,7 +487,11 @@ TRACE_EVENT(sched_process_hang,
 );
 #endif /* CONFIG_DETECT_HUNG_TASK */
 
-DECLARE_EVENT_CLASS(sched_move_task_template,
+/*
+ * Tracks migration of tasks from one runqueue to another. Can be used to
+ * detect if automatic NUMA balancing is bouncing between nodes
+ */
+TRACE_EVENT(sched_move_numa,
 
 	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
 
@@ -519,20 +523,43 @@ DECLARE_EVENT_CLASS(sched_move_task_template,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
-/*
- * Tracks migration of tasks from one runqueue to another. Can be used to
- * detect if automatic NUMA balancing is bouncing between nodes
- */
-DEFINE_EVENT(sched_move_task_template, sched_move_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+TRACE_EVENT(sched_stick_numa,
 
-	TP_ARGS(tsk, src_cpu, dst_cpu)
-);
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu, struct task_struct *dst_tsk, int dst_cpu),
 
-DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
-	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
 
-	TP_ARGS(tsk, src_cpu, dst_cpu)
+	TP_STRUCT__entry(
+		__field( pid_t,	src_pid			)
+		__field( pid_t,	src_tgid		)
+		__field( pid_t,	src_ngid		)
+		__field( int,	src_cpu			)
+		__field( int,	src_nid			)
+		__field( pid_t,	dst_pid			)
+		__field( pid_t,	dst_tgid		)
+		__field( pid_t,	dst_ngid		)
+		__field( int,	dst_cpu			)
+		__field( int,	dst_nid			)
+	),
+
+	TP_fast_assign(
+		__entry->src_pid	= task_pid_nr(src_tsk);
+		__entry->src_tgid	= task_tgid_nr(src_tsk);
+		__entry->src_ngid	= task_numa_group_id(src_tsk);
+		__entry->src_cpu	= src_cpu;
+		__entry->src_nid	= cpu_to_node(src_cpu);
+		__entry->dst_pid	= dst_tsk ? task_pid_nr(dst_tsk) : 0;
+		__entry->dst_tgid	= dst_tsk ? task_tgid_nr(dst_tsk) : 0;
+		__entry->dst_ngid	= dst_tsk ? task_numa_group_id(dst_tsk) : 0;
+		__entry->dst_cpu	= dst_cpu;
+		__entry->dst_nid	= dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1;
+	),
+
+	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
+			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
+			__entry->src_cpu, __entry->src_nid,
+			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
+			__entry->dst_cpu, __entry->dst_nid)
 );
 
 TRACE_EVENT(sched_swap_numa,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ab18fba5b82..6005ce28033b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1849,7 +1849,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 	/* No better CPU than the current one was found. */
 	if (env.best_cpu == -1) {
-		trace_sched_stick_numa(p, env.src_cpu, -1);
+		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
 		return -EAGAIN;
 	}
 
@@ -1858,7 +1858,7 @@ static int task_numa_migrate(struct task_struct *p)
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
-			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
+			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
 		return ret;
 	}
 
@@ -1866,7 +1866,7 @@ static int task_numa_migrate(struct task_struct *p)
 	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
-		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
+		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
 	put_task_struct(env.best_task);
 	return ret;
 }
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 06/11] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (4 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 07/11] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

task_numa_find_cpu can scan a node multiple times. Minimally it scans to
gather statistics and later to find a suitable target. In some cases, the
second scan will simply pick an idle CPU if the load is not imbalanced.

This patch caches information on an idle core while gathering statistics
and uses it immediately if load is not imbalanced to avoid a second scan
of the node runqueues. Preference is given to an idle core rather than an
idle SMT sibling to avoid packing HT siblings due to linearly scanning the
node cpumask.

As a side-effect, even when the second scan is necessary, the importance
of using select_idle_sibling is much reduced because information on idle
CPUs is cached and can be reused.

Note that this patch actually makes it harder to move to an idle CPU
as multiple tasks can race for the same idle CPU when checking
numa_migrate_on. This is addressed in the next patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 123 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 107 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6005ce28033b..d2a58b19430e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1486,23 +1486,87 @@ struct numa_stats {
 
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
+
+	/* Details on idle CPUs */
+	int idle_cpu;
 };
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+
+		if (!idle_cpu(sibling))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
+/* Forward declarations of select_idle_sibling helpers */
+static inline bool test_idle_cores(int cpu, bool def);
+
 /*
- * XXX borrowed from update_sg_lb_stats
+ * Gather all necessary information to make NUMA balancing placement
+ * decisions that are compatible with the standard load balancer. This
+ * borrows code and logic from update_sg_lb_stats but sharing a
+ * common implementation is impractical.
  */
-static void update_numa_stats(struct numa_stats *ns, int nid)
+static void
+update_numa_stats(struct numa_stats *ns, int nid,
+		  struct task_struct *p, bool find_idle)
 {
-	int cpu;
+	int cpu, idle_core = -1;
+	int last_llc_id = -1;
+	bool check_smt = false;
 
 	memset(ns, 0, sizeof(*ns));
+	ns->idle_cpu = -1;
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
 		ns->load += cpu_runnable_load(rq);
 		ns->compute_capacity += capacity_of(cpu);
+
+		if (find_idle && !rq->nr_running && idle_cpu(cpu)) {
+			int this_llc_id;
+
+			if (READ_ONCE(rq->numa_migrate_on) ||
+			    !cpumask_test_cpu(cpu, p->cpus_ptr))
+				continue;
+
+			if (ns->idle_cpu == -1)
+				ns->idle_cpu = cpu;
+
+			if (!static_branch_likely(&sched_smt_present) ||
+			    idle_core >= 0) {
+				continue;
+			}
+
+			/* Check if idle cores exist on this LLC */
+			this_llc_id = per_cpu(sd_llc_id, cpu);
+			if (last_llc_id != this_llc_id) {
+				check_smt = test_idle_cores(cpu, false);
+				last_llc_id = this_llc_id;
+			}
+
+			/*
+			 * Prefer cores instead of packing HT siblings
+			 * and triggering future load balancing.
+			 */
+			if (check_smt && is_core_idle(cpu))
+				idle_core = cpu;
+			check_smt = false;
+		}
 	}
 
+	if (idle_core >= 0)
+		ns->idle_cpu = idle_core;
 }
 
 struct task_numa_env {
@@ -1527,7 +1591,7 @@ static void task_numa_assign(struct task_numa_env *env,
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	/* Bail out if run-queue part of active NUMA balance. */
-	if (xchg(&rq->numa_migrate_on, 1))
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
 		return;
 
 	/*
@@ -1691,19 +1755,39 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 assign:
-	/*
-	 * One idle CPU per node is evaluated for a task numa move.
-	 * Call select_idle_sibling to maybe find a better one.
-	 */
+	/* Evaluate an idle CPU for a task numa move. */
 	if (!cur) {
+		int cpu = env->dst_stats.idle_cpu;
+
+		/* Nothing cached so current CPU went idle since the search. */
+		if (cpu < 0)
+			cpu = env->dst_cpu;
+
 		/*
-		 * select_idle_siblings() uses an per-CPU cpumask that
-		 * can be used from IRQ context.
+		 * If the CPU is no longer truly idle and the previous best CPU
+		 * is, keep using it.
 		 */
-		local_irq_disable();
-		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+		if (!idle_cpu(cpu) && env->best_cpu >= 0 &&
+		    idle_cpu(env->best_cpu)) {
+			cpu = env->best_cpu;
+		}
+
+		/*
+		 * Use select_idle_sibling if the previously found idle CPU is
+		 * not idle any more.
+		 */
+		if (!idle_cpu(cpu)) {
+			/*
+			 * select_idle_siblings() uses an per-CPU cpumask that
+			 * can be used from IRQ context.
+			 */
+			local_irq_disable();
+			cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
-		local_irq_enable();
+			local_irq_enable();
+		}
+
+		env->dst_cpu = cpu;
 	}
 
 	task_numa_assign(env, cur, imp);
@@ -1728,6 +1812,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	 */
 	maymove = !load_too_imbalanced(src_load, dst_load, env);
 
+	/* Use an idle CPU if one has been found already */
+	if (maymove && env->dst_stats.idle_cpu >= 0) {
+		env->dst_cpu = env->dst_stats.idle_cpu;
+		task_numa_assign(env, NULL, 0);
+		return;
+	}
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
@@ -1788,10 +1879,10 @@ static int task_numa_migrate(struct task_struct *p)
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
 	taskweight = task_weight(p, env.src_nid, dist);
 	groupweight = group_weight(p, env.src_nid, dist);
-	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env.src_stats, env.src_nid, env.p, false);
 	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
 	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
-	update_numa_stats(&env.dst_stats, env.dst_nid);
+	update_numa_stats(&env.dst_stats, env.dst_nid, env.p, true);
 
 	/* Try to find a spot on the preferred nid. */
 	task_numa_find_cpu(&env, taskimp, groupimp);
@@ -1824,7 +1915,7 @@ static int task_numa_migrate(struct task_struct *p)
 
 			env.dist = dist;
 			env.dst_nid = nid;
-			update_numa_stats(&env.dst_stats, env.dst_nid);
+			update_numa_stats(&env.dst_stats, env.dst_nid, env.p, true);
 			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 07/11] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (5 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 06/11] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-12  9:36 ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

Multiple tasks can attempt to select an idle CPU but fail because
numa_migrate_on is already set and the migration fails. Instead of
failing, scan for an alternative idle CPU. select_idle_sibling is not
used because it requires IRQs to be disabled and it ignores
numa_migrate_on, allowing multiple tasks to stack. This scan may still
fail due to races even when idle candidate CPUs exist but, if this
occurs, it's better that a task stays on an available CPU than moves to
a contended one.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2a58b19430e..3f518b0d9261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1590,15 +1590,34 @@ static void task_numa_assign(struct task_numa_env *env,
 {
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
-	/* Bail out if run-queue part of active NUMA balance. */
-	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1))
+	/* Check if run-queue part of active NUMA balance. */
+	if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) {
+		int cpu;
+		int start = env->dst_cpu;
+
+		/* Find alternative idle CPU. */
+		for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) {
+			if (cpu == env->best_cpu || !idle_cpu(cpu) ||
+			    !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+				continue;
+			}
+
+			env->dst_cpu = cpu;
+			rq = cpu_rq(env->dst_cpu);
+			if (!xchg(&rq->numa_migrate_on, 1))
+				goto assign;
+		}
+
+		/* Failed to find an alternative idle CPU */
 		return;
+	}
 
+assign:
 	/*
 	 * Clear previous best_cpu/rq numa-migrate flag, since task now
 	 * found a better CPU to move/swap.
 	 */
-	if (env->best_cpu != -1) {
+	if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
 	}
@@ -1772,21 +1791,6 @@ static void task_numa_compare(struct task_numa_env *env,
 			cpu = env->best_cpu;
 		}
 
-		/*
-		 * Use select_idle_sibling if the previously found idle CPU is
-		 * not idle any more.
-		 */
-		if (!idle_cpu(cpu)) {
-			/*
-			 * select_idle_siblings() uses an per-CPU cpumask that
-			 * can be used from IRQ context.
-			 */
-			local_irq_disable();
-			cpu = select_idle_sibling(env->p, env->src_cpu,
-						   env->dst_cpu);
-			local_irq_enable();
-		}
-
 		env->dst_cpu = cpu;
 	}
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (6 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 07/11] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
@ 2020-02-12  9:36 ` Mel Gorman
  2020-02-13 10:31   ` Peter Zijlstra
  2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2020-02-12  9:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML,
	Mel Gorman

When swapping tasks for NUMA balancing, it is preferred that tasks move
to or remain on their preferred node. When considering an imbalance,
encourage tasks to move to their preferred node and discourage tasks from
moving away from their preferred node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f518b0d9261..24fc90b8036a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1707,18 +1707,27 @@ static void task_numa_compare(struct task_numa_env *env,
 			goto unlock;
 	}
 
+	/* Skip this swap candidate if cannot move to the source cpu. */
+	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
+		goto unlock;
+
+	/*
+	 * Skip this swap candidate if it is not moving to its preferred
+	 * node and the best task is.
+	 */
+	if (env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid &&
+	    cur->numa_preferred_nid != env->src_nid) {
+		goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
 	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
-	 */
-	/* Skip this swap candidate if cannot move to the source cpu */
-	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
-		goto unlock;
-
-	/*
+	 *
 	 * If dst and source tasks are in the same NUMA group, or not
 	 * in any group then look only at task weights.
 	 */
@@ -1745,12 +1754,35 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
+	/* Discourage picking a task already on its preferred node */
+	if (cur->numa_preferred_nid == env->dst_nid)
+		imp -= imp / 16;
+
+	/*
+	 * Encourage picking a task that moves to its preferred node.
+	 * This potentially makes imp larger than its maximum of
+	 * 1998 (see SMALLIMP and task_weight for why) but in this
+	 * case, it does not matter.
+	 */
+	if (cur->numa_preferred_nid == env->src_nid)
+		imp += imp / 8;
+
 	if (maymove && moveimp > imp && moveimp > env->best_imp) {
 		imp = moveimp;
 		cur = NULL;
 		goto assign;
 	}
 
+	/*
+	 * If a swap is required then prefer moving a task to its preferred
+	 * nid over a task that is not moving to a preferred nid.
+	 */
+	if (cur->numa_preferred_nid == env->src_nid && env->best_task &&
+	    env->best_task->numa_preferred_nid != env->src_nid) {
+		goto assign;
+	}
+
+
 	/*
 	 * If the NUMA importance is less than SMALLIMP,
 	 * task migration might only result in ping pong
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (7 preceding siblings ...)
  2020-02-12  9:36 ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
@ 2020-02-12 13:22 ` Vincent Guittot
  2020-02-12 14:07   ` Valentin Schneider
  2020-02-12 15:48   ` Mel Gorman
  2020-02-12 15:45 ` [PATCH 09/11] sched/fair: Split out helper to adjust imbalances between domains Mel Gorman
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-02-12 13:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

Hi Mel,

On Wed, 12 Feb 2020 at 10:36, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The NUMA balancer makes placement decisions on tasks that partially
> take the load balancer into account and vice versa but there are
> inconsistencies. This can result in placement decisions that override
> each other leading to unnecessary migrations -- both task placement and
> page placement. This is a prototype series that attempts to reconcile the
> decisions. It's a bit premature but it would also need to be reconciled
> with Vincent's series "[PATCH 0/4] remove runnable_load_avg and improve
> group_classify"
>
> The first three patches are unrelated and are either pending in tip or
> should be but they were part of the testing of this series so I have to
> mention them.
>
> The fourth and fifth patches are tracing only and were needed to get
> sensible data out of ftrace with respect to task placement for NUMA
> balancing. Patches 6-8 reduce overhead and reduce the chances of NUMA
> balancing overriding itself. Patches 9-11 try to bring the CPU placement
> decisions of NUMA balancing in line with the load balancer.

Don't know if it's only me but I can't find patches 9-11 on mailing list

>
> In terms of Vincent's patches, I have not checked but I expect conflicts
> to be with patches 10 and 11.
>
> Note that this is not necessarily a universal performance win although
> performance results are generally ok (small gains/losses depending on
> the machine and workload). However, task migrations, page migrations,
> variability and overall overhead are generally reduced.
>
> Tests are still running and take quite a long time so I do not have a
> full picture. The main reference workload I used was specjbb running one
> JVM per node which typically would be expected to split evenly. It's
> an interesting workload because the number of "warehouses" does not
> relate linearly to the number of running tasks due to the creation of
> GC threads and other interfering activity. The mmtests configuration used
> is jvm-specjbb2005-multi with two runs -- one with ftrace enabling
> relevant scheduler tracepoints.
>
> The baseline is taken from late in the 5.6 merge window plus patches 1-4
> to take into account patches that are already in flight and the tracing
> patch I relied on for analysis.
>
> The headline performance of the series looks like
>
>                              baseline-v1          lboverload-v1
> Hmean     tput-1     37842.47 (   0.00%)    42391.63 *  12.02%*
> Hmean     tput-2     94225.00 (   0.00%)    91937.32 (  -2.43%)
> Hmean     tput-3    141855.04 (   0.00%)   142100.59 (   0.17%)
> Hmean     tput-4    186799.96 (   0.00%)   184338.10 (  -1.32%)
> Hmean     tput-5    229918.54 (   0.00%)   230894.68 (   0.42%)
> Hmean     tput-6    271006.38 (   0.00%)   271367.35 (   0.13%)
> Hmean     tput-7    312279.37 (   0.00%)   314141.97 (   0.60%)
> Hmean     tput-8    354916.09 (   0.00%)   357029.57 (   0.60%)
> Hmean     tput-9    397299.92 (   0.00%)   399832.32 (   0.64%)
> Hmean     tput-10   438169.79 (   0.00%)   442954.02 (   1.09%)
> Hmean     tput-11   476864.31 (   0.00%)   484322.15 (   1.56%)
> Hmean     tput-12   512327.04 (   0.00%)   519117.29 (   1.33%)
> Hmean     tput-13   528983.50 (   0.00%)   530772.34 (   0.34%)
> Hmean     tput-14   537757.24 (   0.00%)   538390.58 (   0.12%)
> Hmean     tput-15   535328.60 (   0.00%)   539402.88 (   0.76%)
> Hmean     tput-16   539356.59 (   0.00%)   545617.63 (   1.16%)
> Hmean     tput-17   535370.94 (   0.00%)   547217.95 (   2.21%)
> Hmean     tput-18   540510.94 (   0.00%)   548145.71 (   1.41%)
> Hmean     tput-19   536737.76 (   0.00%)   545281.39 (   1.59%)
> Hmean     tput-20   537509.85 (   0.00%)   543759.71 (   1.16%)
> Hmean     tput-21   534632.44 (   0.00%)   544848.03 (   1.91%)
> Hmean     tput-22   531538.29 (   0.00%)   540987.41 (   1.78%)
> Hmean     tput-23   523364.37 (   0.00%)   536640.28 (   2.54%)
> Hmean     tput-24   530613.55 (   0.00%)   531431.12 (   0.15%)
> Stddev    tput-1      1569.78 (   0.00%)      674.58 (  57.03%)
> Stddev    tput-2         8.49 (   0.00%)     1368.25 (-16025.00%)
> Stddev    tput-3      4125.26 (   0.00%)     1120.06 (  72.85%)
> Stddev    tput-4      4677.51 (   0.00%)      717.71 (  84.66%)
> Stddev    tput-5      3387.75 (   0.00%)     1774.13 (  47.63%)
> Stddev    tput-6      1400.07 (   0.00%)     1079.75 (  22.88%)
> Stddev    tput-7      4374.16 (   0.00%)     2571.75 (  41.21%)
> Stddev    tput-8      2370.22 (   0.00%)     2918.23 ( -23.12%)
> Stddev    tput-9      3893.33 (   0.00%)     2708.93 (  30.42%)
> Stddev    tput-10     6260.02 (   0.00%)     3935.05 (  37.14%)
> Stddev    tput-11     3989.50 (   0.00%)     6443.16 ( -61.50%)
> Stddev    tput-12      685.19 (   0.00%)    12999.45 (-1797.21%)
> Stddev    tput-13     3251.98 (   0.00%)     9311.18 (-186.32%)
> Stddev    tput-14     2793.78 (   0.00%)     6175.87 (-121.06%)
> Stddev    tput-15     6777.62 (   0.00%)    25942.33 (-282.76%)
> Stddev    tput-16    25057.04 (   0.00%)     4227.08 (  83.13%)
> Stddev    tput-17    22336.80 (   0.00%)    16890.66 (  24.38%)
> Stddev    tput-18     6662.36 (   0.00%)     3015.10 (  54.74%)
> Stddev    tput-19    20395.79 (   0.00%)     1098.14 (  94.62%)
> Stddev    tput-20    17140.27 (   0.00%)     9019.15 (  47.38%)
> Stddev    tput-21     5176.73 (   0.00%)     4300.62 (  16.92%)
> Stddev    tput-22    28279.32 (   0.00%)     6544.98 (  76.86%)
> Stddev    tput-23    25368.87 (   0.00%)     3621.09 (  85.73%)
> Stddev    tput-24     3082.28 (   0.00%)     2500.33 (  18.88%)
>
> Generally, this is showing a small gain in performance but it's
> borderline noise. However, in most cases, variability between
> the JVM performance is much reduced except at the point where
> a node is almost fully utilised.
>
> The high-level NUMA stats from /proc/vmstat look like this
>
> NUMA base-page range updates     1710927.00     2199691.00
> NUMA PTE updates                  871759.00     1060491.00
> NUMA PMD updates                    1639.00        2225.00
> NUMA hint faults                  772179.00      967165.00
> NUMA hint local faults %          647558.00      845357.00
> NUMA hint local percent               83.86          87.41
> NUMA pages migrated                64920.00       45254.00
> AutoNUMA cost                       3874.10        4852.08
>
> The percentage of local hits is higher (87.41% vs 83.86%). The
> number of pages migrated is reduced by 30%. The downside is
> that there are spikes when scanning is higher because in some
> cases NUMA balancing will not move a task to a local node if
> the CPU load balancer would immediately override it but it's
> not straight-forward to fix this in a universal way and should
> be a separate series.
>
> A separate run gathered information from ftrace and analysed it
> offline.
>
>                                              5.5.0           5.5.0
>                                          baseline    lboverload-v1
> Migrate failed no CPU                      1934.00         4999.00
> Migrate failed move to   idle                 0.00            0.00
> Migrate failed swap task fail               981.00         2810.00
> Task Migrated swapped                      6765.00        12609.00
> Task Migrated swapped local NID               0.00            0.00
> Task Migrated swapped within group          644.00         1105.00
> Task Migrated idle CPU                    14776.00          750.00
> Task Migrated idle CPU local NID              0.00            0.00
> Task Migrate retry                         2521.00         7564.00
> Task Migrate retry success                    0.00            0.00
> Task Migrate retry failed                  2521.00         7564.00
> Load Balance cross NUMA                 1222195.00      1223454.00
>
> "Migrate failed no CPU" is the number of times NUMA balancing did not
> find a suitable CPU on a preferred node. This is increased because
> the series avoids making decisions that the LB would override.
>
> "Migrate failed swap task fail" is when migrate_swap fails and it
> can fail for a lot of reasons.
>
> "Task Migrated swapped" is also higher but this is somewhat positive.
> It is when two tasks are swapped to keep load neutral or improved
> from the perspective of the load balancer. The series attempts to
> swap tasks that both move to their preferred node for example.
>
> "Task Migrated idle CPU" is also reduced. Again, this is a reflection
> that the series is trying to avoid NUMA Balancer and LB fighting
> each other.
>
> "Task Migrate retry failed" happens when NUMA balancing makes multiple
> attempts to place a task on a preferred node.
>
> So broadly speaking, similar or better performance with fewer page
> migrations and less conflict between the two balancers for at least one
> workload and one machine. There is room for improvement and I need data
> on more workloads and machines but an early review would be nice.
>
>  include/trace/events/sched.h |  51 +++--
>  kernel/sched/core.c          |  11 --
>  kernel/sched/fair.c          | 430 ++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h         |  13 ++
>  4 files changed, 379 insertions(+), 126 deletions(-)
>
> --
> 2.16.4
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer
  2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
@ 2020-02-12 14:07   ` Valentin Schneider
  2020-02-12 15:48   ` Mel Gorman
  1 sibling, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2020-02-12 14:07 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Phil Auld, LKML



On 12/02/2020 13:22, Vincent Guittot wrote:
> Don't know if it's only me but I can't find patches 9-11 on mailing list
> 

Same here, doesn't show up on lore either:
https://lore.kernel.org/lkml/20200212093654.4816-1-mgorman@techsingularity.net/#r

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases
  2020-02-12  9:36 ` [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
@ 2020-02-12 14:43   ` Steven Rostedt
  2020-02-12 15:59     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2020-02-12 14:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Valentin Schneider, Phil Auld,
	LKML

On Wed, 12 Feb 2020 09:36:48 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> sched:sched_stick_numa is meant to fire when a task is unable to migrate
> to the preferred node but from the trace, it's not possible to tell the
> difference between "no CPU found", "migration to idle CPU failed" and
> "tasks could not be swapped". Extend the tracepoint accordingly.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/trace/events/sched.h | 51 +++++++++++++++++++++++++++++++++-----------
>  kernel/sched/fair.c          |  6 +++---
>  2 files changed, 42 insertions(+), 15 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e56e55..3d07c0af4ab8 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -487,7 +487,11 @@ TRACE_EVENT(sched_process_hang,
>  );
>  #endif /* CONFIG_DETECT_HUNG_TASK */
>  
> -DECLARE_EVENT_CLASS(sched_move_task_template,
> +/*
> + * Tracks migration of tasks from one runqueue to another. Can be used to
> + * detect if automatic NUMA balancing is bouncing between nodes
> + */
> +TRACE_EVENT(sched_move_numa,
>  
>  	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
>  
> @@ -519,20 +523,43 @@ DECLARE_EVENT_CLASS(sched_move_task_template,
>  			__entry->dst_cpu, __entry->dst_nid)
>  );
>  
> -/*
> - * Tracks migration of tasks from one runqueue to another. Can be used to
> - * detect if automatic NUMA balancing is bouncing between nodes
> - */
> -DEFINE_EVENT(sched_move_task_template, sched_move_numa,
> -	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +TRACE_EVENT(sched_stick_numa,
>  
> -	TP_ARGS(tsk, src_cpu, dst_cpu)
> -);
> +	TP_PROTO(struct task_struct *src_tsk, int src_cpu, struct task_struct *dst_tsk, int dst_cpu),
>  
> -DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
> -	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> +	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
>  
> -	TP_ARGS(tsk, src_cpu, dst_cpu)
> +	TP_STRUCT__entry(
> +		__field( pid_t,	src_pid			)
> +		__field( pid_t,	src_tgid		)
> +		__field( pid_t,	src_ngid		)
> +		__field( int,	src_cpu			)
> +		__field( int,	src_nid			)
> +		__field( pid_t,	dst_pid			)
> +		__field( pid_t,	dst_tgid		)
> +		__field( pid_t,	dst_ngid		)
> +		__field( int,	dst_cpu			)
> +		__field( int,	dst_nid			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->src_pid	= task_pid_nr(src_tsk);
> +		__entry->src_tgid	= task_tgid_nr(src_tsk);
> +		__entry->src_ngid	= task_numa_group_id(src_tsk);
> +		__entry->src_cpu	= src_cpu;
> +		__entry->src_nid	= cpu_to_node(src_cpu);
> +		__entry->dst_pid	= dst_tsk ? task_pid_nr(dst_tsk) : 0;
> +		__entry->dst_tgid	= dst_tsk ? task_tgid_nr(dst_tsk) : 0;
> +		__entry->dst_ngid	= dst_tsk ? task_numa_group_id(dst_tsk) : 0;
> +		__entry->dst_cpu	= dst_cpu;
> +		__entry->dst_nid	= dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1;
> +	),
> +
> +	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
> +			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
> +			__entry->src_cpu, __entry->src_nid,
> +			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
> +			__entry->dst_cpu, __entry->dst_nid)
>  );
>  

The above looks the same as the below sched_swap_numa. Can you make a
DECLARE_EVENT_CLASS() and merge the two for sched_swap_numa?

Note, most of the footprint of a trace event happens in the
DECLARE_EVENT_CLASS() (a TRACE_EVENT() is just a DECLARE_EVENT_CLASS()
and DEFINE_EVENT() put together). The more DECLARE_EVENT_CLASS()s you
can share, the smaller the footprint is.

-- Steve


>  TRACE_EVENT(sched_swap_numa,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ab18fba5b82..6005ce28033b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1849,7 +1849,7 @@ static int task_numa_migrate(struct task_struct *p)
>  
>  	/* No better CPU than the current one was found. */
>  	if (env.best_cpu == -1) {
> -		trace_sched_stick_numa(p, env.src_cpu, -1);
> +		trace_sched_stick_numa(p, env.src_cpu, NULL, -1);
>  		return -EAGAIN;
>  	}
>  
> @@ -1858,7 +1858,7 @@ static int task_numa_migrate(struct task_struct *p)
>  		ret = migrate_task_to(p, env.best_cpu);
>  		WRITE_ONCE(best_rq->numa_migrate_on, 0);
>  		if (ret != 0)
> -			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
> +			trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
>  		return ret;
>  	}
>  
> @@ -1866,7 +1866,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	WRITE_ONCE(best_rq->numa_migrate_on, 0);
>  
>  	if (ret != 0)
> -		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
> +		trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu);
>  	put_task_struct(env.best_task);
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 09/11] sched/fair: Split out helper to adjust imbalances between domains
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (8 preceding siblings ...)
  2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
@ 2020-02-12 15:45 ` Mel Gorman
  2020-02-12 15:46 ` [PATCH 10/11] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12 15:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

The patch "sched/fair: Allow a small load imbalance between low utilisation
SD_NUMA domains" allows an imbalance when the busiest group has very
few tasks. Move the check to a helper function as it is needed by a later
patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24fc90b8036a..b2476ef0b056 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8758,6 +8758,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+{
+	unsigned int imbalance_min;
+
+	/*
+	 * Allow a small imbalance based on a simple pair of communicating
+	 * tasks that remain local when the source domain is almost idle.
+	 */
+	imbalance_min = 2;
+	if (src_nr_running <= imbalance_min)
+		return 0;
+
+	return imbalance;
+}
+
 /**
  * calculate_imbalance - Calculate the amount of imbalance present within the
  *			 groups of a given sched_domain during load balance.
@@ -8854,18 +8869,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA) {
-			unsigned int imbalance_min;
-
-			/*
-			 * Allow a small imbalance based on a simple pair of
-			 * communicating tasks that remain local when the
-			 * source domain is almost idle.
-			 */
-			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
-				env->imbalance = 0;
-		}
+		if (env->sd->flags & SD_NUMA)
+			env->imbalance = adjust_numa_imbalance(env->imbalance,
+						busiest->sum_nr_running);
 
 		return;
 	}
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread
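[Editor's note: the behaviour of the helper introduced above can be modelled
with a short sketch. This is a hypothetical Python re-implementation for
illustration only, not kernel code; the names mirror the patch.]

```python
def adjust_numa_imbalance(imbalance, src_nr_running, imbalance_min=2):
    """Allow a small imbalance when the source domain is almost idle,
    so a simple pair of communicating tasks can remain local."""
    if src_nr_running <= imbalance_min:
        # Treat the domain as balanced: do not pull the tasks apart.
        return 0
    return imbalance
```

With two or fewer runnable tasks on the source node, the reported imbalance
is ignored; beyond that, the load balancer's view is left untouched.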

* [PATCH 10/11] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (9 preceding siblings ...)
  2020-02-12 15:45 ` [PATCH 09/11] sched/fair: Split out helper to adjust imbalances between domains Mel Gorman
@ 2020-02-12 15:46 ` Mel Gorman
  2020-02-12 15:46 ` [PATCH 11/11] sched/numa: Use similar logic to the load balancer for moving between overloaded domains Mel Gorman
       [not found] ` <20200214041232.18904-1-hdanton@sina.com>
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12 15:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

The standard load balancer generally allows an imbalance to exist if
a domain has spare capacity. This patch uses similar logic within NUMA
balancing when moving a task to a preferred node. This is not a perfect
comparison with the load balancer but should be a close enough match
when the destination domain has spare capacity and the imbalance is not
too large.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 79 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b2476ef0b056..69e41204cfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1473,21 +1473,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
 }
 
-static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
-
-static unsigned long cpu_runnable_load(struct rq *rq)
-{
-	return cfs_rq_runnable_load_avg(&rq->cfs);
-}
-
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
-	unsigned long load;
+	unsigned long group_load;
+	unsigned long group_util;
 
 	/* Total compute capacity of CPUs on a node */
-	unsigned long compute_capacity;
+	unsigned long group_capacity;
+
+	unsigned int sum_nr_running;
 
 	/* Details on idle CPUs */
+	unsigned int group_weight;
+	int nr_idle;
 	int idle_cpu;
 };
 
@@ -1511,6 +1509,22 @@ static inline bool is_core_idle(int cpu)
 /* Forward declarations of select_idle_sibling helpers */
 static inline bool test_idle_cores(int cpu, bool def);
 
+/* Forward declarations of lb helpers */
+static unsigned long cpu_load(struct rq *rq);
+static inline unsigned long cpu_util(int cpu);
+static inline bool __lb_has_capacity(unsigned int imbalance_pct,
+	unsigned int sum_nr_running, unsigned int group_weight,
+	unsigned long group_capacity, unsigned long group_util);
+static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
+
+/* NUMA Balancing equivalents for LB helpers */
+static inline bool
+numa_has_capacity(unsigned int imbalance_pct, struct numa_stats *ns)
+{
+	return __lb_has_capacity(imbalance_pct, ns->sum_nr_running + 1,
+		ns->group_weight, ns->group_capacity, ns->group_util);
+}
+
 /*
  * Gather all necessary information to make NUMA balancing placement
  * decisions that are compatible with standard load balanced. This
@@ -1529,14 +1543,20 @@ update_numa_stats(struct numa_stats *ns, int nid,
 	ns->idle_cpu = -1;
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
+		unsigned int nr_running = rq->nr_running;
 
-		ns->load += cpu_runnable_load(rq);
-		ns->compute_capacity += capacity_of(cpu);
+		ns->group_load += cpu_load(rq);
+		ns->group_util += cpu_util(cpu);
+		ns->group_capacity += capacity_of(cpu);
+		ns->group_weight++;
+		ns->sum_nr_running += nr_running;
 
-		if (find_idle && !rq->nr_running && idle_cpu(cpu)) {
+		if (!nr_running && idle_cpu(cpu)) {
 			int this_llc_id;
 
-			if (READ_ONCE(rq->numa_migrate_on) ||
+			ns->nr_idle++;
+
+			if (!find_idle || READ_ONCE(rq->numa_migrate_on) ||
 			    !cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
@@ -1646,13 +1666,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	 * ------------ vs ---------
 	 * src_capacity    dst_capacity
 	 */
-	src_capacity = env->src_stats.compute_capacity;
-	dst_capacity = env->dst_stats.compute_capacity;
+	src_capacity = env->src_stats.group_capacity;
+	dst_capacity = env->dst_stats.group_capacity;
 
 	imb = abs(dst_load * src_capacity - src_load * dst_capacity);
 
-	orig_src_load = env->src_stats.load;
-	orig_dst_load = env->dst_stats.load;
+	orig_src_load = env->src_stats.group_load;
+	orig_dst_load = env->dst_stats.group_load;
 
 	old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);
 
@@ -1799,8 +1819,8 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (!load)
 		goto assign;
 
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
+	dst_load = env->dst_stats.group_load + load;
+	src_load = env->src_stats.group_load - load;
 
 	if (load_too_imbalanced(src_load, dst_load, env))
 		goto unlock;
@@ -1838,23 +1858,38 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.load + load;
-	src_load = env->src_stats.load - load;
-
 	/*
-	 * If the improvement from just moving env->p direction is better
-	 * than swapping tasks around, check if a move is possible.
+	 * If the load balancer is unlikely to interfere with the task after
+	 * a migration then use an idle CPU.
 	 */
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
+	if (env->dst_stats.idle_cpu >= 0) {
+		unsigned int imbalance;
+		int src_running, dst_running;
 
-	/* Use an idle CPU if one has been found already */
-	if (maymove && env->dst_stats.idle_cpu >= 0) {
-		env->dst_cpu = env->dst_stats.idle_cpu;
-		task_numa_assign(env, NULL, 0);
-		return;
+		/* Would movement cause an imbalance? */
+		src_running = env->src_stats.sum_nr_running - 1;
+		dst_running = env->dst_stats.sum_nr_running + 1;
+		imbalance = max(0, dst_running - src_running);
+		imbalance = adjust_numa_imbalance(imbalance, src_running);
+
+		/* Use idle CPU there is spare capacity and no imbalance */
+		if (numa_has_capacity(env->imbalance_pct, &env->dst_stats) &&
+		    !imbalance) {
+			env->dst_cpu = env->dst_stats.idle_cpu;
+			task_numa_assign(env, NULL, 0);
+			return;
+		}
 	}
 
+	/*
+	 * If using an idle CPU would cause an imbalance that would likely
+	 * be overridden by the load balancer, consider the load instead.
+	 */
+	load = task_h_load(env->p);
+	dst_load = env->dst_stats.group_load + load;
+	src_load = env->src_stats.group_load - load;
+	maymove = !load_too_imbalanced(src_load, dst_load, env);
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
@@ -8048,18 +8083,27 @@ static inline int sg_imbalanced(struct sched_group *group)
  * any benefit for the load balance.
  */
 static inline bool
-group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
+__lb_has_capacity(unsigned int imbalance_pct, unsigned int sum_nr_running,
+	unsigned int group_weight, unsigned long group_capacity,
+	unsigned long group_util)
 {
-	if (sgs->sum_nr_running < sgs->group_weight)
+	if (sum_nr_running < group_weight)
 		return true;
 
-	if ((sgs->group_capacity * 100) >
-			(sgs->group_util * imbalance_pct))
+	if ((group_capacity * 100) >
+			(group_util * imbalance_pct))
 		return true;
 
 	return false;
 }
 
+static inline bool
+group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
+{
+	return __lb_has_capacity(imbalance_pct, sgs->sum_nr_running,
+		sgs->group_weight, sgs->group_capacity, sgs->group_util);
+}
+
 /*
  *  group_is_overloaded returns true if the group has more tasks than it can
  *  handle.
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread
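[Editor's note: the spare-capacity check that this patch shares between the
load balancer and NUMA balancing can be sketched as follows. This is a
hypothetical Python model, not kernel code; the names and the +1 on the
runnable count mirror __lb_has_capacity()/numa_has_capacity() above.]

```python
def lb_has_capacity(imbalance_pct, sum_nr_running, group_weight,
                    group_capacity, group_util):
    # Fewer runnable tasks than CPUs: the group trivially has capacity.
    if sum_nr_running < group_weight:
        return True
    # Otherwise require capacity to exceed utilisation scaled by
    # imbalance_pct (e.g. 117 demands roughly 17% headroom).
    return group_capacity * 100 > group_util * imbalance_pct

def numa_has_capacity(imbalance_pct, ns):
    # NUMA balancing asks whether the node could take one *more* task,
    # hence the +1 on the runnable count.
    return lb_has_capacity(imbalance_pct, ns["sum_nr_running"] + 1,
                           ns["group_weight"], ns["group_capacity"],
                           ns["group_util"])
```

A node that is fully packed but lightly utilised still counts as having
capacity, which is what lets NUMA balancing use an idle CPU there.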

* [PATCH 11/11] sched/numa: Use similar logic to the load balancer for moving between overloaded domains
  2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
                   ` (10 preceding siblings ...)
  2020-02-12 15:46 ` [PATCH 10/11] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
@ 2020-02-12 15:46 ` Mel Gorman
       [not found] ` <20200214041232.18904-1-hdanton@sina.com>
  12 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12 15:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

The NUMA balancer compares task loads to determine if a task can move
to the preferred domain but this is not what the load balancer does.
Migrating a task to a preferred domain risks the NUMA balancer and load
balancer overriding each other's decisions.

This patch brings the NUMA balancer more in line with the load balancer by
considering two cases when an idle CPU cannot be used. If the preferred
node has no spare capacity but a move to an idle CPU would not cause
a load imbalance then it's allowed. Alternatively, when comparing tasks
for swapping, it'll allow the swap if the load imbalance is reduced.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 132 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 75 insertions(+), 57 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69e41204cfae..b5a93c345e97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1605,7 +1605,7 @@ struct task_numa_env {
 	int best_cpu;
 };
 
-static void task_numa_assign(struct task_numa_env *env,
+static bool task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
 	struct rq *rq = cpu_rq(env->dst_cpu);
@@ -1629,7 +1629,7 @@ static void task_numa_assign(struct task_numa_env *env,
 		}
 
 		/* Failed to find an alternative idle CPU */
-		return;
+		return false;
 	}
 
 assign:
@@ -1650,34 +1650,39 @@ static void task_numa_assign(struct task_numa_env *env,
 	env->best_task = p;
 	env->best_imp = imp;
 	env->best_cpu = env->dst_cpu;
+	return true;
 }
 
-static bool load_too_imbalanced(long src_load, long dst_load,
-				struct task_numa_env *env)
+/*
+ * For overloaded domains, the load balancer compares average
+ * load -- See group_overloaded case in update_sd_pick_busiest.
+ * Return true if the load balancer has more work to do after
+ * a move.
+ */
+static bool load_degrades_imbalance(struct task_numa_env *env, long src_load, long dst_load)
 {
-	long imb, old_imb;
-	long orig_src_load, orig_dst_load;
-	long src_capacity, dst_capacity;
+	long cur_imb, new_imb;
+	long src_avg_load, dst_avg_load;
 
-	/*
-	 * The load is corrected for the CPU capacity available on each node.
-	 *
-	 * src_load        dst_load
-	 * ------------ vs ---------
-	 * src_capacity    dst_capacity
-	 */
-	src_capacity = env->src_stats.group_capacity;
-	dst_capacity = env->dst_stats.group_capacity;
-
-	imb = abs(dst_load * src_capacity - src_load * dst_capacity);
+	if (src_load == dst_load)
+		return false;
 
-	orig_src_load = env->src_stats.group_load;
-	orig_dst_load = env->dst_stats.group_load;
+	/* Current imbalance */
+	src_avg_load = (env->src_stats.group_load * SCHED_CAPACITY_SCALE) /
+			env->src_stats.group_capacity;
+	dst_avg_load = (env->dst_stats.group_load * SCHED_CAPACITY_SCALE) /
+			env->dst_stats.group_capacity;
+	cur_imb = abs(dst_avg_load - src_avg_load);
 
-	old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);
+	/* Post-move imbalance */
+	src_avg_load = ((env->src_stats.group_load + dst_load - src_load) * SCHED_CAPACITY_SCALE) /
+			env->src_stats.group_capacity;
+	dst_avg_load = ((env->dst_stats.group_load + src_load - dst_load) * SCHED_CAPACITY_SCALE) /
+			env->dst_stats.group_capacity;
+	new_imb = abs(dst_avg_load - src_avg_load);
 
-	/* Would this change make things worse? */
-	return (imb > old_imb);
+	/* Does the move make the imbalance worse? */
+	return new_imb > cur_imb;
 }
 
 /*
@@ -1693,7 +1698,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p);
@@ -1703,10 +1708,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long src_load, dst_load;
 	int dist = env->dist;
 	long moveimp = imp;
-	long load;
 
 	if (READ_ONCE(dst_rq->numa_migrate_on))
-		return;
+		return false;
 
 	rcu_read_lock();
 	cur = rcu_dereference(dst_rq->curr);
@@ -1802,7 +1806,6 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto assign;
 	}
 
-
 	/*
 	 * If the NUMA importance is less than SMALLIMP,
 	 * task migration might only result in ping pong
@@ -1812,17 +1815,10 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
 		goto unlock;
 
-	/*
-	 * In the overloaded case, try and keep the load balanced.
-	 */
-	load = task_h_load(env->p) - task_h_load(cur);
-	if (!load)
-		goto assign;
-
-	dst_load = env->dst_stats.group_load + load;
-	src_load = env->src_stats.group_load - load;
-
-	if (load_too_imbalanced(src_load, dst_load, env))
+	/* In the overloaded case, try and keep the load balanced. */
+	src_load = task_h_load(env->p);
+	dst_load = task_h_load(cur);
+	if (load_degrades_imbalance(env, src_load, dst_load))
 		goto unlock;
 
 assign:
@@ -1849,20 +1845,41 @@ static void task_numa_compare(struct task_numa_env *env,
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
+
+	/*
+	 * If a move to idle is allowed because there is capacity or load
+	 * balance improves then stop the search. While a better swap
+	 * candidate may exist, a search is not free.
+	 */
+	if (maymove && !cur)
+		return true;
+
+	/*
+	 * If a swap candidate must be identified and the best one moves to
+	 * its preferred node then stop the search. Search is not free.
+	 */
+	if (!maymove && env->best_task &&
+	    env->best_task->numa_preferred_nid == env->src_nid) {
+		return true;
+	}
+
+	return false;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
-	long src_load, dst_load, load;
-	bool maymove = false;
 	int cpu;
+	bool has_capacity;
+	bool maymove = false;
+
+	has_capacity = numa_has_capacity(env->imbalance_pct, &env->dst_stats);
 
 	/*
 	 * If the load balancer is unlikely to interfere with the task after
-	 * a migration then use an idle CPU.
+	 * a migration then try to use an idle CPU.
 	 */
-	if (env->dst_stats.idle_cpu >= 0) {
+	if (has_capacity) {
 		unsigned int imbalance;
 		int src_running, dst_running;
 
@@ -1872,31 +1889,32 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, src_running);
 
-		/* Use idle CPU there is spare capacity and no imbalance */
-		if (numa_has_capacity(env->imbalance_pct, &env->dst_stats) &&
-		    !imbalance) {
+		/* Use idle CPU if there is no imbalance */
+		if (!imbalance) {
 			env->dst_cpu = env->dst_stats.idle_cpu;
-			task_numa_assign(env, NULL, 0);
-			return;
+			if (env->dst_cpu >= 0 && task_numa_assign(env, NULL, 0))
+				return;
+
+			/*
+			 * A race prevented finding or using an idle CPU but
+			 * searching for swap candidates may still find an
+			 * idle CPU.
+			 */
+			maymove = true;
 		}
+	} else {
+		/* Fully busy/overloaded. Allow a move if it improves balance */
+		maymove = !load_degrades_imbalance(env, task_h_load(env->p), 0);
 	}
 
-	/*
-	 * If using an idle CPU would cause an imbalance that would likely
-	 * be overridden by the load balancer, consider the load instead.
-	 */
-	load = task_h_load(env->p);
-	dst_load = env->dst_stats.group_load + load;
-	src_load = env->src_stats.group_load - load;
-	maymove = !load_too_imbalanced(src_load, dst_load, env);
-
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 21+ messages in thread
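[Editor's note: the average-load comparison that replaces
load_too_imbalanced() in this patch can be sketched as below. This is a
hypothetical Python model for illustration, not kernel code; it mirrors
load_degrades_imbalance() above, with dst_load = 0 describing a plain move
rather than a swap.]

```python
SCHED_CAPACITY_SCALE = 1024  # same fixed-point scale the kernel uses

def avg_load(group_load, group_capacity):
    # Average load of a node, corrected for its compute capacity.
    return (group_load * SCHED_CAPACITY_SCALE) // group_capacity

def load_degrades_imbalance(src_stats, dst_stats, src_load, dst_load):
    """Return True if moving src_load off the source node in exchange
    for dst_load (0 for a plain move) widens the average-load gap
    between the two nodes."""
    if src_load == dst_load:
        return False
    cur = abs(avg_load(dst_stats["load"], dst_stats["capacity"]) -
              avg_load(src_stats["load"], src_stats["capacity"]))
    new = abs(avg_load(dst_stats["load"] + src_load - dst_load,
                       dst_stats["capacity"]) -
              avg_load(src_stats["load"] + dst_load - src_load,
                       src_stats["capacity"]))
    return new > cur
```

Moving load from a busy node toward an idle one narrows the gap and is
allowed; a swap that widens the gap is rejected, which is the same
group_overloaded comparison the load balancer makes.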

* Re: [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer
  2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
  2020-02-12 14:07   ` Valentin Schneider
@ 2020-02-12 15:48   ` Mel Gorman
  2020-02-12 16:13     ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2020-02-12 15:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

On Wed, Feb 12, 2020 at 02:22:03PM +0100, Vincent Guittot wrote:
> Hi Mel,
> 
> On Wed, 12 Feb 2020 at 10:36, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The NUMA balancer makes placement decisions on tasks that partially
> > take the load balancer into account and vice versa but there are
> > inconsistencies. This can result in placement decisions that override
> > each other leading to unnecessary migrations -- both task placement and
> > page placement. This is a prototype series that attempts to reconcile the
> > decisions. It's a bit premature but it would also need to be reconciled
> > with Vincent's series "[PATCH 0/4] remove runnable_load_avg and improve
> > group_classify"
> >
> > The first three patches are unrelated and are either pending in tip or
> > should be but they were part of the testing of this series so I have to
> > mention them.
> >
> > The fourth and fifth patches are tracing only and were needed to get
> > sensible data out of ftrace with respect to task placement for NUMA
> > balancing. Patches 6-8 reduce overhead and reduce the chances of NUMA
> > balancing overriding itself. Patches 9-11 try and bring the CPU placement
> > decisions of NUMA balancing in line with the load balancer.
> 
> Don't know if it's only me but I can't find patches 9-11 on mailing list
> 

I think my outgoing SMTP must have decided I was spamming. I tried
resending just those patches.

At the moment, I'm redoing a series on top of tip taking the tracing
patches, yours on top (for testing) and the minor optimisations to see
what that gets me. The reconciliation between NUMA balancing and load
balancing (patches 9-11) can be redone on top if the rest looks ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases
  2020-02-12 14:43   ` Steven Rostedt
@ 2020-02-12 15:59     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-12 15:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Ben Segall, Valentin Schneider, Phil Auld,
	LKML

On Wed, Feb 12, 2020 at 09:43:08AM -0500, Steven Rostedt wrote:
> > -DEFINE_EVENT(sched_move_task_template, sched_move_numa,
> > -	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> > +TRACE_EVENT(sched_stick_numa,
> >  
> > -	TP_ARGS(tsk, src_cpu, dst_cpu)
> > -);
> > +	TP_PROTO(struct task_struct *src_tsk, int src_cpu, struct task_struct *dst_tsk, int dst_cpu),
> >  
> > -DEFINE_EVENT(sched_move_task_template, sched_stick_numa,
> > -	TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu),
> > +	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
> >  
> > -	TP_ARGS(tsk, src_cpu, dst_cpu)
> > +	TP_STRUCT__entry(
> > +		__field( pid_t,	src_pid			)
> > +		__field( pid_t,	src_tgid		)
> > +		__field( pid_t,	src_ngid		)
> > +		__field( int,	src_cpu			)
> > +		__field( int,	src_nid			)
> > +		__field( pid_t,	dst_pid			)
> > +		__field( pid_t,	dst_tgid		)
> > +		__field( pid_t,	dst_ngid		)
> > +		__field( int,	dst_cpu			)
> > +		__field( int,	dst_nid			)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->src_pid	= task_pid_nr(src_tsk);
> > +		__entry->src_tgid	= task_tgid_nr(src_tsk);
> > +		__entry->src_ngid	= task_numa_group_id(src_tsk);
> > +		__entry->src_cpu	= src_cpu;
> > +		__entry->src_nid	= cpu_to_node(src_cpu);
> > +		__entry->dst_pid	= dst_tsk ? task_pid_nr(dst_tsk) : 0;
> > +		__entry->dst_tgid	= dst_tsk ? task_tgid_nr(dst_tsk) : 0;
> > +		__entry->dst_ngid	= dst_tsk ? task_numa_group_id(dst_tsk) : 0;
> > +		__entry->dst_cpu	= dst_cpu;
> > +		__entry->dst_nid	= dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1;
> > +	),
> > +
> > +	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
> > +			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
> > +			__entry->src_cpu, __entry->src_nid,
> > +			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
> > +			__entry->dst_cpu, __entry->dst_nid)
> >  );
> >  
> 
> The above looks the same as the below sched_swap_numa. Can you make a
> DECLARE_EVENT_CLASS() and merge the two for sched_swap_numa?
> 
> Note, most the footprint of a trace event happens in the
> DECLARE_EVENT_CLASS() (a TRACE_EVENT() is just a DECLARE_EVENT_CLASS()
> and DEFINE_EVENT() put together). The more DECLARE_EVENT_CLASS()s you
> can share, the less the footprint is.
> 

No problem, I've fixed it, aka it builds. Thanks Steven

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3d07c0af4ab8..f5b75c5fef7e 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -523,9 +523,10 @@ TRACE_EVENT(sched_move_numa,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
-TRACE_EVENT(sched_stick_numa,
+DECLARE_EVENT_CLASS(sched_numa_pair_template,
 
-	TP_PROTO(struct task_struct *src_tsk, int src_cpu, struct task_struct *dst_tsk, int dst_cpu),
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
 
 	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
 
@@ -562,46 +563,23 @@ TRACE_EVENT(sched_stick_numa,
 			__entry->dst_cpu, __entry->dst_nid)
 );
 
-TRACE_EVENT(sched_swap_numa,
+DEFINE_EVENT(sched_numa_pair_template, sched_stick_numa,
 
 	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
 		 struct task_struct *dst_tsk, int dst_cpu),
 
-	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu),
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
+);
 
-	TP_STRUCT__entry(
-		__field( pid_t,	src_pid			)
-		__field( pid_t,	src_tgid		)
-		__field( pid_t,	src_ngid		)
-		__field( int,	src_cpu			)
-		__field( int,	src_nid			)
-		__field( pid_t,	dst_pid			)
-		__field( pid_t,	dst_tgid		)
-		__field( pid_t,	dst_ngid		)
-		__field( int,	dst_cpu			)
-		__field( int,	dst_nid			)
-	),
+DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 
-	TP_fast_assign(
-		__entry->src_pid	= task_pid_nr(src_tsk);
-		__entry->src_tgid	= task_tgid_nr(src_tsk);
-		__entry->src_ngid	= task_numa_group_id(src_tsk);
-		__entry->src_cpu	= src_cpu;
-		__entry->src_nid	= cpu_to_node(src_cpu);
-		__entry->dst_pid	= task_pid_nr(dst_tsk);
-		__entry->dst_tgid	= task_tgid_nr(dst_tsk);
-		__entry->dst_ngid	= task_numa_group_id(dst_tsk);
-		__entry->dst_cpu	= dst_cpu;
-		__entry->dst_nid	= cpu_to_node(dst_cpu);
-	),
+	TP_PROTO(struct task_struct *src_tsk, int src_cpu,
+		 struct task_struct *dst_tsk, int dst_cpu),
 
-	TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d",
-			__entry->src_pid, __entry->src_tgid, __entry->src_ngid,
-			__entry->src_cpu, __entry->src_nid,
-			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
-			__entry->dst_cpu, __entry->dst_nid)
+	TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu)
 );
 
+
 /*
  * Tracepoint for waking a polling cpu without an IPI.
  */


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer
  2020-02-12 15:48   ` Mel Gorman
@ 2020-02-12 16:13     ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-02-12 16:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

On Wed, 12 Feb 2020 at 16:48, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Feb 12, 2020 at 02:22:03PM +0100, Vincent Guittot wrote:
> > Hi Mel,
> >
> > On Wed, 12 Feb 2020 at 10:36, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > The NUMA balancer makes placement decisions on tasks that partially
> > > take the load balancer into account and vice versa but there are
> > > inconsistencies. This can result in placement decisions that override
> > > each other leading to unnecessary migrations -- both task placement and
> > > page placement. This is a prototype series that attempts to reconcile the
> > > decisions. It's a bit premature but it would also need to be reconciled
> > > with Vincent's series "[PATCH 0/4] remove runnable_load_avg and improve
> > > group_classify"
> > >
> > > The first three patches are unrelated and are either pending in tip or
> > > should be but they were part of the testing of this series so I have to
> > > mention them.
> > >
> > > The fourth and fifth patches are tracing only and were needed to get
> > > sensible data out of ftrace with respect to task placement for NUMA
> > > balancing. Patches 6-8 reduce overhead and reduce the chances of NUMA
> > > balancing overriding itself. Patches 9-11 try to bring the CPU placement
> > > decisions of NUMA balancing in line with the load balancer.
> >
> > Don't know if it's only me but I can't find patches 9-11 on mailing list
> >
>
> I think my outgoing SMTP must have decided I was spamming. I tried
> resending just those patches.

I received them.
Thanks

>
> At the moment, I'm redoing a series on top of tip taking the tracing
> patches, yours on top (for testing) and the minor optimisations to see
> what that gets me.  The reconciliation between NUMA balancing and load
> balancing (patches 9-11) can be redone on top if the rest look ok.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node
  2020-02-12  9:36 ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
@ 2020-02-13 10:31   ` Peter Zijlstra
  2020-02-13 11:18     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-02-13 10:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

On Wed, Feb 12, 2020 at 09:36:51AM +0000, Mel Gorman wrote:
> When swapping tasks for NUMA balancing, it is preferred that tasks move
> to or remain on their preferred node. When considering an imbalance,
> encourage tasks to move to their preferred node and discourage tasks from
> moving away from their preferred node.

Wasn't there an issue for workloads that span multiple nodes?

Say a 4 node system with 2 warehouses? Then each JVM will want 2 nodes,
instead of a single node, and strong preferred node stuff makes it
difficult to achieve this.

I forgot how we dealt with these cases, just something I worry about
when reading this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node
  2020-02-13 10:31   ` Peter Zijlstra
@ 2020-02-13 11:18     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-13 11:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Valentin Schneider, Phil Auld, LKML

On Thu, Feb 13, 2020 at 11:31:08AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2020 at 09:36:51AM +0000, Mel Gorman wrote:
> > When swapping tasks for NUMA balancing, it is preferred that tasks move
> > to or remain on their preferred node. When considering an imbalance,
> > encourage tasks to move to their preferred node and discourage tasks from
> > moving away from their preferred node.
> 
> Wasn't there an issue for workloads that span multiple nodes?
> 

Sort of, yes -- specifically workloads that could not fit inside a node
for whatever reason.

> Say a 4 node system with 2 warehouses? Then each JVM will want 2 nodes,
> instead of a single node, and strong preferred node stuff makes it
> difficult to achieve this.
> 
> I forgot how we dealt with these cases, just something I worry about
> when reading this.

We deal with it in task_numa_migrate() by considering nodes other
than the preferred node for placement -- see "Look at other nodes in
these cases" followed by a sched_setnuma if the preferred node doesn't
match.

We do not do any special casing as such in task_numa_compare other than
finding the best improvement so we can pick a task belonging to a group
spanning multiple nodes with or without this patch. A workload spanning
multiple nodes in itself does not justify a full search if it can be
avoided.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node
       [not found] ` <20200214041232.18904-1-hdanton@sina.com>
@ 2020-02-14  7:50   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2020-02-14  7:50 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Valentin Schneider,
	Phil Auld, LKML

On Fri, Feb 14, 2020 at 12:12:32PM +0800, Hillf Danton wrote:
> > +	if (cur->numa_preferred_nid == env->dst_nid)
> > +		imp -= imp / 16;
> > +
> > +	/*
> > +	 * Encourage picking a task that moves to its preferred node.
> > +	 * This potentially makes imp larger than it's maximum of
> > +	 * 1998 (see SMALLIMP and task_weight for why) but in this
> > +	 * case, it does not matter.
> > +	 */
> > +	if (cur->numa_preferred_nid == env->src_nid)
> > +		imp += imp / 8;
> > +
> >  	if (maymove && moveimp > imp && moveimp > env->best_imp) {
> >  		imp = moveimp;
> >  		cur = NULL;
> >  		goto assign;
> >  	}
> >  
> > +	/*
> > +	 * If a swap is required then prefer moving a task to its preferred
> > +	 * nid over a task that is not moving to a preferred nid.
> 
> after checking if imp is above SMALLIMP.
> 

It is preferable to move a task to its preferred node over one that is
not moving to its preferred node, even if the improvement is less than
SMALLIMP. The reasoning is
that NUMA balancing retries moving tasks to their preferred node
periodically and moving "now" reduces the chance of a task having to
retry its move later.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-02-14  7:50 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
2020-02-12  9:36 ` [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains Mel Gorman
2020-02-12  9:36 ` [PATCH 02/11] sched/fair: Optimize select_idle_core() Mel Gorman
2020-02-12  9:36 ` [PATCH 03/11] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
2020-02-12  9:36 ` [PATCH 04/11] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
2020-02-12  9:36 ` [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
2020-02-12 14:43   ` Steven Rostedt
2020-02-12 15:59     ` Mel Gorman
2020-02-12  9:36 ` [PATCH 06/11] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
2020-02-12  9:36 ` [PATCH 07/11] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
2020-02-12  9:36 ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
2020-02-13 10:31   ` Peter Zijlstra
2020-02-13 11:18     ` Mel Gorman
2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
2020-02-12 14:07   ` Valentin Schneider
2020-02-12 15:48   ` Mel Gorman
2020-02-12 16:13     ` Vincent Guittot
2020-02-12 15:45 ` [PATCH 09/11] sched/fair: Split out helper to adjust imbalances between domains Mel Gorman
2020-02-12 15:46 ` [PATCH 10/11] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
2020-02-12 15:46 ` [PATCH 11/11] sched/numa: Use similar logic to the load balancer for moving between overloaded domains Mel Gorman
     [not found] ` <20200214041232.18904-1-hdanton@sina.com>
2020-02-14  7:50   ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
