* [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING
@ 2021-05-13 15:49 Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 1/6] sched/topology: Introduce sched_group::flags Ricardo Neri
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri

This is v3 of the series. The v1 and v2 patches and test results can be found
in [1] and [2], respectively.

=== Problem statement ===
On SMT-enabled hardware, ASYM_PACKING can cause the load balancer to choose
low priority CPUs over medium priority CPUs.

When balancing load in scheduling domains with the SD_ASYM_PACKING flag,
idle CPUs of higher priority pull tasks from CPUs of lower priority. This
balancing is done by comparing pairs of scheduling groups. There may also
be scheduling groups composed of CPUs that are SMT siblings.

When using SD_ASYM_PACKING on x86 with Intel Turbo Boost Max Technology 3.0
(ITMT), the priorities of a scheduling group of CPUs that has N SMT
siblings are assigned as N*P, N*P/2, N*P/3, ..., P, where P is the
priority assigned by the hardware to the physical core and N is the number
of SMT siblings.

Systems such as Intel Comet Lake can have some cores supporting SMT, while
others do not. As a result, it is possible to have medium non-SMT
priorities, Q, such that N*P > Q > P.
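
As an illustration (with made-up numbers): if N = 2 and a high priority core
has P = 40, its SMT siblings get priorities 80 and 40; a non-SMT core of
medium priority Q = 60 then satisfies 80 > 60 > 40.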

When comparing groups for load balancing, the priority of the CPU doing the
load balancing is only compared with the preferred CPU of the candidate
busiest group (N*P vs Q in the example above). Thus, scheduling groups
with a preferred CPU with priority N*P always pull tasks from the
scheduling group with priority Q and then such tasks are spread across the
“SMT” domain. Conversely, since N*P > Q, CPUs with priority Q cannot
pull tasks from a group with a preferred CPU with priority N*P, even
though Q > P.

Balancing based on load (i.e., when the busiest group is of type
group_overloaded) does not help either if the system is not fully busy, as
the involved groups will have only one task each and load balancing will not
be deemed necessary.

The behavior described above results in leaving CPUs with medium priority
idle, while CPUs with lower priority are busy. Furthermore, such CPUs of
lower priority are SMT siblings of high priority CPUs, which are also busy.

This patchset fixes this behavior by also checking the idle state of the SMT
siblings of both the CPU doing the load balancing and the candidate busiest
group.

I ran a few benchmarks with and without this version of the patchset on
an Intel(R) Core(TM) i9-7900X CPU. I kept online both SMT siblings of two
high priority cores. I offlined the lower priority SMT siblings of four
low priority cores. I offlined the rest of the cores. The resulting
priority configuration meets the N*P > Q > P condition described above.

The baseline for the results is an unmodified v5.12 kernel. Results
show a comparative percentage of improvement (positive) or degradation
(negative). Each test case is repeated five times, and the standard
deviation among repetitions is also documented.

In order to judge only the improvements this patchset provides, Table 1
shows the results when setting the CPU's frequency at 1000MHz. It can be
observed that the patches bring an overall positive impact for tests
that exhibit low variance. Some of the test cases, however, show a high
variance and it is difficult to assess the impact. Still, a few test
cases with low variance exhibit slight degradation.

Table 2 shows the results when using hardware-controlled performance states
(HWP), a common use case. In this case, the impact of the patches is also
overall positive, with a few test cases showing slight degradation.

Thanks and BR,
Ricardo

========
Changes since v2:
  * Removed arch_sched_asym_prefer_early() and re-evaluation of the
    candidate busiest group in update_sd_pick_busiest(). (PeterZ)
  * Introduced sched_group::flags to reflect the properties of CPUs
    in the scheduling group. This helps to identify scheduling groups
    whose CPUs are SMT siblings. (PeterZ)
  * Modified update_sg_lb_stats() to get statistics of the scheduling
    domain as an argument. This provides it with the statistics of the
    local domain. (PeterZ)
  * Introduced sched_asym() to decide if a busiest candidate group can
    be marked for asymmetric packing.
  * Reworded patch 1 for clarity. (PeterZ)

========
Changes since v1:
  * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
    tasks. Instead, reclassify the candidate busiest group, as it
    may still be selected. (PeterZ)
  * Avoid an expensive and unnecessary call to cpumask_weight() when
    determining if a sched_group is comprised of SMT siblings.
    (PeterZ).
  * Updated test results using the v2 patches.

========      Table 1. Test results of patches at 1000MHz      ========
=======================================================================
hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            group-2          1.00 (  4.17)   +0.41 (  1.35)
process-pipe            group-4          1.00 (  5.26)   +4.02 (  3.80)
process-pipe            group-8          1.00 ( 12.57)   +1.34 ( 16.28)
process-sockets         group-2          1.00 (  1.04)   +0.83 (  0.58)
process-sockets         group-4          1.00 (  0.57)   -0.38 (  1.13)
process-sockets         group-8          1.00 (  6.39)   +1.15 (  3.44)
threads-pipe            group-2          1.00 (  2.50)   +1.47 (  1.78)
threads-pipe            group-4          1.00 (  5.55)   +2.68 (  4.88)
threads-pipe            group-8          1.00 ( 12.85)   -0.89 ( 10.56)
threads-sockets         group-2          1.00 (  2.11)   +1.77 (  0.77)
threads-sockets         group-4          1.00 (  0.70)   +1.11 (  0.92)
threads-sockets         group-8          1.00 (  1.20)   -1.79 (  1.88)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  thread-2         1.00 (  0.72)   -1.42 (  0.45)
TCP_RR                  thread-4         1.00 (  0.38)   +4.57 (  0.44)
TCP_RR                  thread-8         1.00 (  9.29)   +1.09 ( 15.02)
TCP_RR                  thread-16        1.00 ( 21.18)   +0.88 ( 23.62)
UDP_RR                  thread-2         1.00 (  0.23)   -2.41 (  0.49)
UDP_RR                  thread-4         1.00 (  0.19)   +4.54 (  0.37)
UDP_RR                  thread-8         1.00 (  6.83)   -0.78 (  7.70)
UDP_RR                  thread-16        1.00 ( 20.85)   -0.00 ( 19.52)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                thread-2         1.00 (  0.15)   -1.48 (  0.15)
loopback                thread-4         1.00 (  0.25)   +3.65 (  0.36)
loopback                thread-8         1.00 (  0.16)   +0.74 (  0.28)
loopback                thread-16        1.00 (  0.32)   +0.85 (  0.32)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  mthread-2        1.00 (  0.10)   -0.05 (  0.00)
normal                  mthread-4        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-8        1.00 (  4.81)   +5.57 (  2.61)

========      Table 2. Test results of patches with HWP        ========
=======================================================================

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            group-2          1.00 (  2.59)   +2.93 (  2.10)
process-pipe            group-4          1.00 (  4.05)   +5.82 (  4.55)
process-pipe            group-8          1.00 (  7.37)   -0.09 ( 15.19)
process-sockets         group-2          1.00 (  1.64)   -1.48 (  1.61)
process-sockets         group-4          1.00 (  0.95)   -1.97 (  1.44)
process-sockets         group-8          1.00 (  1.84)   -0.64 (  2.08)
threads-pipe            group-2          1.00 (  3.02)   -0.09 (  1.70)
threads-pipe            group-4          1.00 (  5.39)   +3.42 (  4.59)
threads-pipe            group-8          1.00 (  5.56)   +1.96 ( 10.98)
threads-sockets         group-2          1.00 (  1.10)   +1.99 (  1.67)
threads-sockets         group-4          1.00 (  0.45)   +1.56 (  1.10)
threads-sockets         group-8          1.00 (  3.56)   +3.27 (  1.47)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  thread-2         1.00 (  0.36)  +21.26 (  1.40)
TCP_RR                  thread-4         1.00 (  0.31)   +1.74 (  0.31)
TCP_RR                  thread-8         1.00 ( 11.70)   -1.03 ( 13.55)
TCP_RR                  thread-16        1.00 ( 23.49)   -0.43 ( 22.79)
UDP_RR                  thread-2         1.00 (  0.19)  +16.29 (  0.97)
UDP_RR                  thread-4         1.00 (  0.13)   +3.51 (  0.51)
UDP_RR                  thread-8         1.00 (  8.95)   +0.05 ( 10.84)
UDP_RR                  thread-16        1.00 ( 21.54)   +0.10 ( 22.89)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                thread-2         1.00 (  0.22)  +10.81 (  1.23)
loopback                thread-4         1.00 (  2.56)  +61.43 (  0.56)
loopback                thread-8         1.00 (  0.41)   -0.10 (  0.42)
loopback                thread-16        1.00 (  0.60)   +0.59 (  0.21)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  mthread-2        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-4        1.00 (  0.00)   +0.00 (  0.00)
normal                  mthread-8        1.00 (  0.00)   -2.67 (  5.20)

========
[1]. https://lore.kernel.org/lkml/20210406041108.7416-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20210414020436.12980-1-ricardo.neri-calderon@linux.intel.com/

Ricardo Neri (6):
  sched/topology: Introduce sched_group::flags
  sched/fair: Optimize checking for group_asym_packing
  sched/fair: Provide update_sg_lb_stats() with sched domain statistics
  sched/fair: Carve out logic to mark a group for asymmetric packing
  sched/fair: Consider SMT in ASYM_PACKING load balance
  x86/sched: Enable SMT checks for asymmetric packing in load balancing

 arch/x86/kernel/itmt.c         |  10 +++
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 135 ++++++++++++++++++++++++++++++---
 kernel/sched/sched.h           |   1 +
 kernel/sched/topology.c        |   7 +-
 5 files changed, 142 insertions(+), 12 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/6] sched/topology: Introduce sched_group::flags
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 2/6] sched/fair: Optimize checking for group_asym_packing Ricardo Neri
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

There exist situations in which the load balancer needs to know the
properties of the CPUs in a scheduling group. When using asymmetric
packing, for instance, the load balancer needs to know not only the
state of dst_cpu but also that of its SMT siblings, if any.

Use the flags of the child scheduling domains to initialize scheduling
group flags. This will reflect the properties of the CPUs in the
group.
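
As an illustration only (the actual consumer of these flags is added later in
this series; this sketch is not part of the patch), an SMT group can then be
identified directly from the group flags:

	if (sg->flags & SD_SHARE_CPUCAPACITY) {
		/* All CPUs in this group are SMT siblings of one core. */
	}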

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/sched.h    | 1 +
 kernel/sched/topology.c | 7 +++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f0194cee0ba..1afd82c1743d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1795,6 +1795,7 @@ struct sched_group {
 	unsigned int		group_weight;
 	struct sched_group_capacity *sgc;
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
+	int			flags;
 
 	/*
 	 * The CPUs this group covers.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 053115b55f89..1caab1a5e5c8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -916,10 +916,12 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 		return NULL;
 
 	sg_span = sched_group_span(sg);
-	if (sd->child)
+	if (sd->child) {
 		cpumask_copy(sg_span, sched_domain_span(sd->child));
-	else
+		sg->flags = sd->child->flags;
+	} else {
 		cpumask_copy(sg_span, sched_domain_span(sd));
+	}
 
 	atomic_inc(&sg->ref);
 	return sg;
@@ -1169,6 +1171,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 	if (child) {
 		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
 		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
+		sg->flags = child->flags;
 	} else {
 		cpumask_set_cpu(cpu, sched_group_span(sg));
 		cpumask_set_cpu(cpu, group_balance_mask(sg));
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 2/6] sched/fair: Optimize checking for group_asym_packing
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 1/6] sched/topology: Introduce sched_group::flags Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 3/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics Ricardo Neri
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

sched_asym_prefer(a, b) returns true if the priority of CPU a is higher
than the priority of CPU b. dst_cpu belongs to the local group. It follows
that its priority is less than or equal to the priority of the preferred
CPU of the local group. Thus, sched_asym_prefer() always returns false
when called on the local group.

By checking local_group, we can avoid additional checks and avoid invoking
sched_asym_prefer() when it is not needed.
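
For reference, sched_asym_prefer() compares the per-CPU priorities reported
by the architecture; it is roughly equivalent to:

	static inline bool sched_asym_prefer(int a, int b)
	{
		return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
	}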

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Reworded commit message for clarity. (Peter Z)

Changes since v1:
  * None
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 161b92aa1c79..96b023432a1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8507,7 +8507,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
-	if (env->sd->flags & SD_ASYM_PACKING &&
+	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
 	    env->idle != CPU_NOT_IDLE &&
 	    sgs->sum_h_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 3/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 1/6] sched/topology: Introduce sched_group::flags Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 2/6] sched/fair: Optimize checking for group_asym_packing Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing Ricardo Neri
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

Before deciding to pull tasks when using asymmetric packing of tasks,
on some architectures (e.g., x86) it is necessary to know not only the
state of dst_cpu but also that of its SMT siblings. The decision to classify
a candidate busiest group as group_asym_packing is done in
update_sg_lb_stats(). Give this function access to the scheduling domain
statistics, which include the statistics of the local group.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96b023432a1c..c8c04e9d0d3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8455,6 +8455,7 @@ group_type group_classify(unsigned int imbalance_pct,
  * @sg_status: Holds flag indicating the status of the sched_group
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
+				      struct sd_lb_stats *sds,
 				      struct sched_group *group,
 				      struct sg_lb_stats *sgs,
 				      int *sg_status)
@@ -8463,7 +8464,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	memset(sgs, 0, sizeof(*sgs));
 
-	local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+	local_group = group == sds->local;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -9026,7 +9027,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, sgs, &sg_status);
+		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
 
 		if (local_group)
 			goto next_group;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
                   ` (2 preceding siblings ...)
  2021-05-13 15:49 ` [PATCH v3 3/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  2021-05-17 14:21   ` Dietmar Eggemann
  2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
  2021-05-13 15:49 ` [PATCH v3 6/6] x86/sched: Enable SMT checks for asymmetric packing in load balancing Ricardo Neri
  5 siblings, 1 reply; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

Currently when balancing load, a candidate busiest group is marked for
asymmetric packing if, among other factors, dst_cpu has higher priority
than the preferred CPU of the candidate busiest group. However, other
factors influence the decision, such as the state of the SMT siblings of
dst_cpu, if any.

Thus, create a separate function, sched_asym(), in which logic for such
decisions can be implemented. A subsequent changeset will introduce logic
to deal with SMT.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Introduced this patch.

Changes since v1:
  * N/A
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8c04e9d0d3b..c8b66a5d593e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+static inline bool
+sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
+	   struct sched_group *group)
+{
+	/*
+	 * Because sd->groups starts with the local group, anything that isn't
+	 * the local group will have access to the local state.
+	 */
+	if (group == sds->local)
+		return false;
+
+	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8507,18 +8521,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
+	sgs->group_capacity = group->sgc->capacity;
+
+	sgs->group_weight = group->group_weight;
+
 	/* Check if dst CPU is idle and preferred to this group */
 	if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
-	    env->idle != CPU_NOT_IDLE &&
-	    sgs->sum_h_nr_running &&
-	    sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
+	    env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
+	    sched_asym(env, sds, sgs, group)) {
 		sgs->group_asym_packing = 1;
 	}
 
-	sgs->group_capacity = group->sgc->capacity;
-
-	sgs->group_weight = group->group_weight;
-
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
                   ` (3 preceding siblings ...)
  2021-05-13 15:49 ` [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  2021-05-14  9:47   ` Peter Zijlstra
                     ` (2 more replies)
  2021-05-13 15:49 ` [PATCH v3 6/6] x86/sched: Enable SMT checks for asymmetric packing in load balancing Ricardo Neri
  5 siblings, 3 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.

If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.

Implement asym_can_pull_tasks() to inspect the state of the SMT siblings
of both dst_cpu and the CPUs in the candidate busiest group.

To keep the legacy behavior, add an arch_asym_check_smt_siblings() to
skip these new checks by default. Individual architectures must override
the mentioned function to use asym_can_pull_tasks().
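
As a sketch of such an override (the last patch in this series does this for
x86 based on ITMT state), an architecture would provide something like:

	bool arch_asym_check_smt_siblings(void)
	{
		/* Gate on an arch-specific condition; shown here as a stub. */
		return true;
	}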

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Reworded the commit message to reflect updates in code.
  * Corrected misrepresentation of dst_cpu as the CPU doing the load
    balancing. (PeterZ)
  * Removed call to arch_asym_check_smt_siblings() as it is now called in
    sched_asym().

Changes since v1:
  * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
    tasks. Instead, reclassify the candidate busiest group, as it
    may still be selected. (PeterZ)
  * Avoid an expensive and unnecessary call to cpumask_weight() when
    determining if a sched_group is comprised of SMT siblings.
    (PeterZ).
---
 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 101 +++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..43bdb8b1e1df 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
+extern bool arch_asym_check_smt_siblings(void);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8b66a5d593e..3d6cc027e6e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
 	return -cpu;
 }
 
+/*
+ * For asym packing, first check the state of SMT siblings before deciding to
+ * pull tasks.
+ */
+bool __weak arch_asym_check_smt_siblings(void)
+{
+	return false;
+}
+
 /*
  * The margin used when comparing utilization with CPU capacity.
  *
@@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+/**
+ * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:	Destination CPU of the load balancing
+ * @sds:	Load-balancing data with statistics of the local group
+ * @sgs:	Load-balancing statistics of the candidate busiest group
+ * @sg:		The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
+ * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
+ * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, even out the number of idle
+ * CPUs between @sds::local and @sg. Thus, pull tasks from @sg if the
+ * difference between the number of busy CPUs is 2 or more. If the difference
+ * is 1, only pull if @dst_cpu has higher priority. If @sg does not have SMT
+ * siblings, only pull tasks if all of the SMT siblings of @dst_cpu are idle
+ * and @sg has lower priority.
+ */
+static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+				struct sg_lb_stats *sgs, struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+	int cpu, local_busy_cpus, sg_busy_cpus;
+	bool local_is_smt, sg_is_smt;
+
+	cpu = group_first_cpu(sg);
+	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
+	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
+
+	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+	if (!local_is_smt) {
+		/*
+		 * If we are here, @dst_cpu is idle and does not have SMT
+		 * siblings. Pull tasks if candidate group has two or more
+		 * busy CPUs.
+		 */
+		if (sg_is_smt && sg_busy_cpus >= 2)
+			return true;
+
+		/*
+		 * @dst_cpu does not have SMT siblings. @sg may have SMT
+		 * siblings and only one is busy. In such case, @dst_cpu
+		 * can help if it has higher priority and is idle.
+		 */
+		return !sds->local_stat.group_util &&
+		       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+	}
+
+	/* @dst_cpu has SMT siblings. */
+
+	local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
+
+	if (sg_is_smt) {
+		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
+
+		/* Local can always help to even the number of busy CPUs. */
+		if (busy_cpus_delta >= 2)
+			return true;
+
+		if (busy_cpus_delta == 1)
+			return sched_asym_prefer(dst_cpu,
+						 sg->asym_prefer_cpu);
+
+		return false;
+	}
+
+	/*
+	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
+	 * up with more than one busy SMT sibling and only pull tasks if there
+	 * are no busy CPUs. As CPUs move in and out of idle state frequently,
+	 * also check the group utilization to smooth the decision.
+	 */
+	if (!local_busy_cpus && !sds->local_stat.group_util)
+		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+	return false;
+#else
+	return true;
+#endif
+}
+
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
@@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
 	if (group == sds->local)
 		return false;
 
+	if (arch_asym_check_smt_siblings())
+		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
 
@@ -9463,6 +9558,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    nr_running == 1)
 			continue;
 
+		/* Make sure we only pull tasks from a CPU of lower priority */
+		if ((env->sd->flags & SD_ASYM_PACKING) &&
+		    sched_asym_prefer(i, env->dst_cpu) &&
+		    nr_running == 1)
+			continue;
+
 		switch (env->migration_type) {
 		case migrate_load:
 			/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 6/6] x86/sched: Enable SMT checks for asymmetric packing in load balancing
  2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
                   ` (4 preceding siblings ...)
  2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
@ 2021-05-13 15:49 ` Ricardo Neri
  5 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-13 15:49 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Len Brown, Srinivas Pandruvada, Tim Chen, Aubrey Li,
	Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Ricardo Neri, Aubrey Li,
	Daniel Bristot de Oliveira

ITMT relies on asymmetric packing of tasks to ensure CPUs are populated in
priority order. When balancing load, the scheduler compares scheduling
groups in pairs, and compares only the priorities of the highest-priority
CPUs in each group. This may result in CPUs with medium priority being
overlooked. A recent change introduced logic to also consider the idle
state of the SMT siblings of the CPU doing the load balance. Enable those
checks for x86 when using ITMT.

Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
  * Removed arch_sched_asym_prefer_early() as it is no longer needed.

Changes since v1:
  * None
---
 arch/x86/kernel/itmt.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 1afbdd1dd777..3c8f0cead574 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -28,6 +28,8 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
 
 /* Boolean to track if system has ITMT capabilities */
 static bool __read_mostly sched_itmt_capable;
+/* Boolean to activate checks on the state of SMT siblings */
+static bool __read_mostly sched_itmt_smt_checks;
 
 /*
  * Boolean to control whether we want to move processes to cpu capable
@@ -124,6 +126,8 @@ int sched_set_itmt_support(void)
 
 	sysctl_sched_itmt_enabled = 1;
 
+	sched_itmt_smt_checks = true;
+
 	x86_topology_update = true;
 	rebuild_sched_domains();
 
@@ -160,6 +164,7 @@ void sched_clear_itmt_support(void)
 	if (sysctl_sched_itmt_enabled) {
 		/* disable sched_itmt if we are no longer ITMT capable */
 		sysctl_sched_itmt_enabled = 0;
+		sched_itmt_smt_checks = false;
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
@@ -167,6 +172,11 @@ void sched_clear_itmt_support(void)
 	mutex_unlock(&itmt_update_mutex);
 }
 
+bool arch_asym_check_smt_siblings(void)
+{
+	return sched_itmt_smt_checks;
+}
+
 int arch_asym_cpu_priority(int cpu)
 {
 	return per_cpu(sched_core_priority, cpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
@ 2021-05-14  9:47   ` Peter Zijlstra
  2021-05-15  2:14     ` Ricardo Neri
  2021-05-17 15:18   ` Dietmar Eggemann
  2021-05-17 22:28   ` Joel Fernandes
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2021-05-14  9:47 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
>  include/linux/sched/topology.h |   1 +
>  kernel/sched/fair.c            | 101 +++++++++++++++++++++++++++++++++
>  2 files changed, 102 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..43bdb8b1e1df 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
>  #endif
>  
>  extern int arch_asym_cpu_priority(int cpu);
> +extern bool arch_asym_check_smt_siblings(void);
>  
>  struct sched_domain_attr {
>  	int relax_domain_level;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8b66a5d593e..3d6cc027e6e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
>  	return -cpu;
>  }
>  
> +/*
> + * For asym packing, first check the state of SMT siblings before deciding to
> + * pull tasks.
> + */
> +bool __weak arch_asym_check_smt_siblings(void)
> +{
> +	return false;
> +}
> +
>  /*
>   * The margin used when comparing utilization with CPU capacity.
>   *

> @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
>  	if (group == sds->local)
>  		return false;
>  
> +	if (arch_asym_check_smt_siblings())
> +		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
>  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }

So I'm thinking that this is a property of having ASYM_PACKING at a core
level, rather than some arch special. Wouldn't something like this be
more appropriate?

---
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
 #endif
 
 extern int arch_asym_cpu_priority(int cpu);
-extern bool arch_asym_check_smt_siblings(void);
 
 struct sched_domain_attr {
 	int relax_domain_level;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
 }
 
 /*
- * For asym packing, first check the state of SMT siblings before deciding to
- * pull tasks.
- */
-bool __weak arch_asym_check_smt_siblings(void)
-{
-	return false;
-}
-
-/*
  * The margin used when comparing utilization with CPU capacity.
  *
  * (default: ~20%)
@@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
 	if (group == sds->local)
 		return false;
 
-	if (arch_asym_check_smt_siblings())
+	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+	    (group->flags & SD_SHARE_CPUCAPACITY))
 		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
 
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-14  9:47   ` Peter Zijlstra
@ 2021-05-15  2:14     ` Ricardo Neri
  2021-05-18 19:07         ` Ricardo Neri
  0 siblings, 1 reply; 23+ messages in thread
From: Ricardo Neri @ 2021-05-15  2:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> >  include/linux/sched/topology.h |   1 +
> >  kernel/sched/fair.c            | 101 +++++++++++++++++++++++++++++++++
> >  2 files changed, 102 insertions(+)
> > 
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 8f0f778b7c91..43bdb8b1e1df 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >  
> >  extern int arch_asym_cpu_priority(int cpu);
> > +extern bool arch_asym_check_smt_siblings(void);
> >  
> >  struct sched_domain_attr {
> >  	int relax_domain_level;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c8b66a5d593e..3d6cc027e6e6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> >  	return -cpu;
> >  }
> >  
> > +/*
> > + * For asym packing, first check the state of SMT siblings before deciding to
> > + * pull tasks.
> > + */
> > +bool __weak arch_asym_check_smt_siblings(void)
> > +{
> > +	return false;
> > +}
> > +
> >  /*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> 
> > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
> >  	if (group == sds->local)
> >  		return false;
> >  
> > +	if (arch_asym_check_smt_siblings())
> > +		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > +
> >  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> >  }
> 
> So I'm thinking that this is a property of having ASYM_PACKING at a core
> level, rather than some arch special. Wouldn't something like this be
> more appropriate?
> 
> ---
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
>  #endif
>  
>  extern int arch_asym_cpu_priority(int cpu);
> -extern bool arch_asym_check_smt_siblings(void);
>  
>  struct sched_domain_attr {
>  	int relax_domain_level;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
>  }
>  
>  /*
> - * For asym packing, first check the state of SMT siblings before deciding to
> - * pull tasks.
> - */
> -bool __weak arch_asym_check_smt_siblings(void)
> -{
> -	return false;
> -}
> -
> -/*
>   * The margin used when comparing utilization with CPU capacity.
>   *
>   * (default: ~20%)
> @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
>  	if (group == sds->local)
>  		return false;
>  
> -	if (arch_asym_check_smt_siblings())
> +	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> +	    (group->flags & SD_SHARE_CPUCAPACITY))
>  		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);

Thanks Peter for the quick review! This makes sense to me. The only
reason we proposed arch_asym_check_smt_siblings() is that we were worried
about breaking powerpc (I need to study how they set priorities for SMT,
if applicable). If you think this is not an issue I can post a
v4 with this update.

Thanks and BR,
Ricardo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing
  2021-05-13 15:49 ` [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing Ricardo Neri
@ 2021-05-17 14:21   ` Dietmar Eggemann
  2021-05-18 19:18     ` Ricardo Neri
  0 siblings, 1 reply; 23+ messages in thread
From: Dietmar Eggemann @ 2021-05-17 14:21 UTC (permalink / raw)
  To: Ricardo Neri, Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On 13/05/2021 17:49, Ricardo Neri wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8c04e9d0d3b..c8b66a5d593e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
>  	return group_has_spare;
>  }
>  
> +static inline bool
> +sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
> +	   struct sched_group *group)
> +{
> +	/*
> +	 * Because sd->groups starts with the local group, anything that isn't
> +	 * the local group will have access to the local state.
> +	 */
> +	if (group == sds->local)
> +		return false;

sched_asym() is called under if(!local_group ...) from
update_sg_lb_stats(). So why checking this here again?

[...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
  2021-05-14  9:47   ` Peter Zijlstra
@ 2021-05-17 15:18   ` Dietmar Eggemann
  2021-05-18 19:10     ` Ricardo Neri
  2021-05-17 22:28   ` Joel Fernandes
  2 siblings, 1 reply; 23+ messages in thread
From: Dietmar Eggemann @ 2021-05-17 15:18 UTC (permalink / raw)
  To: Ricardo Neri, Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On 13/05/2021 17:49, Ricardo Neri wrote:

[...]

> @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
>  	return group_has_spare;
>  }
>  
> +/**
> + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu:	Destination CPU of the load balancing
> + * @sds:	Load-balancing data with statistics of the local group
> + * @sgs:	Load-balancing statistics of the candidate busiest group
> + * @sg:		The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +				struct sg_lb_stats *sgs, struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	int cpu, local_busy_cpus, sg_busy_cpus;
> +	bool local_is_smt, sg_is_smt;
> +
> +	cpu = group_first_cpu(sg);

Looks like `cpu` isn't used.

[...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
  2021-05-14  9:47   ` Peter Zijlstra
  2021-05-17 15:18   ` Dietmar Eggemann
@ 2021-05-17 22:28   ` Joel Fernandes
  2021-05-17 22:34     ` Joel Fernandes
  2 siblings, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2021-05-17 22:28 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, linux-kernel, Aubrey Li,
	Daniel Bristot de Oliveira

On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
> 
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
> 
> Implement asym_can_pull_tasks() to inspect the state of the SMT siblings
> of both dst_cpu and the CPUs in the candidate busiest group.
> 
> To keep the legacy behavior, add an arch_asym_check_smt_siblings() to
> skip these new checks by default. Individual architectures must override
> the mentioned function to use asym_can_pull_tasks().
> 
> Cc: Aubrey Li <aubrey.li@intel.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Quentin Perret <qperret@google.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
>     balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
>     sched_asym().
> 
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
>     tasks. Instead, reclassify the candidate busiest group, as it
>     may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
>     determining if a sched_group is comprised of SMT siblings.
>     (PeterZ).
> ---
>  include/linux/sched/topology.h |   1 +
>  kernel/sched/fair.c            | 101 +++++++++++++++++++++++++++++++++
>  2 files changed, 102 insertions(+)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..43bdb8b1e1df 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
>  #endif
>  
>  extern int arch_asym_cpu_priority(int cpu);
> +extern bool arch_asym_check_smt_siblings(void);
>  
>  struct sched_domain_attr {
>  	int relax_domain_level;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8b66a5d593e..3d6cc027e6e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
>  	return -cpu;
>  }
>  
> +/*
> + * For asym packing, first check the state of SMT siblings before deciding to
> + * pull tasks.
> + */
> +bool __weak arch_asym_check_smt_siblings(void)
> +{
> +	return false;
> +}
> +
>  /*
>   * The margin used when comparing utilization with CPU capacity.
>   *
> @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
>  	return group_has_spare;
>  }
>  
> +/**
> + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu:	Destination CPU of the load balancing
> + * @sds:	Load-balancing data with statistics of the local group
> + * @sgs:	Load-balancing statistics of the candidate busiest group
> + * @sg:		The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +				struct sg_lb_stats *sgs, struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	int cpu, local_busy_cpus, sg_busy_cpus;
> +	bool local_is_smt, sg_is_smt;
> +
> +	cpu = group_first_cpu(sg);
> +	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +	if (!local_is_smt) {
> +		/*
> +		 * If we are here, @dst_cpu is idle and does not have SMT
> +		 * siblings. Pull tasks if candidate group has two or more
> +		 * busy CPUs.
> +		 */
> +		if (sg_is_smt && sg_busy_cpus >= 2)
> +			return true;
> +
> +		/*
> +		 * @dst_cpu does not have SMT siblings. @sg may have SMT
> +		 * siblings and only one is busy. In such case, @dst_cpu
> +		 * can help if it has higher priority and is idle.
> +		 */
> +		return !sds->local_stat.group_util &&
> +		       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +	}
> +
> +	/* @dst_cpu has SMT siblings. */
> +
> +	local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
> +
> +	if (sg_is_smt) {
> +		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> +		/* Local can always help to even the number busy CPUs. */
> +		if (busy_cpus_delta >= 2)
> +			return true;
> +
> +		if (busy_cpus_delta == 1)
> +			return sched_asym_prefer(dst_cpu,
> +						 sg->asym_prefer_cpu);
> +
> +		return false;
> +	}
> +
> +	/*
> +	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
> +	 * up with more than one busy SMT sibling and only pull tasks if there
> +	 * are not busy CPUs. As CPUs move in and out of idle state frequently,
> +	 * also check the group utilization to smoother the decision.

nit: s/smoother/smoothen/

> +	 */
> +	if (!local_busy_cpus && !sds->local_stat.group_util)
> +		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);

Hmm, I am not sure, but is it possible that there are some local_busy_cpus yet
group_util is 0? If not, just check for !group_util?

FWIW on the series:
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel

> +
> +	return false;
> +#else
> +	return true;
> +#endif
> +}
> +
>  static inline bool
>  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
>  	   struct sched_group *group)
> @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
>  	if (group == sds->local)
>  		return false;
>  
> +	if (arch_asym_check_smt_siblings())
> +		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
>  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }
>  
> @@ -9463,6 +9558,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		    nr_running == 1)
>  			continue;
>  
> +		/* Make sure we only pull tasks from a CPU of lower priority */
> +		if ((env->sd->flags & SD_ASYM_PACKING) &&
> +		    sched_asym_prefer(i, env->dst_cpu) &&
> +		    nr_running == 1)
> +			continue;
> +
>  		switch (env->migration_type) {
>  		case migrate_load:
>  			/*
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-17 22:28   ` Joel Fernandes
@ 2021-05-17 22:34     ` Joel Fernandes
  0 siblings, 0 replies; 23+ messages in thread
From: Joel Fernandes @ 2021-05-17 22:34 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, LKML, Aubrey Li,
	Daniel Bristot de Oliveira

On Mon, May 17, 2021 at 6:28 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > check for the idle state of the destination CPU, dst_cpu, but also of
> > its SMT siblings.
> >
[...]
> > +     if (!local_is_smt) {
> > +             /*
> > +              * If we are here, @dst_cpu is idle and does not have SMT
> > +              * siblings. Pull tasks if candidate group has two or more
> > +              * busy CPUs.
> > +              */
> > +             if (sg_is_smt && sg_busy_cpus >= 2)
> > +                     return true;
> > +
> > +             /*
> > +              * @dst_cpu does not have SMT siblings. @sg may have SMT
> > +              * siblings and only one is busy. In such case, @dst_cpu
> > +              * can help if it has higher priority and is idle.
> > +              */
> > +             return !sds->local_stat.group_util &&
> > +                    sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > +     }
> > +
> > +     /* @dst_cpu has SMT siblings. */
> > +
> > +     local_busy_cpus = sds->local->group_weight - sds->local_stat.idle_cpus;
> > +
> > +     if (sg_is_smt) {
> > +             int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > +
> > +             /* Local can always help to even the number busy CPUs. */
> > +             if (busy_cpus_delta >= 2)
> > +                     return true;
> > +
> > +             if (busy_cpus_delta == 1)
> > +                     return sched_asym_prefer(dst_cpu,
> > +                                              sg->asym_prefer_cpu);
> > +
> > +             return false;
> > +     }
> > +
> > +     /*
> > +      * @sg does not have SMT siblings. Ensure that @sds::local does not end
> > +      * up with more than one busy SMT sibling and only pull tasks if there
> > +      * are not busy CPUs. As CPUs move in and out of idle state frequently,
> > +      * also check the group utilization to smoother the decision.
>
> nit: s/smoother/smoothen/
>
> > +      */
> > +     if (!local_busy_cpus && !sds->local_stat.group_util)
> > +             return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
>
> Hmm, I am not sure, but is it possible that there are some local_busy_cpus yet
> group_util is 0? If not, just check for !group_util?

Sorry - I meant here, "yet group_util is not 0..."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-15  2:14     ` Ricardo Neri
@ 2021-05-18 19:07         ` Ricardo Neri
  0 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-18 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira,
	Michael Ellerman, linuxppc-dev

On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> > On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > >  include/linux/sched/topology.h |   1 +
> > >  kernel/sched/fair.c            | 101 +++++++++++++++++++++++++++++++++
> > >  2 files changed, 102 insertions(+)
> > > 
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index 8f0f778b7c91..43bdb8b1e1df 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> > >  #endif
> > >  
> > >  extern int arch_asym_cpu_priority(int cpu);
> > > +extern bool arch_asym_check_smt_siblings(void);
> > >  
> > >  struct sched_domain_attr {
> > >  	int relax_domain_level;
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c8b66a5d593e..3d6cc027e6e6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > >  	return -cpu;
> > >  }
> > >  
> > > +/*
> > > + * For asym packing, first check the state of SMT siblings before deciding to
> > > + * pull tasks.
> > > + */
> > > +bool __weak arch_asym_check_smt_siblings(void)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > >  /*
> > >   * The margin used when comparing utilization with CPU capacity.
> > >   *
> > 
> > > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
> > >  	if (group == sds->local)
> > >  		return false;
> > >  
> > > +	if (arch_asym_check_smt_siblings())
> > > +		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > > +
> > >  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> > >  }
> > 
> > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > level, rather than some arch special. Wouldn't something like this be
> > more appropriate?
> > 
> > ---
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >  
> >  extern int arch_asym_cpu_priority(int cpu);
> > -extern bool arch_asym_check_smt_siblings(void);
> >  
> >  struct sched_domain_attr {
> >  	int relax_domain_level;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
> >  }
> >  
> >  /*
> > - * For asym packing, first check the state of SMT siblings before deciding to
> > - * pull tasks.
> > - */
> > -bool __weak arch_asym_check_smt_siblings(void)
> > -{
> > -	return false;
> > -}
> > -
> > -/*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> >   * (default: ~20%)
> > @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
> >  	if (group == sds->local)
> >  		return false;
> >  
> > -	if (arch_asym_check_smt_siblings())
> > +	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> > +	    (group->flags & SD_SHARE_CPUCAPACITY))
> >  		return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> 
> Thanks Peter for the quick review! This makes sense to me. The only
> reason we proposed arch_asym_check_smt_siblings() is because we were
> worried about breaking powerpc (I need to study how they set priorities for SMT,
> if applicable). If you think this is not an issue I can post a
> v4 with this update.

As far as I can see, priorities in powerpc are set by the CPU number.
However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
[1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
core would need to be busy before a new core is used.
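
One way to check the enumeration on a given machine is to read the sibling
masks the kernel exports through sysfs. A minimal sketch in userspace C (the
sysfs path is standard; the loop bound and output format are only
illustrative):

#include <stdio.h>

int main(void)
{
	char path[128], buf[64];
	FILE *f;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		/* thread_siblings_list names the SMT siblings of this CPU */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
			 cpu);
		f = fopen(path, "r");
		if (!f)
			continue;	/* offline CPU or beyond the last one */
		if (fgets(buf, sizeof(buf), f))
			printf("cpu%d: siblings %s", cpu, buf);
		fclose(f);
	}
	return 0;
}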

Still, I think the issue described in the cover letter may be
reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
able to help since CPU0 has the highest priority.
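
That scenario can be modelled with a few lines of standalone C, assuming the
default arch_asym_cpu_priority() of -cpu shown in the hunk quoted earlier and
enumeration B above; the CPU numbers are hypothetical:

#include <stdbool.h>
#include <stdio.h>

/* Default priority from the quoted hunk: a lower CPU number ranks higher. */
static int asym_prio(int cpu)
{
	return -cpu;
}

/* Modelled after sched_asym_prefer(): true if @a outranks @b. */
static bool prefer(int a, int b)
{
	return asym_prio(a) > asym_prio(b);
}

int main(void)
{
	int busiest_preferred = 0;	/* core [0, 2]: preferred CPU is 0 */
	int dst_cpu = 1;		/* idle CPU 1 tries to help */

	/* Prints "no": CPU1 cannot pull because CPU0 has higher priority. */
	printf("CPU%d can pull: %s\n", dst_cpu,
	       prefer(dst_cpu, busiest_preferred) ? "yes" : "no");
	return 0;
}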

I am cc'ing the linuxppc list to get some feedback.

Thanks and BR,
Ricardo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-17 15:18   ` Dietmar Eggemann
@ 2021-05-18 19:10     ` Ricardo Neri
  0 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-18 19:10 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Len Brown, Srinivas Pandruvada, Tim Chen,
	Aubrey Li, Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On Mon, May 17, 2021 at 05:18:29PM +0200, Dietmar Eggemann wrote:
> On 13/05/2021 17:49, Ricardo Neri wrote:
> 
> [...]
> 
> > @@ -8447,6 +8456,89 @@ group_type group_classify(unsigned int imbalance_pct,
> >  	return group_has_spare;
> >  }
> >  
> > +/**
> > + * asym_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> > + * @dst_cpu:	Destination CPU of the load balancing
> > + * @sds:	Load-balancing data with statistics of the local group
> > + * @sgs:	Load-balancing statistics of the candidate busiest group
> > + * @sg:		The candidate busiet group
> > + *
> > + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > + *
> > + * If both @dst_cpu and @sg have SMT siblings. Even the number of idle CPUs
> > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> > + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> > + * has lower priority.
> > + */
> > +static bool asym_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > +				struct sg_lb_stats *sgs, struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > +	int cpu, local_busy_cpus, sg_busy_cpus;
> > +	bool local_is_smt, sg_is_smt;
> > +
> > +	cpu = group_first_cpu(sg);
> 
> Looks like `cpu` isn't used.

Thank you very much for your feedback, Dietmar!

Ah! That is correct. I will remove this variable.

Thanks and BR,
Ricardo
> 
> [...]
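
For readers following the thread, the decision rules in the kernel-doc quoted
above can be modelled as a small standalone program. This is only a sketch of
the described logic with made-up inputs, not the patch code itself:

#include <stdbool.h>
#include <stdio.h>

struct grp {
	bool smt;		/* group spans SMT siblings */
	int busy_cpus;		/* non-idle CPUs in the group */
	unsigned long util;	/* group utilization */
	int preferred_prio;	/* priority of the group's preferred CPU */
};

/* Sketch of the pull decision for an idle @dst CPU with priority @dst_prio. */
static bool can_pull(int dst_prio, struct grp *local, struct grp *sg)
{
	if (!local->smt) {
		/* Non-SMT destination: pull if the SMT candidate has 2+ busy CPUs. */
		if (sg->smt && sg->busy_cpus >= 2)
			return true;
		/* Only one busy CPU in @sg: pull only if idle and higher priority. */
		return !local->util && dst_prio > sg->preferred_prio;
	}

	if (sg->smt) {
		/* Both groups are SMT: even out the number of busy CPUs. */
		int delta = sg->busy_cpus - local->busy_cpus;

		if (delta >= 2)
			return true;
		if (delta == 1)
			return dst_prio > sg->preferred_prio;
		return false;
	}

	/* SMT destination, non-SMT candidate: all local siblings must be idle. */
	return !local->busy_cpus && !local->util &&
	       dst_prio > sg->preferred_prio;
}

int main(void)
{
	struct grp local = { .smt = true, .busy_cpus = 0, .util = 0 };
	struct grp sg    = { .smt = true, .busy_cpus = 2, .util = 512,
			     .preferred_prio = 1 };

	/* Prints "yes": the busy-CPU delta is 2, so the idle group helps. */
	printf("pull: %s\n", can_pull(2, &local, &sg) ? "yes" : "no");
	return 0;
}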

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing
  2021-05-17 14:21   ` Dietmar Eggemann
@ 2021-05-18 19:18     ` Ricardo Neri
  0 siblings, 0 replies; 23+ messages in thread
From: Ricardo Neri @ 2021-05-18 19:18 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra (Intel),
	Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Len Brown, Srinivas Pandruvada, Tim Chen,
	Aubrey Li, Ravi V. Shankar, Ricardo Neri, Quentin Perret,
	Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira

On Mon, May 17, 2021 at 04:21:36PM +0200, Dietmar Eggemann wrote:
> On 13/05/2021 17:49, Ricardo Neri wrote:
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c8c04e9d0d3b..c8b66a5d593e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8447,6 +8447,20 @@ group_type group_classify(unsigned int imbalance_pct,
> >  	return group_has_spare;
> >  }
> >  
> > +static inline bool
> > +sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
> > +	   struct sched_group *group)
> > +{
> > +	/*
> > +	 * Because sd->groups starts with the local group, anything that isn't
> > +	 * the local group will have access to the local state.
> > +	 */
> > +	if (group == sds->local)
> > +		return false;
> 
> sched_asym() is called under if(!local_group ...) from
> update_sg_lb_stats(). So why checking this here again?

This is true. It looks to me that this check is not needed. It makes
sense to me to keep the check in update_sg_lb_stats() so that we can
avoid the extra checks, right?

Thanks and BR,
Ricardo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-18 19:07         ` Ricardo Neri
@ 2021-05-19  9:59           ` Peter Zijlstra
  -1 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2021-05-19  9:59 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Len Brown,
	Srinivas Pandruvada, Tim Chen, Aubrey Li, Ravi V. Shankar,
	Ricardo Neri, Quentin Perret, Joel Fernandes (Google),
	linux-kernel, Aubrey Li, Daniel Bristot de Oliveira,
	Michael Ellerman, linuxppc-dev

On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:

> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > level, rather than some arch special. Wouldn't something like this be
> > > more appropriate?

> > Thanks Peter for the quick review! This makes sense to me. The only
> > reason we proposed arch_asym_check_smt_siblings() is because we were
> > worried about breaking powerpc (I need to study how they set priorities for SMT,
> > if applicable). If you think this is not an issue I can post a
> > v4 with this update.
> 
> As far as I can see, priorities in powerpc are set by the CPU number.
> However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
> SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> core would need to be busy before a new core is used.
> 
> Still, I think the issue described in the cover letter may be
> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> able to help since CPU0 has the highest priority.
> 
> I am cc'ing the linuxppc list to get some feedback.

IIRC the concern with Power is that their Cores can go faster if the
higher SMT siblings are unused.

That is, suppose you have an SMT4 Core with only a single active task,
then if only SMT0 is used it can reach max performance, but if the
active sibling is SMT1 it can not reach max performance, and if the only
active sibling is SMT2 it goes slower still.

So they need to pack the tasks to the lowest SMT siblings, and have the
highest SMT siblings idle (where possible) in order to increase
performance.
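
As a trivial illustration of that packing preference, a standalone sketch that
picks the lowest-numbered idle sibling within a core (the idle mask here is a
made-up example, not read from hardware):

#include <stdio.h>

/* Return the lowest set bit (idle sibling index) in @idle_mask, or -1. */
static int lowest_idle_sibling(unsigned int idle_mask)
{
	int i;

	for (i = 0; i < 8 * (int)sizeof(idle_mask); i++)
		if (idle_mask & (1u << i))
			return i;
	return -1;
}

int main(void)
{
	/* SMT4 core: siblings 1 and 3 idle, siblings 0 and 2 busy. */
	unsigned int idle = (1u << 1) | (1u << 3);

	/* Prints 1: the task is packed onto the lowest idle sibling. */
	printf("place next task on sibling %d\n", lowest_idle_sibling(idle));
	return 0;
}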



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-19  9:59           ` Peter Zijlstra
@ 2021-05-19 11:09             ` Nicholas Piggin
  -1 siblings, 0 replies; 23+ messages in thread
From: Nicholas Piggin @ 2021-05-19 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Ricardo Neri
  Cc: Aubrey Li, Aubrey Li, Daniel Bristot de Oliveira, Ben Segall,
	Dietmar Eggemann, Joel Fernandes (Google),
	Juri Lelli, Len Brown, linux-kernel, linuxppc-dev, Mel Gorman,
	Ingo Molnar, Quentin Perret, Ravi V. Shankar, Ricardo Neri,
	Steven Rostedt, Srinivas Pandruvada, Tim Chen, Vincent Guittot

Excerpts from Peter Zijlstra's message of May 19, 2021 7:59 pm:
> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
>> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
>> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> 
>> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
>> > > level, rather than some arch special. Wouldn't something like this be
>> > > more appropriate?
> 
>> > Thanks Peter for the quick review! This makes sense to me. The only
>> > reason we proposed arch_asym_check_smt_siblings() is because we were
>> > worried about breaking powerpc (I need to study how they set priorities for SMT,
>> > if applicable). If you think this is not an issue I can post a
>> > v4 with this update.
>> 
>> As far as I can see, priorities in powerpc are set by the CPU number.
>> However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
>> SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
>> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
>> core would need to be busy before a new core is used.
>> 
>> Still, I think the issue described in the cover letter may be
>> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
>> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
>> able to help since CPU0 has the highest priority.
>> 
>> I am cc'ing the linuxppc list to get some feedback.
> 
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
> 
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it can not reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
> 
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.

That's correct.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance
  2021-05-19  9:59           ` Peter Zijlstra
@ 2021-05-19 12:05             ` Srikar Dronamraju
  -1 siblings, 0 replies; 23+ messages in thread
From: Srikar Dronamraju @ 2021-05-19 12:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ricardo Neri, Juri Lelli, Len Brown, Quentin Perret,
	Vincent Guittot, Ravi V. Shankar, linux-kernel, Aubrey Li,
	Daniel Bristot de Oliveira, Ricardo Neri, Steven Rostedt,
	Dietmar Eggemann, Ben Segall, linuxppc-dev, Mel Gorman,
	Srinivas Pandruvada, Joel Fernandes (Google),
	Tim Chen, Ingo Molnar, Aubrey Li

* Peter Zijlstra <peterz@infradead.org> [2021-05-19 11:59:48]:

> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> 
> > > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > > level, rather than some arch special. Wouldn't something like this be
> > > > more appropriate?
> 
> > > Thanks Peter for the quick review! This makes sense to me. The only
> > > reason we proposed arch_asym_check_smt_siblings() is because we were
> > > worried about breaking powerpc (I need to study how they set priorities for SMT,
> > > if applicable). If you think this is not an issue I can post a
> > > v4 with this update.
> > 
> > As far as I can see, priorities in powerpc are set by the CPU number.
> > However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
> > SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
> > [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> > core would need to be busy before a new core is used.
> > 
> > Still, I think the issue described in the cover letter may be
> > reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> > tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> > able to help since CPU0 has the highest priority.
> > 
> > I am cc'ing the linuxppc list to get some feedback.
> 
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
> 
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it can not reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
> 
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.
> 
> 

If you are referring to SD_ASYM_PACKING, then packing tasks to the lowest SMT
sibling was needed in the POWER7 timeframe. So if there was one thread running,
running on the lowest sibling provided the best performance, as if running in
single-threaded mode.

However, recent chips like POWER8/POWER9/POWER10 don't need SD_ASYM_PACKING
since the hardware itself does the switch. So even if a task is placed on a
higher sibling within the core, we don't need to switch the task to a lower
sibling for it to perform better. Now, with only one thread running on any
sibling, it is expected to run in single-threaded mode.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-05-19 12:09 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-13 15:49 [PATCH v3 0/6] sched/fair: Fix load balancing of SMT siblings with ASYM_PACKING Ricardo Neri
2021-05-13 15:49 ` [PATCH v3 1/6] sched/topology: Introduce sched_group::flags Ricardo Neri
2021-05-13 15:49 ` [PATCH v3 2/6] sched/fair: Optimize checking for group_asym_packing Ricardo Neri
2021-05-13 15:49 ` [PATCH v3 3/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics Ricardo Neri
2021-05-13 15:49 ` [PATCH v3 4/6] sched/fair: Carve out logic to mark a group for asymmetric packing Ricardo Neri
2021-05-17 14:21   ` Dietmar Eggemann
2021-05-18 19:18     ` Ricardo Neri
2021-05-13 15:49 ` [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance Ricardo Neri
2021-05-14  9:47   ` Peter Zijlstra
2021-05-15  2:14     ` Ricardo Neri
2021-05-18 19:07       ` Ricardo Neri
2021-05-18 19:07         ` Ricardo Neri
2021-05-19  9:59         ` Peter Zijlstra
2021-05-19  9:59           ` Peter Zijlstra
2021-05-19 11:09           ` Nicholas Piggin
2021-05-19 11:09             ` Nicholas Piggin
2021-05-19 12:05           ` Srikar Dronamraju
2021-05-19 12:05             ` Srikar Dronamraju
2021-05-17 15:18   ` Dietmar Eggemann
2021-05-18 19:10     ` Ricardo Neri
2021-05-17 22:28   ` Joel Fernandes
2021-05-17 22:34     ` Joel Fernandes
2021-05-13 15:49 ` [PATCH v3 6/6] x86/sched: Enable SMT checks for asymmetric packing in load balancing Ricardo Neri
