* [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
@ 2023-04-06 20:31 Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA Ricardo Neri
                   ` (12 more replies)
  0 siblings, 13 replies; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri

Hi,

This is v4 of this series. Previous versions can be found in [1], [2],
and [3]. To avoid duplication, I do not include the cover letter of the
original submission; you can read it in [1].

This patchset applies cleanly on today's master branch of the tip tree.

Changes since v3:

Nobody liked the proposed changes to the setting of prefer_sibling.
Instead, I tweaked the solution that Dietmar proposed. Now the busiest
group, not the local group, determines the setting of prefer_sibling.

Vincent suggested improvements to the logic that decides whether to follow
asym_packing priorities. Peter suggested wrapping that logic in a helper
function; I added sched_use_asym_prio().

Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
does not assume that all child domains have the requested flag.

Tim found that asym_active_balance() also needs to check the idle states
of the SMT siblings of lb_env::dst_cpu. I added such a check.

I wrongly assumed that asym_packing could only be used when the busiest
group had exactly one busy CPU. This broke asym_packing balancing at the
DIE domain. I limited this check to balances between cores at the MC
level.

As per Dietmar's suggestion, I removed asym_smt_can_pull_tasks() and
placed its logic in sched_asym(). Also, sched_asym() uses
sched_smt_active() to skip checks when they are not needed.

I also added a patch from Chen Yu to enable asym_packing balancing in
Meteor Lake, which has CPUs of different maximum frequency in more than
one die.

Hopefully, these patches are in sufficiently good shape to be merged?

Thank you for your feedback and I look forward to getting more of it!

New patches: 8, 12
Updated patches: 2, 3, 4, 6, 7
Unchanged patches: 1, 5, 9, 10, 11

BR,
Ricardo

[1]. https://lore.kernel.org/lkml/20220825225529.26465-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20221122203532.15013-1-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/20230207045838.11243-1-ricardo.neri-calderon@linux.intel.com/

Chen Yu (1):
  x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
    processors

Ricardo Neri (11):
  sched/fair: Move is_core_idle() out of CONFIG_NUMA
  sched/fair: Only do asym_packing load balancing from fully idle SMT
    cores
  sched/fair: Simplify asym_packing logic for SMT cores
  sched/fair: Let low-priority cores help high-priority busy SMT cores
  sched/fair: Keep a fully_busy SMT sched group as busiest
  sched/fair: Use the busiest group to set prefer_sibling
  sched/fair: Do not even the number of busy CPUs via asym_packing
  sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
  sched/topology: Remove SHARED_CHILD from ASYM_PACKING
  x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
  x86/sched/itmt: Give all SMT siblings of a core the same priority

 arch/x86/kernel/itmt.c         |  23 +---
 arch/x86/kernel/smpboot.c      |   4 +-
 include/linux/sched/sd_flags.h |   5 +-
 kernel/sched/fair.c            | 216 +++++++++++++++++----------------
 kernel/sched/sched.h           |  22 +++-
 5 files changed, 138 insertions(+), 132 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 02/12] sched/fair: Only do asym_packing load balancing from fully idle SMT cores Ricardo Neri
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4dc1950ae5ae..57c106fa721d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+
+		if (!idle_cpu(sibling))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
 #ifdef CONFIG_NUMA
 #define NUMA_IMBALANCE_MIN 2
 
@@ -1700,23 +1717,6 @@ struct numa_stats {
 	int idle_cpu;
 };
 
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
-	int sibling;
-
-	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
-		if (cpu == sibling)
-			continue;
-
-		if (!idle_cpu(sibling))
-			return false;
-	}
-#endif
-
-	return true;
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 02/12] sched/fair: Only do asym_packing load balancing from fully idle SMT cores
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 03/12] sched/fair: Simplify asym_packing logic for " Ricardo Neri
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings while the overall throughput of the
system remains the same.

When balancing load within an SMT core this consideration is not relevant.
Follow the priorities that hardware indicates.
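
For illustration, the decision this patch encodes can be modeled in plain
C. This is a minimal user-space sketch; the boolean inputs stand in for
the kernel's domain-flag and sibling-idle checks, and sched_use_asym_prio()
in the diff below is the authoritative version:

#include <stdbool.h>
#include <stdio.h>

static bool use_asym_prio(bool smt_active, bool sd_is_smt_level,
                          bool dst_core_fully_idle)
{
        if (!smt_active)        /* no SMT: every CPU is a whole core */
                return true;
        if (sd_is_smt_level)    /* balancing between siblings of one core */
                return true;
        /*
         * Between cores: follow priority only if the destination core is
         * fully idle; otherwise a pulled task would merely split the
         * core's throughput with its busy siblings.
         */
        return dst_core_fully_idle;
}

int main(void)
{
        /* SMT on, core-level balance, one destination sibling busy: skip. */
        printf("%d\n", use_asym_prio(true, false, false)); /* prints 0 */
        return 0;
}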

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Improved the logic to determine whether CPU priority should be followed.
   Also, wrapped this logic in a helper function. (Vincent G./ Peter)
 * Used sched_smt_active() to avoid pointless calls of is_core_idle().
   (Dietmar)
 * Ensure that the core is idle in asym_active_balance(). (Tim)
 * Used sched_use_asym_prio() to check for fully idle SMT cores in
   sched_asym().
 * Removed check for fully idle core inside asym_smt_can_pull_tasks().
   Now that condition is verified outside the function.

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 60 +++++++++++++++++++++++++++++++++------------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57c106fa721d..ec7ddbfd1136 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9273,6 +9273,29 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+/**
+ * sched_use_asym_prio - Check whether asym_packing priority must be used
+ * @sd:		The scheduling domain of the load balancing
+ * @cpu:	A CPU
+ *
+ * Always use CPU priority when balancing load between SMT siblings. When
+ * balancing load between cores, it is not sufficient that @cpu is idle. Only
+ * use CPU priority if the whole core is idle.
+ *
+ * Returns: True if the priority of @cpu must be followed. False otherwise.
+ */
+static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (!sched_smt_active())
+		return true;
+
+	return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+#else
+	return true;
+#endif
+}
+
 /**
  * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
  * @dst_cpu:	Destination CPU of the load balancing
@@ -9283,6 +9306,9 @@ group_type group_classify(unsigned int imbalance_pct,
  * Check the state of the SMT siblings of both @sds::local and @sg and decide
  * if @dst_cpu can pull tasks.
  *
+ * This function must be called only if all the SMT siblings of @dst_cpu,
+ * if any, are idle.
+ *
  * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
  * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
  * only if @dst_cpu has higher priority.
@@ -9292,8 +9318,7 @@ group_type group_classify(unsigned int imbalance_pct,
  * Bigger imbalances in the number of busy CPUs will be dealt with in
  * update_sd_pick_busiest().
  *
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
+ * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
  *
  * Return: true if @dst_cpu can pull tasks, false otherwise.
  */
@@ -9341,15 +9366,8 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 		return false;
 	}
 
-	/*
-	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
-	 * up with more than one busy SMT sibling and only pull tasks if there
-	 * are not busy CPUs (i.e., no CPU has running tasks).
-	 */
-	if (!sds->local_stat.sum_nr_running)
-		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-	return false;
+	/* If we are here, @dst_cpu has SMT siblings and they are also idle. */
+	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 #else
 	/* Always return false so that callers deal with non-SMT cases. */
 	return false;
@@ -9360,7 +9378,11 @@ static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
 {
-	/* Only do SMT checks if either local or candidate have SMT siblings */
+	/* Ensure that the whole local core is idle, if applicable. */
+	if (!sched_use_asym_prio(env->sd, env->dst_cpu))
+		return false;
+
+	/* Only do SMT checks if either local or candidate have SMT siblings. */
 	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
 	    (group->flags & SD_SHARE_CPUCAPACITY))
 		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
@@ -10565,11 +10587,13 @@ static inline bool
 asym_active_balance(struct lb_env *env)
 {
 	/*
-	 * ASYM_PACKING needs to force migrate tasks from busy but
-	 * lower priority CPUs in order to pack all tasks in the
-	 * highest priority CPUs.
+	 * ASYM_PACKING needs to force migrate tasks from busy but lower
+	 * priority CPUs in order to pack all tasks in the highest priority
+	 * CPUs. When done between cores, do it only if the whole core is
+	 * idle.
 	 */
 	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
+	       sched_use_asym_prio(env->sd, env->dst_cpu) &&
 	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
 }
 
@@ -11304,9 +11328,13 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_PACKING; see if there's a more preferred CPU
 		 * currently idle; in which case, kick the ILB to move tasks
 		 * around.
+		 *
+		 * When balancing between cores, all the SMT siblings of the
+		 * preferred CPU must be idle.
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
-			if (sched_asym_prefer(i, cpu)) {
+			if (sched_use_asym_prio(sd, i) &&
+			    sched_asym_prefer(i, cpu)) {
 				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 				goto unlock;
 			}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 03/12] sched/fair: Simplify asym_packing logic for SMT cores
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 02/12] sched/fair: Only do asym_packing load balancing from fully idle SMT cores Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 04/12] sched/fair: Let low-priority cores help high-priority busy " Ricardo Neri
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

Callers of asym_smt_can_pull_tasks() check the idle state of the
destination CPU and its SMT siblings, if any. No extra checks are needed
in the function itself.

Since SMT cores divide capacity among their siblings, priorities only
really make sense when exactly one sibling is active. This is true for
SMT2, SMT4, SMT8, etc. Do not use asym_packing load balancing in this
case. Instead, let find_busiest_group() handle imbalances.
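
A back-of-the-envelope illustration, assuming (as a simplification) that a
core divides its throughput evenly among busy siblings:

#include <stdio.h>

int main(void)
{
        int busy;

        /*
         * With n busy siblings sharing one core, each runs at roughly 1/n
         * of the core's throughput, regardless of the priorities the
         * hardware reports for the individual siblings.
         */
        for (busy = 1; busy <= 4; busy++)
                printf("%d busy sibling(s): ~%.0f%% of core throughput each\n",
                       busy, 100.0 / busy);
        return 0;
}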

When balancing non-SMT cores or at higher scheduling domains (e.g.,
between MC scheduling groups), continue using priorities.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Dropped checks for a destination SMT core.
 * Limited the check of a single busy CPU to SMT cores. This fixes a bug
   when balancing between MC scheduling groups of different priority.

Changes since v2:
 * Updated documentation of the function to reflect the new behavior.
   (Dietmar)

Changes since v1:
 * Reworded commit message and inline comments for clarity.
 * Stated that this changeset does not impact SMT4 or SMT8.
---
 kernel/sched/fair.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec7ddbfd1136..b6bbe0300635 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9313,12 +9313,9 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
  * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
  * only if @dst_cpu has higher priority.
  *
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
- *
- * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
+ * When dealing with SMT cores, only use priorities if the SMT core has exactly
+ * one busy sibling. find_busiest_group() will handle bigger imbalances in the
+ * number of busy CPUs.
  *
  * Return: true if @dst_cpu can pull tasks, false otherwise.
  */
@@ -9327,12 +9324,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 				    struct sched_group *sg)
 {
 #ifdef CONFIG_SCHED_SMT
-	bool local_is_smt, sg_is_smt;
+	bool local_is_smt;
 	int sg_busy_cpus;
 
 	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
 	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
 
 	if (!local_is_smt) {
@@ -9353,21 +9348,17 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 	}
 
-	/* @dst_cpu has SMT siblings. */
-
-	if (sg_is_smt) {
-		int local_busy_cpus = sds->local->group_weight -
-				      sds->local_stat.idle_cpus;
-		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
-		if (busy_cpus_delta == 1)
-			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
+	/*
+	 * If we are here, @dst_cpu has SMT siblings and they are also idle.
+	 *
+	 * CPU priorities do not make sense for SMT cores with more than one
+	 * busy sibling.
+	 */
+	if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
 		return false;
-	}
 
-	/* If we are here, @dst_cpu has SMT siblings and they are also idle. */
 	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
 #else
 	/* Always return false so that callers deal with non-SMT cases. */
 	return false;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 04/12] sched/fair: Let low-priority cores help high-priority busy SMT cores
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (2 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 03/12] sched/fair: Simplify asym_packing logic for " Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest Ricardo Neri
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.

When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.

Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.

Make find_busiest_queue() not skip higher-priority SMT cores with more
than one busy sibling.
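
The extended condition can be sketched as a plain boolean function. This
is illustrative only; use_prio_dst and use_prio_src stand in for
sched_use_asym_prio() evaluated on the destination and source CPUs:

#include <stdbool.h>
#include <stdio.h>

static bool asym_active_balance(bool dst_idle, bool sd_asym_packing,
                                bool use_prio_dst, bool dst_higher_prio,
                                bool use_prio_src)
{
        /*
         * A lower-priority destination may still pull when priorities
         * cannot be used for the source, i.e., when the source is an SMT
         * core with busy siblings.
         */
        return dst_idle && sd_asym_packing && use_prio_dst &&
               (dst_higher_prio || !use_prio_src);
}

int main(void)
{
        /* Idle low-priority core helping a busy SMT core: balance. */
        printf("%d\n", asym_active_balance(true, true, true, false, false));
        return 0;
}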

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Improved the logic to determine whether CPU priority should be followed.
   Also, wrapped this logic in a helper function. (Vincent G./ Peter)

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6bbe0300635..b151e93ec316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10488,8 +10488,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    nr_running == 1)
 			continue;
 
-		/* Make sure we only pull tasks from a CPU of lower priority */
+		/*
+		 * Make sure we only pull tasks from a CPU of lower priority
+		 * when balancing between SMT siblings.
+		 *
+		 * If balancing between cores, let lower priority CPUs help
+		 * SMT cores with more than one busy sibling.
+		 */
 		if ((env->sd->flags & SD_ASYM_PACKING) &&
+		    sched_use_asym_prio(env->sd, i) &&
 		    sched_asym_prefer(i, env->dst_cpu) &&
 		    nr_running == 1)
 			continue;
@@ -10582,10 +10589,15 @@ asym_active_balance(struct lb_env *env)
 	 * priority CPUs in order to pack all tasks in the highest priority
 	 * CPUs. When done between cores, do it only if the whole core is
 	 * idle.
+	 *
+	 * If @env::src_cpu is an SMT core with busy siblings, let
+	 * the lower priority @env::dst_cpu help it. Do not follow
+	 * CPU priority.
 	 */
 	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
 	       sched_use_asym_prio(env->sd, env->dst_cpu) &&
-	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
+	       (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+		!sched_use_asym_prio(env->sd, env->src_cpu));
 }
 
 static inline bool
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (3 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 04/12] sched/fair: Let low-priority cores help high-priority busy " Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-05-12 18:41   ` [PATCH v4 05/12] " Shrikanth Hegde
  2023-04-06 20:31 ` [PATCH v4 06/12] sched/fair: Use the busiest group to set prefer_sibling Ricardo Neri
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such a scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.
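
The tie-break can be summarized with a small model. This is a sketch of
the new fully_busy branch; the simplified signature is illustrative, not
the kernel code:

#include <stdbool.h>
#include <stdio.h>

static bool pick_candidate_as_busiest(unsigned long sgs_avg_load,
                                      unsigned long busiest_avg_load,
                                      bool busiest_is_smt)
{
        if (sgs_avg_load < busiest_avg_load)
                return false;
        /*
         * On a tie (avg_load is currently always 0 for fully_busy
         * groups), keep the current busiest if it is an SMT core.
         */
        if (sgs_avg_load == busiest_avg_load && busiest_is_smt)
                return false;
        return true;
}

int main(void)
{
        printf("%d\n", pick_candidate_as_busiest(0, 0, true)); /* 0: keep it */
        return 0;
}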

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b151e93ec316..ea23a5163bfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9566,10 +9566,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * contention when accessing shared HW resources.
 		 *
 		 * XXX for now avg_load is not computed and always 0 so we
-		 * select the 1st one.
+		 * select the 1st one, except if @sg is composed of SMT
+		 * siblings.
 		 */
-		if (sgs->avg_load <= busiest->avg_load)
+
+		if (sgs->avg_load < busiest->avg_load)
 			return false;
+
+		if (sgs->avg_load == busiest->avg_load) {
+			/*
+			 * SMT sched groups need more help than non-SMT groups.
+			 * If @sg happens to also be SMT, either choice is good.
+			 */
+			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+				return false;
+		}
+
 		break;
 
 	case group_has_spare:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 06/12] sched/fair: Use the busiest group to set prefer_sibling
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (4 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 07/12] sched/fair: Do not even the number of busy CPUs via asym_packing Ricardo Neri
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done as requested by the child of the
busiest group's sched domain, not the local group's.

Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.

There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's
child domain will not be honored. We are left with a fully busy SMT core
and an idle non-SMT core.
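
To make the scenario concrete, here is a minimal sketch. The flag value
and the struct are illustrative stand-ins, not the kernel definitions; the
one fact carried over is that a sched group's flags are those of its child
domain:

#include <stdbool.h>
#include <stdio.h>

#define SD_PREFER_SIBLING 0x1   /* illustrative value only */

struct group {
        int flags;              /* flags of the group's child domain */
};

static bool prefer_sibling(const struct group *busiest)
{
        return busiest && (busiest->flags & SD_PREFER_SIBLING);
}

int main(void)
{
        /*
         * Busiest group: an SMT core whose child (SMT) domain sets
         * SD_PREFER_SIBLING. The local non-SMT core has no child domain,
         * so deriving the setting from the local side would yield false
         * and leave the SMT core fully busy.
         */
        struct group smt_core = { .flags = SD_PREFER_SIBLING };

        printf("%d\n", prefer_sibling(&smt_core)); /* prints 1 */
        return 0;
}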

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Dropped patch that set prefer_sibling from the current sched domain
   level. Instead, uses the flags of the busiest group's child sched
   domain. (Vincent G, PeterZ, Dietmar)

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea23a5163bfa..477cb5777036 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10056,7 +10056,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10097,8 +10096,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	/*
+	 * Indicate that the child domain of the busiest group prefers tasks
+	 * go to a child's sibling domains first. NB the flags of a sched group
+	 * are those of the child domain.
+	 */
+	if (sds->busiest)
+		sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
 
 
 	if (env->sd->flags & SD_NUMA)
@@ -10398,7 +10402,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 			goto out_balanced;
 	}
 
-	/* Try to move all excess tasks to child's sibling domain */
+	/*
+	 * Try to move all excess tasks to a sibling domain of the busiest
+	 * group's child domain.
+	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 07/12] sched/fair: Do not even the number of busy CPUs via asym_packing
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (5 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 06/12] sched/fair: Use the busiest group to set prefer_sibling Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 08/12] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain() Ricardo Neri
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

Now that find_busiest_group() triggers load balancing between a
fully_busy SMT2 core and an idle non-SMT core, there is no longer a need
to force balancing via asym_packing. Use asym_packing only as intended:
when there is a high-priority CPU that is idle.

After this change, the same logic applies to SMT and non-SMT local groups.
It makes little sense to have a separate function to deal specifically
with SMT. Fold the logic of asym_smt_can_pull_tasks() into sched_asym().

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Removed unnecessary local variable sg_busy_cpus.
 * Folded asym_smt_can_pull_tasks() into sched_asym() and simplify
   further. (Dietmar)

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 86 +++++++++++----------------------------------
 1 file changed, 21 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 477cb5777036..145ca52f9629 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9297,74 +9297,26 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
 }
 
 /**
- * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
- * @dst_cpu:	Destination CPU of the load balancing
+ * sched_asym - Check if the destination CPU can do asym_packing load balance
+ * @env:	The load balancing environment
  * @sds:	Load-balancing data with statistics of the local group
  * @sgs:	Load-balancing statistics of the candidate busiest group
- * @sg:		The candidate busiest group
+ * @group:	The candidate busiest group
  *
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * @env::dst_cpu can do asym_packing if it has higher priority than the
+ * preferred CPU of @group.
  *
- * This function must be called only if all the SMT siblings of @dst_cpu,
- * if any, are idle.
+ * SMT is a special case. If we are balancing load between cores, @env::dst_cpu
+ * can do asym_packing balance only if all its SMT siblings are idle. Also, it
+ * can do it only if @group is an SMT group with exactly one busy CPU. Larger
+ * imbalances in the number of CPUs are dealt with in find_busiest_group().
  *
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
+ * If we are balancing load within an SMT core, or at the DIE domain level,
+ * always proceed.
  *
- * When dealing with SMT cores, only use priorities if the SMT core has exactly
- * one busy sibling. find_busiest_group() will handle bigger imbalances in the
- * number of busy CPUs.
- *
- * Return: true if @dst_cpu can pull tasks, false otherwise.
+ * Return: true if @env::dst_cpu can do asym_packing load balance. False
+ * otherwise.
  */
-static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
-				    struct sg_lb_stats *sgs,
-				    struct sched_group *sg)
-{
-#ifdef CONFIG_SCHED_SMT
-	bool local_is_smt;
-	int sg_busy_cpus;
-
-	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
-
-	if (!local_is_smt) {
-		/*
-		 * If we are here, @dst_cpu is idle and does not have SMT
-		 * siblings. Pull tasks if candidate group has two or more
-		 * busy CPUs.
-		 */
-		if (sg_busy_cpus >= 2) /* implies sg_is_smt */
-			return true;
-
-		/*
-		 * @dst_cpu does not have SMT siblings. @sg may have SMT
-		 * siblings and only one is busy. In such case, @dst_cpu
-		 * can help if it has higher priority and is idle (i.e.,
-		 * it has no running tasks).
-		 */
-		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-	}
-
-	/*
-	 * If we are here, @dst_cpu has SMT siblings and they are also idle.
-	 *
-	 * CPU priorities do not make sense for SMT cores with more than one
-	 * busy sibling.
-	 */
-	if (sg->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
-		return false;
-
-	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-#else
-	/* Always return false so that callers deal with non-SMT cases. */
-	return false;
-#endif
-}
-
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
@@ -9373,10 +9325,14 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
 	if (!sched_use_asym_prio(env->sd, env->dst_cpu))
 		return false;
 
-	/* Only do SMT checks if either local or candidate have SMT siblings. */
-	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
-	    (group->flags & SD_SHARE_CPUCAPACITY))
-		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+	/*
+	 * CPU priorities do not make sense for SMT cores with more than one
+	 * busy sibling.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY) {
+		if (sgs->group_weight - sgs->idle_cpus != 1)
+			return false;
+	}
 
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 08/12] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (6 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 07/12] sched/fair: Do not even the number of busy CPUs via asym_packing Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 09/12] sched/topology: Remove SHARED_CHILD from ASYM_PACKING Ricardo Neri
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

Do not assume that all the children of a scheduling domain have a given
flag. Make that assumption only when the flag has the SDF_SHARED_CHILD
meta flag.
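
The patch below builds SD_SHARED_CHILD_MASK at compile time from
sd_flags.h. The same X-macro idiom, reduced to a self-contained example
with made-up flag values, looks like this:

#include <stdio.h>

#define META_SHARED_CHILD       0x1
#define FLAG_A                  0x01    /* carries the meta flag */
#define FLAG_B                  0x02    /* does not */

/* Each entry contributes its value only if it carries the meta flag. */
#define SD_FLAG(name, mflags) ((name) * !!((mflags) & META_SHARED_CHILD)) |
static const unsigned int SHARED_CHILD_MASK =
        SD_FLAG(FLAG_A, META_SHARED_CHILD)
        SD_FLAG(FLAG_B, 0)
        0;
#undef SD_FLAG

int main(void)
{
        printf("0x%x\n", SHARED_CHILD_MASK);    /* prints 0x1: only FLAG_A */
        return 0;
}

In the kernel, the entries come from including linux/sched/sd_flags.h
instead of being written out by hand.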

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Introduced this patch.

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 kernel/sched/sched.h | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 060616944d7a..70abce91b549 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1772,6 +1772,13 @@ queue_balance_callback(struct rq *rq,
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)
 
+/* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */
+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to
@@ -1779,16 +1786,25 @@ queue_balance_callback(struct rq *rq,
  * @flag:	The flag to check for the highest sched_domain
  *		for the given CPU.
  *
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains @flag. If @flag has
+ * the SDF_SHARED_CHILD metaflag, all the child domains also have @flag.
  */
 static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 {
 	struct sched_domain *sd, *hsd = NULL;
 
 	for_each_domain(cpu, sd) {
-		if (!(sd->flags & flag))
+		if (sd->flags & flag) {
+			hsd = sd;
+			continue;
+		}
+
+		/*
+		 * Stop the search if @flag is known to be shared at lower
+		 * levels. It will not be found further up.
+		 */
+		if (flag & SD_SHARED_CHILD_MASK)
 			break;
-		hsd = sd;
 	}
 
 	return hsd;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 09/12] sched/topology: Remove SHARED_CHILD from ASYM_PACKING
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (7 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 08/12] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain() Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 10/12] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags Ricardo Neri
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

Only x86 and Power7 use ASYM_PACKING. They use it differently.

Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.

X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.

Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Removed spurious space.

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 include/linux/sched/sd_flags.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..fad77b5172e2 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
 /*
  * Place busy tasks earlier in the domain
  *
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- *               up, but currently assumed to be set from the base domain
- *               upwards (see update_top_cache_domain()).
  * NEEDS_GROUPS: Load balancing flag.
  */
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
 
 /*
  * Prefer to place tasks in a sibling domain
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 10/12] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (8 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 09/12] sched/topology: Remove SHARED_CHILD from ASYM_PACKING Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 11/12] x86/sched/itmt: Give all SMT siblings of a core the same priority Ricardo Neri
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch.
---
 arch/x86/kernel/smpboot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 851477f7d728..cf7f42189e86 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -552,7 +552,7 @@ static int x86_core_flags(void)
 #ifdef CONFIG_SCHED_SMT
 static int x86_smt_flags(void)
 {
-	return cpu_smt_flags() | x86_sched_itmt_flags();
+	return cpu_smt_flags();
 }
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 11/12] x86/sched/itmt: Give all SMT siblings of a core the same priority
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (9 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 10/12] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
  2023-04-06 20:31 ` [PATCH v4 12/12] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors Ricardo Neri
  2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Ricardo Neri, Tim C . Chen

X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.
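
For reference, the formula this patch removes (visible in the diff below)
deliberately pushed secondary siblings to the end of the priority chain. A
standalone calculation shows the effect for an SMT2 core; prio = 100 is an
arbitrary sample value:

#include <stdio.h>

int main(void)
{
        int prio = 100;         /* sample boost-derived core priority */
        int siblings = 2;       /* SMT2 */
        int i;

        /* Old scheme: smt_prio = prio * smp_num_siblings / (i * i) */
        for (i = 1; i <= siblings; i++)
                printf("sibling %d: old prio %d, new prio %d\n",
                       i, prio * siblings / (i * i), prio);
        return 0;
}

This prints priorities of 200 and 50 for the two siblings under the old
scheme, versus a uniform 100 after the patch.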

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Reworded commit message for clarity.
---
 arch/x86/kernel/itmt.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 670eb08b972a..ee4fe8cdb857 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)
 
 /**
  * sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio:	Priority of cpu core
- * @core_cpu:	The cpu number associated with the core
+ * @prio:	Priority of @cpu
+ * @cpu:	The CPU number
  *
  * The pstate driver will find out the max boost frequency
  * and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
  * frequency will receive higher priority.
  *
  * No need to rebuild sched domain after updating
  * the CPU priorities. The sched domains have no
  * dependency on CPU priorities.
  */
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
 {
-	int cpu, i = 1;
-
-	for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
-		int smt_prio;
-
-		/*
-		 * Ensure that the siblings are moved to the end
-		 * of the priority chain and only used when
-		 * all other high priority cpus are out of capacity.
-		 */
-		smt_prio = prio * smp_num_siblings / (i * i);
-		per_cpu(sched_core_priority, cpu) = smt_prio;
-		i++;
-	}
+	per_cpu(sched_core_priority, cpu) = prio;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 12/12] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (10 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 11/12] x86/sched/itmt: Give all SMT siblings of a core the same priority Ricardo Neri
@ 2023-04-06 20:31 ` Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Chen Yu
  2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
  12 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-04-06 20:31 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Chen Yu, Tim C . Chen

From: Chen Yu <yu.c.chen@intel.com>

Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have a higher maximum frequency. Use the
SD_ASYM_PACKING flag to give higher priority to the die with CPUs of
higher maximum frequency.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Changes since v3:
 * Introduced this patch.

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/smpboot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index cf7f42189e86..8d966f618413 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -583,7 +583,7 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
                   ` (11 preceding siblings ...)
  2023-04-06 20:31 ` [PATCH v4 12/12] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors Ricardo Neri
@ 2023-04-29 15:32 ` Peter Zijlstra
  2023-05-01 18:30   ` Tim Chen
                     ` (2 more replies)
  12 siblings, 3 replies; 34+ messages in thread
From: Peter Zijlstra @ 2023-04-29 15:32 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Shrikanth Hegde, Srikar Dronamraju,
	naveen.n.rao

On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> Hi,
> 
> This is v4 of this series. Previous versions can be found in [1], [2],
> and [3]. To avoid duplication, I do not include the cover letter of the
> original submission; you can read it in [1].
> 
> This patchset applies cleanly on today's master branch of the tip tree.
> 
> Changes since v3:
> 
> Nobody liked the proposed changes to the setting of prefer_sibling.
> Instead, I tweaked the solution that Dietmar proposed. Now the busiest
> group, not the local group, determines the setting of prefer_sibling.
> 
> Vincent suggested improvements to the logic that decides whether to follow
> asym_packing priorities. Peter suggested wrapping that logic in a helper
> function; I added sched_use_asym_prio().
> 
> Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
> rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
> does not assume that all child domains have the requested flag.
> 
> Tim found that asym_active_balance() also needs to check the idle states
> of the SMT siblings of lb_env::dst_cpu. I added such a check.
> 
> I wrongly assumed that asym_packing could only be used when the busiest
> group had exactly one busy CPU. This broke asym_packing balancing at the
> DIE domain. I limited this check to balances between cores at the MC
> level.
> 
> As per Dietmar's suggestion, I removed asym_smt_can_pull_tasks() and
> placed its logic in sched_asym(). Also, sched_asym() uses
> sched_smt_active() to skip checks when they are not needed.
> 
> I also added a patch from Chen Yu to enable asym_packing balancing in
> Meteor Lake, which has CPUs of different maximum frequency in more than
> one die.

Is the actual topology of Meteor Lake already public? This patch made me
wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
but I can't remember (one of the raisins why the endless calls are such
a frigging waste of time) and I can't seem to find the answer using
Google either.

> Hopefully, these patches are in sufficiently good shape to be merged?

Changelogs are very sparse towards the end and I had to reverse engineer
some of it which is a shame. But yeah, on a first reading the code looks
mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
start to make a little sense. Mostly they utterly fail to answer the
very fundamental "why did you do this" question.

Also, you seem to have forgotten to Cc our friends from IBM such that
they might verify you didn't break their Power7 stuff -- or do you have
a Power7 yourself to verify and forgot to mention that?

> Chen Yu (1):
>   x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
>     processors
> 
> Ricardo Neri (11):
>   sched/fair: Move is_core_idle() out of CONFIG_NUMA
>   sched/fair: Only do asym_packing load balancing from fully idle SMT
>     cores
>   sched/fair: Simplify asym_packing logic for SMT cores
>   sched/fair: Let low-priority cores help high-priority busy SMT cores
>   sched/fair: Keep a fully_busy SMT sched group as busiest
>   sched/fair: Use the busiest group to set prefer_sibling
>   sched/fair: Do not even the number of busy CPUs via asym_packing
>   sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
>   sched/topology: Remove SHARED_CHILD from ASYM_PACKING
>   x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
>   x86/sched/itmt: Give all SMT siblings of a core the same priority
> 
>  arch/x86/kernel/itmt.c         |  23 +---
>  arch/x86/kernel/smpboot.c      |   4 +-
>  include/linux/sched/sd_flags.h |   5 +-
>  kernel/sched/fair.c            | 216 +++++++++++++++++----------------
>  kernel/sched/sched.h           |  22 +++-
>  5 files changed, 138 insertions(+), 132 deletions(-)

I'm going to start to queue this and hopefully push out post -rc1 if
nobody objects.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
@ 2023-05-01 18:30   ` Tim Chen
  2023-05-02  1:42   ` Ricardo Neri
  2023-05-12 18:23   ` Shrikanth Hegde
  2 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2023-05-01 18:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ricardo Neri
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao

On Sat, 2023-04-29 at 17:32 +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> > Hi,
> > 
> > This is v4 of this series. Previous versions can be found here [1], [2],
> > and here [3]. To avoid duplication, I do not include the cover letter of
> > the original submission. You can read it in [1].
> > 
> > This patchset applies cleanly on today's master branch of the tip tree.
> > 
> > Changes since v3:
> > 
> > Nobody liked the proposed changes to the setting of prefer_sibling.
> > Instead, I tweaked the solution that Dietmar proposed. Now the busiest
> > group, not the local group, determines the setting of prefer_sibling.
> > 
> > Vincent suggested improvements to the logic to decide whether to follow
> > asym_packing priorities. Peter suggested to wrap that in a helper function.
> > I added sched_use_asym_prio().
> > 
> > Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
> > rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
> > does not assume that all child domains have the requested flag.
> > 
> > Tim found that asym_active_balance() needs to also check for the idle
> > states of the SMT siblings of lb_env::dst_cpu. I added such check.
> > 
> > I wrongly assumed that asym_packing could only be used when the busiest
> > group had exactly one busy CPU. This broke asym_packing balancing at the
> > DIE domain. I limited this check to balances between cores at the MC
> > level.
> > 
> > As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
> > and placed its logic in sched_asym(). Also, sched_asym() uses
> > sched_smt_active() to skip checks when not needed.
> > 
> > I also added a patch from Chen Yu to enable asym_packing balancing in
> > Meteor Lake, which has CPUs of different maximum frequency in more than
> > one die.
> 
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.

There are a bunch of fixes that are needed for SCHED_CLUSTER to work
properly on hybrid_topology.  I'll clean them up and post them on
top of Ricardo's current patch set this week.

Tim

> 
> > Hopefully, these patches are in sufficiently good shape to be merged?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
  2023-05-01 18:30   ` Tim Chen
@ 2023-05-02  1:42   ` Ricardo Neri
  2023-05-02  1:52     ` Steven Rostedt
  2023-05-12 18:23   ` Shrikanth Hegde
  2 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-05-02  1:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Shrikanth Hegde, Srikar Dronamraju,
	naveen.n.rao

On Sat, Apr 29, 2023 at 05:32:19PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> > Hi,
> > 
> > This is v4 of this series. Previous versions can be found here [1], [2],
> > and here [3]. To avoid duplication, I do not include the cover letter of
> > the original submission. You can read it in [1].
> > 
> > This patchset applies cleanly on today's master branch of the tip tree.
> > 
> > Changes since v3:
> > 
> > Nobody liked the proposed changes to the setting of prefer_sibling.
> > Instead, I tweaked the solution that Dietmar proposed. Now the busiest
> > group, not the local group, determines the setting of prefer_sibling.
> > 
> > Vincent suggested improvements to the logic to decide whether to follow
> > asym_packing priorities. Peter suggested to wrap that in a helper function.
> > I added sched_use_asym_prio().
> > 
> > Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
> > rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
> > does not assume that all child domains have the requested flag.
> > 
> > Tim found that asym_active_balance() needs to also check for the idle
> > states of the SMT siblings of lb_env::dst_cpu. I added such check.
> > 
> > I wrongly assumed that asym_packing could only be used when the busiest
> > group had exactly one busy CPU. This broke asym_packing balancing at the
> > DIE domain. I limited this check to balances between cores at the MC
> > level.
> > 
> > As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
> > and placed its logic in sched_asym(). Also, sched_asym() uses
> > sched_smt_active() to skip checks when not needed.
> > 
> > I also added a patch from Chen Yu to enable asym_packing balancing in
> > Meteor Lake, which has CPUs of different maximum frequency in more than
> > one die.
> 
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,

Indeed, Meteor Lake will need SCHED_CLUSTER as does Alder Lake. This is in
addition to multi-die support.


> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.
> 
> > Hopefully, these patches are in sufficiently good shape to be merged?
> 
> Changelogs are very sparse towards the end and I had to reverse engineer
> some of it which is a shame. But yeah, on a first reading the code looks
> mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
> start to make a little sense. Mostly they utterly fail to answer the
> very fundamental "why did you do this" question.

I am sorry the changelogs are not sufficiently clear. I thought stating the
overall goal in the cover letter was enough. In the future, would you
prefer that I repeat the cover letter instead of referring to it? Should
individual changelogs state the overall goal?

> 
> Also, you seem to have forgotten to Cc our friends from IBM such that
> they might verify you didn't break their Power7 stuff -- or do you have
> a Power7 yourself to verify and forgot to mention that?

I do not have a Power7 system. I did emulate it on an x86 system by
setting all cores with identical sg->asym_prefer_cpu. Within cores, SMT
siblings had asymmetric priorities. It was only SMT2, though.
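
[ A rough sketch of that emulation -- hypothetical code, not part of the
  series, assuming an SMT2 system whose siblings are numbered as even/odd
  pairs and using the per-CPU sched_set_itmt_core_prio() from patch 11:

	int cpu;

	/* Give every core the same pair of sibling priorities. */
	for_each_cpu(cpu, cpu_online_mask) {
		/* assumption: odd CPUs are the second sibling of a core */
		sched_set_itmt_core_prio((cpu & 1) ? 1 : 2, cpu);
	}
	sched_set_itmt_support();

  With identical priorities across cores, every sg->asym_prefer_cpu resolves
  to the same core priority, mimicking Power7. ]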

> 
> > Chen Yu (1):
> >   x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
> >     processors
> > 
> > Ricardo Neri (11):
> >   sched/fair: Move is_core_idle() out of CONFIG_NUMA
> >   sched/fair: Only do asym_packing load balancing from fully idle SMT
> >     cores
> >   sched/fair: Simplify asym_packing logic for SMT cores
> >   sched/fair: Let low-priority cores help high-priority busy SMT cores
> >   sched/fair: Keep a fully_busy SMT sched group as busiest
> >   sched/fair: Use the busiest group to set prefer_sibling
> >   sched/fair: Do not even the number of busy CPUs via asym_packing
> >   sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
> >   sched/topology: Remove SHARED_CHILD from ASYM_PACKING
> >   x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
> >   x86/sched/itmt: Give all SMT siblings of a core the same priority
> > 
> >  arch/x86/kernel/itmt.c         |  23 +---
> >  arch/x86/kernel/smpboot.c      |   4 +-
> >  include/linux/sched/sd_flags.h |   5 +-
> >  kernel/sched/fair.c            | 216 +++++++++++++++++----------------
> >  kernel/sched/sched.h           |  22 +++-
> >  5 files changed, 138 insertions(+), 132 deletions(-)
> 
> I'm going to start to queue this and hopefully push out post -rc1 if
> nobody objects.

Thanks! Will this be merged in v6.4 or v6.5?

BR,
Ricardo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-05-02  1:42   ` Ricardo Neri
@ 2023-05-02  1:52     ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2023-05-02  1:52 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V. Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J. Wysocki,
	Srinivas Pandruvada, Tim Chen, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao

On Mon, 1 May 2023 18:42:55 -0700
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> wrote:

> I am sorry changelogs are not sufficiently clear. I thought stating the
> overall goal in the cover letter was enough. In the future, would you
> prefer that I repeat the cover letter instead of referring to it? Should
> individual changelogs state the overall goal?

Yes. Every commit should have a change log that explains why that commit
was done, without the reader having to look elsewhere.

The cover letter should be the summary of what the patches do. But 5 to 10
years from now, when a git bisect comes across a commit, there's no
guarantee that a cover letter will be easily found. The change log may be
the only thing a developer debugging some code will have to understand what
the change was for.

I'm guilty of writing poor change logs, which I myself have suffered from
when I can't remember what I meant. So I've been more adamant about adding
more detail to my change logs, which has saved me a few times.

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [tip: sched/core] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
  2023-04-06 20:31 ` [PATCH v4 12/12] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Chen Yu
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Chen Yu @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Chen Yu, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     044f0e27dec6e30bb8875a4a12c5f2594964e93f
Gitweb:        https://git.kernel.org/tip/044f0e27dec6e30bb8875a4a12c5f2594964e93f
Author:        Chen Yu <yu.c.chen@intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:48 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:38 +02:00

x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors

Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have a higher maximum frequency. Use the
SD_ASYM_PACKING flag to give higher priority to the die with CPUs of higher
maximum frequency.

Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
---
 arch/x86/kernel/smpboot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a335abd..34066f6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -583,7 +583,7 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] x86/sched/itmt: Give all SMT siblings of a core the same priority
  2023-04-06 20:31 ` [PATCH v4 11/12] x86/sched/itmt: Give all SMT siblings of a core the same priority Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel),
	Len Brown, Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     046a5a95c3b0425cfe79e43021d8ee90c1c4f8c9
Gitweb:        https://git.kernel.org/tip/046a5a95c3b0425cfe79e43021d8ee90c1c4f8c9
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:47 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:38 +02:00

x86/sched/itmt: Give all SMT siblings of a core the same priority

X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-12-ricardo.neri-calderon@linux.intel.com
---
 arch/x86/kernel/itmt.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 670eb08..ee4fe8c 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)
 
 /**
  * sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio:	Priority of cpu core
- * @core_cpu:	The cpu number associated with the core
+ * @prio:	Priority of @cpu
+ * @cpu:	The CPU number
  *
  * The pstate driver will find out the max boost frequency
  * and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
  * frequency will receive higher priority.
  *
  * No need to rebuild sched domain after updating
  * the CPU priorities. The sched domains have no
  * dependency on CPU priorities.
  */
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
 {
-	int cpu, i = 1;
-
-	for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
-		int smt_prio;
-
-		/*
-		 * Ensure that the siblings are moved to the end
-		 * of the priority chain and only used when
-		 * all other high priority cpus are out of capacity.
-		 */
-		smt_prio = prio * smp_num_siblings / (i * i);
-		per_cpu(sched_core_priority, cpu) = smt_prio;
-		i++;
-	}
+	per_cpu(sched_core_priority, cpu) = prio;
 }
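
[ Concretely: with the removed loop, an SMT2 core passed prio = 100 ended up
  with sched_core_priority values of 100 * 2 / 1 = 200 for the first sibling
  and 100 * 2 / 4 = 50 for the second. After this patch both siblings simply
  get 100. ]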

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/topology: Remove SHARED_CHILD from ASYM_PACKING
  2023-04-06 20:31 ` [PATCH v4 09/12] sched/topology: Remove SHARED_CHILD from ASYM_PACKING Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, Ricardo Neri, Peter Zijlstra (Intel),
	Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ca528cc501896a808dc79c3c0544369d23b331c8
Gitweb:        https://git.kernel.org/tip/ca528cc501896a808dc79c3c0544369d23b331c8
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:45 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:37 +02:00

sched/topology: Remove SHARED_CHILD from ASYM_PACKING

Only x86 and Power7 use ASYM_PACKING. They use it differently.

Power7 has cores of equal priority, but the SMT siblings of a core have
different priorities. Parent scheduling domains do not need (nor have) the
ASYM_PACKING flag. SHARED_CHILD is not needed. Using SHARED_PARENT would
cause the topology debug code to complain.

X86 has cores of different priority, but all the SMT siblings of the core
have equal priority. It needs ASYM_PACKING at the MC level, but not at the
SMT level (it also needs it at upper levels if they have scheduling groups
of different priority). Removing ASYM_PACKING from the SMT domain causes
the topology debug code to complain.

Remove SHARED_CHILD for now. We still need a topology check that satisfies
both architectures.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-10-ricardo.neri-calderon@linux.intel.com
---
 include/linux/sched/sd_flags.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66..fad77b5 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
 /*
  * Place busy tasks earlier in the domain
  *
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- *               up, but currently assumed to be set from the base domain
- *               upwards (see update_top_cache_domain()).
  * NEEDS_GROUPS: Load balancing flag.
  */
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
 
 /*
  * Prefer to place tasks in a sibling domain

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
  2023-04-06 20:31 ` [PATCH v4 10/12] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel), Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     995998ebdebd09b85c28cc46068d8a0744113837
Gitweb:        https://git.kernel.org/tip/995998ebdebd09b85c28cc46068d8a0744113837
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:46 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:37 +02:00

x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags

There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com
---
 arch/x86/kernel/smpboot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce..a335abd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -552,7 +552,7 @@ static int x86_core_flags(void)
 #ifdef CONFIG_SCHED_SMT
 static int x86_smt_flags(void)
 {
-	return cpu_smt_flags() | x86_sched_itmt_flags();
+	return cpu_smt_flags();
 }
 #endif
 #ifdef CONFIG_SCHED_CLUSTER

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
  2023-04-06 20:31 ` [PATCH v4 08/12] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain() Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ionela Voinescu, Ricardo Neri, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     40b4d3dc328265c8ec6688657d74813edf785c83
Gitweb:        https://git.kernel.org/tip/40b4d3dc328265c8ec6688657d74813edf785c83
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:44 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:36 +02:00

sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()

Do not assume that all the children of a scheduling domain have a given
flag. Check whether the flag has the SDF_SHARED_CHILD meta flag.

Suggested-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-9-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/sched.h | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0..6784462 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1772,6 +1772,13 @@ queue_balance_callback(struct rq *rq,
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)
 
+/* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */
+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to
@@ -1779,16 +1786,25 @@ queue_balance_callback(struct rq *rq,
  * @flag:	The flag to check for the highest sched_domain
  *		for the given CPU.
  *
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains @flag. If @flag has
+ * the SDF_SHARED_CHILD metaflag, all the children domains also have @flag.
  */
 static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 {
 	struct sched_domain *sd, *hsd = NULL;
 
 	for_each_domain(cpu, sd) {
-		if (!(sd->flags & flag))
+		if (sd->flags & flag) {
+			hsd = sd;
+			continue;
+		}
+
+		/*
+		 * Stop the search if @flag is known to be shared at lower
+		 * levels. It will not be found further up.
+		 */
+		if (flag & SD_SHARED_CHILD_MASK)
 			break;
-		hsd = sd;
 	}
 
 	return hsd;
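
[ The SD_SHARED_CHILD_MASK definition above is an X-macro trick: sd_flags.h
  declares every flag with SD_FLAG(name, mflags), so redefining SD_FLAG
  before including the header turns the whole file into a single OR
  expression. A minimal standalone sketch of the same pattern, with invented
  flags and values:

	#define SDF_SHARED_CHILD 0x1
	#define SD_FOO 0x2	/* has the metaflag */
	#define SD_BAR 0x4	/* does not */

	#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
	static const unsigned int MASK =
	SD_FLAG(SD_FOO, SDF_SHARED_CHILD)
	SD_FLAG(SD_BAR, 0)
	0;	/* expands to (0x2 * 1) | (0x4 * 0) | 0 == 0x2 */
	#undef SD_FLAG
  ]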

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Do not even the number of busy CPUs via asym_packing
  2023-04-06 20:31 ` [PATCH v4 07/12] sched/fair: Do not even the number of busy CPUs via asym_packing Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel), Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c9ca07886aaa40225a29e5c1e46ac31d2e14f53a
Gitweb:        https://git.kernel.org/tip/c9ca07886aaa40225a29e5c1e46ac31d2e14f53a
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:43 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:36 +02:00

sched/fair: Do not even the number of busy CPUs via asym_packing

Now that find_busiest_group() triggers load balancing between a fully_busy
SMT2 core and an idle non-SMT core, there is no longer a need to force
balancing via asym_packing. Use asym_packing only as intended: when there
is a high-priority CPU that is idle.

After this change, the same logic applies to SMT and non-SMT local groups.
It makes little sense to have a separate function to deal specifically with
SMT. Fold the logic of asym_smt_can_pull_tasks() into sched_asym().

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-8-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 86 ++++++++++----------------------------------
 1 file changed, 21 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bb8934..48b6f0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9350,74 +9350,26 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
 }
 
 /**
- * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
- * @dst_cpu:	Destination CPU of the load balancing
+ * sched_asym - Check if the destination CPU can do asym_packing load balance
+ * @env:	The load balancing environment
  * @sds:	Load-balancing data with statistics of the local group
  * @sgs:	Load-balancing statistics of the candidate busiest group
- * @sg:		The candidate busiest group
+ * @group:	The candidate busiest group
  *
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * @env::dst_cpu can do asym_packing if it has higher priority than the
+ * preferred CPU of @group.
  *
- * This function must be called only if all the SMT siblings of @dst_cpu are
- * idle, if any.
+ * SMT is a special case. If we are balancing load between cores, @env::dst_cpu
+ * can do asym_packing balance only if all its SMT siblings are idle. Also, it
+ * can only do it if @group is an SMT group and has exactly one busy CPU. Larger
+ * imbalances in the number of CPUs are dealt with in find_busiest_group().
  *
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
+ * If we are balancing load within an SMT core, or at DIE domain level, always
+ * proceed.
  *
- * When dealing with SMT cores, only use priorities if the SMT core has exactly
- * one busy sibling. find_busiest_group() will handle bigger imbalances in the
- * number of busy CPUs.
- *
- * Return: true if @dst_cpu can pull tasks, false otherwise.
+ * Return: true if @env::dst_cpu can do asym_packing load balance. False
+ * otherwise.
  */
-static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
-				    struct sg_lb_stats *sgs,
-				    struct sched_group *sg)
-{
-#ifdef CONFIG_SCHED_SMT
-	bool local_is_smt;
-	int sg_busy_cpus;
-
-	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
-
-	if (!local_is_smt) {
-		/*
-		 * If we are here, @dst_cpu is idle and does not have SMT
-		 * siblings. Pull tasks if candidate group has two or more
-		 * busy CPUs.
-		 */
-		if (sg_busy_cpus >= 2) /* implies sg_is_smt */
-			return true;
-
-		/*
-		 * @dst_cpu does not have SMT siblings. @sg may have SMT
-		 * siblings and only one is busy. In such case, @dst_cpu
-		 * can help if it has higher priority and is idle (i.e.,
-		 * it has no running tasks).
-		 */
-		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-	}
-
-	/*
-	 * If we are here, @dst_cpu has SMT siblings and they are also idle.
-	 *
-	 * CPU priorities do not make sense for SMT cores with more than one
-	 * busy sibling.
-	 */
-	if (group->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
-		return false;
-
-	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-#else
-	/* Always return false so that callers deal with non-SMT cases. */
-	return false;
-#endif
-}
-
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
@@ -9426,10 +9378,14 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
 	if (!sched_use_asym_prio(env->sd, env->dst_cpu))
 		return false;
 
-	/* Only do SMT checks if either local or candidate have SMT siblings. */
-	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
-	    (group->flags & SD_SHARE_CPUCAPACITY))
-		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+	/*
+	 * CPU priorities do not make sense for SMT cores with more than one
+	 * busy sibling.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY) {
+		if (sgs->group_weight - sgs->idle_cpus != 1)
+			return false;
+	}
 
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
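
[ Worked example of the new check: an SMT4 candidate group with
  group_weight = 4 and idle_cpus = 2 has two busy siblings, so sched_asym()
  now returns false without consulting priorities. Only with idle_cpus = 3,
  i.e. exactly one busy sibling, does it fall through to the
  sched_asym_prefer() comparison against group->asym_prefer_cpu. ]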

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Use the busiest group to set prefer_sibling
  2023-04-06 20:31 ` [PATCH v4 06/12] sched/fair: Use the busiest group to set prefer_sibling Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Ricardo Neri, Peter Zijlstra (Intel),
	Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     43726bdedd29797d8e1fee2e7300a6d2b9a74ba8
Gitweb:        https://git.kernel.org/tip/43726bdedd29797d8e1fee2e7300a6d2b9a74ba8
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:42 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00

sched/fair: Use the busiest group to set prefer_sibling

The prefer_sibling setting acts on the busiest group to move excess tasks
to the local group. This should be done at the request of the child of the
busiest group's sched domain, not the local group's.

Using the flags of the child domain of the local group works fortuitously
if both groups have child domains.

There are cases, however, in which the busiest group's sched domain has a
child but the local group's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING of the busiest group's child
domain will not be honored. We are left with a fully busy SMT core and an
idle non-SMT core.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-7-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a9f040..3bb8934 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10109,7 +10109,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10150,8 +10149,13 @@ next_group:
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	/*
+	 * Indicate that the child domain of the busiest group prefers tasks
+	 * go to a child's sibling domains first. NB the flags of a sched group
+	 * are those of the child domain.
+	 */
+	if (sds->busiest)
+		sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
 
 
 	if (env->sd->flags & SD_NUMA)
@@ -10461,7 +10465,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 			goto out_balanced;
 	}
 
-	/* Try to move all excess tasks to child's sibling domain */
+	/*
+	 * Try to move all excess tasks to a sibling domain of the busiest
+	 * group's child domain.
+	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;
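
[ The "NB" in the new comment is the key subtlety: a sched group's ->flags
  are those of the child domain it spans. At the MC level, a group covering
  an SMT core therefore carries the SMT domain's flags, SD_PREFER_SIBLING
  included when it is set there, whereas the local group of a non-SMT core
  has no child domain at all -- exactly the case the old env->sd->child
  check got wrong. ]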

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Keep a fully_busy SMT sched group as busiest
  2023-04-06 20:31 ` [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  2023-05-12 18:41   ` [PATCH v4 05/12] " Shrikanth Hegde
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel), Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5fd6d7f43958cb62da105c8413eac3e78480f09a
Gitweb:        https://git.kernel.org/tip/5fd6d7f43958cb62da105c8413eac3e78480f09a
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:41 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00

sched/fair: Keep a fully_busy SMT sched group as busiest

When comparing two fully_busy scheduling groups, keep the current busiest
group if it represents an SMT core. Tasks in such a scheduling group share
CPU resources and need more help than tasks in a non-SMT fully_busy group.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-6-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85ce249..4a9f040 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9619,10 +9619,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * contention when accessing shared HW resources.
 		 *
 		 * XXX for now avg_load is not computed and always 0 so we
-		 * select the 1st one.
+		 * select the 1st one, except if @sg is composed of SMT
+		 * siblings.
 		 */
-		if (sgs->avg_load <= busiest->avg_load)
+
+		if (sgs->avg_load < busiest->avg_load)
 			return false;
+
+		if (sgs->avg_load == busiest->avg_load) {
+			/*
+			 * SMT sched groups need more help than non-SMT groups.
+			 * If @sg happens to also be SMT, either choice is good.
+			 */
+			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+				return false;
+		}
+
 		break;
 
 	case group_has_spare:
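
[ Note that avg_load is not computed for fully_busy groups here (see the XXX
  above), so both values are normally 0 and the equality branch is the common
  case: an SMT busiest group is kept, while a non-SMT busiest group is
  replaced by the candidate. Fully busy SMT cores thus win the tie-break and
  are preferred as the source of pulled tasks. ]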

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Let low-priority cores help high-priority busy SMT cores
  2023-04-06 20:31 ` [PATCH v4 04/12] sched/fair: Let low-priority cores help high-priority busy " Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, Ricardo Neri, Peter Zijlstra (Intel),
	Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     18ad34532755feb5b9f4284b07769b1bfec18ab3
Gitweb:        https://git.kernel.org/tip/18ad34532755feb5b9f4284b07769b1bfec18ab3
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:40 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:35 +02:00

sched/fair: Let low-priority cores help high-priority busy SMT cores

Using asym_packing priorities within an SMT core is straightforward. Just
follow the priorities that hardware indicates.

When balancing load from an SMT core, also consider the idle state of its
siblings. Priorities do not reflect that an SMT core divides its throughput
among all its busy siblings. They only make sense when exactly one sibling
is busy.

Indicate that active balance is needed if the destination CPU has lower
priority than the source CPU but the latter has busy SMT siblings.

Make find_busiest_queue() not skip higher-priority SMT cores with more than
one busy sibling.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-5-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8a02ae..85ce249 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10551,8 +10551,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    nr_running == 1)
 			continue;
 
-		/* Make sure we only pull tasks from a CPU of lower priority */
+		/*
+		 * Make sure we only pull tasks from a CPU of lower priority
+		 * when balancing between SMT siblings.
+		 *
+		 * If balancing between cores, let lower priority CPUs help
+		 * SMT cores with more than one busy sibling.
+		 */
 		if ((env->sd->flags & SD_ASYM_PACKING) &&
+		    sched_use_asym_prio(env->sd, i) &&
 		    sched_asym_prefer(i, env->dst_cpu) &&
 		    nr_running == 1)
 			continue;
@@ -10645,10 +10652,15 @@ asym_active_balance(struct lb_env *env)
 	 * priority CPUs in order to pack all tasks in the highest priority
	 * CPUs. When done between cores, do it only if the whole
	 * core is idle.
+	 *
+	 * If @env::src_cpu is an SMT core with busy siblings, let
+	 * the lower priority @env::dst_cpu help it. Do not follow
+	 * CPU priority.
 	 */
 	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
 	       sched_use_asym_prio(env->sd, env->dst_cpu) &&
-	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
+	       (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+		!sched_use_asym_prio(env->sd, env->src_cpu));
 }
 
 static inline bool
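
[ Putting the two hunks together: provided the destination is idle and its
  own core passes sched_use_asym_prio(), active balance now fires in two
  cases -- @env::dst_cpu has higher priority than @env::src_cpu (the classic
  asym_packing pull), or sched_use_asym_prio() is false for @env::src_cpu,
  meaning the source sits in an SMT core with busy siblings, so even a
  lower-priority destination may relieve it. ]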

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Simplify asym_packing logic for SMT cores
  2023-04-06 20:31 ` [PATCH v4 03/12] sched/fair: Simplify asym_packing logic for " Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel),
	Len Brown, Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ef7657d4d2d6a8456aa624010de456c32a135fe9
Gitweb:        https://git.kernel.org/tip/ef7657d4d2d6a8456aa624010de456c32a135fe9
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:39 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:34 +02:00

sched/fair: Simplify asym_packing logic for SMT cores

Callers of asym_smt_can_pull_tasks() check the idle state of the
destination CPU and its SMT siblings, if any. No extra checks are needed
in that function.

Since an SMT core divides its capacity among its siblings, priorities only
really make sense if exactly one sibling is active. This is true for SMT2,
SMT4, SMT8, etc. Do not use asym_packing load balance when more than one
sibling is busy. Instead, let find_busiest_group() handle such imbalances.

When balancing non-SMT cores or at higher scheduling domains (e.g.,
between MC scheduling groups), continue using priorities.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-4-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 713d03e..a8a02ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9366,12 +9366,9 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
  * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
  * only if @dst_cpu has higher priority.
  *
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
- *
- * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
+ * When dealing with SMT cores, only use priorities if the SMT core has exactly
+ * one busy sibling. find_busiest_group() will handle bigger imbalances in the
+ * number of busy CPUs.
  *
  * Return: true if @dst_cpu can pull tasks, false otherwise.
  */
@@ -9380,12 +9377,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 				    struct sched_group *sg)
 {
 #ifdef CONFIG_SCHED_SMT
-	bool local_is_smt, sg_is_smt;
+	bool local_is_smt;
 	int sg_busy_cpus;
 
 	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
 	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
 
 	if (!local_is_smt) {
@@ -9406,21 +9401,17 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 	}
 
-	/* @dst_cpu has SMT siblings. */
-
-	if (sg_is_smt) {
-		int local_busy_cpus = sds->local->group_weight -
-				      sds->local_stat.idle_cpus;
-		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
-		if (busy_cpus_delta == 1)
-			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
+	/*
+	 * If we are here, @dst_cpu has SMT siblings and they are also idle.
+	 *
+	 * CPU priorities do not make sense for SMT cores with more than one
+	 * busy sibling.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY && sg_busy_cpus != 1)
 		return false;
-	}
 
-	/* If we are here, @dst_cpu has SMT siblings and they are also idle. */
 	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
 #else
 	/* Always return false so that callers deal with non-SMT cases. */
 	return false;
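
[ A concrete example of why one busy sibling matters: an SMT2 core of
  priority 100 running two tasks gives each task roughly half of the core's
  throughput, so the core's priority overstates what either task gets. With
  exactly one busy sibling, that task has the whole core and comparing
  priorities against another core is meaningful again. ]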

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Move is_core_idle() out of CONFIG_NUMA
  2023-04-06 20:31 ` [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ricardo Neri, Peter Zijlstra (Intel), Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8b36d07f1d63de102d464f44a89704bc62d00811
Gitweb:        https://git.kernel.org/tip/8b36d07f1d63de102d464f44a89704bc62d00811
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:37 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:33 +02:00

sched/fair: Move is_core_idle() out of CONFIG_NUMA

asym_packing needs this function to determine whether an SMT core is a
suitable destination for load balancing.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-2-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f..a47208d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	int sibling;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+
+		if (!idle_cpu(sibling))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
 #ifdef CONFIG_NUMA
 #define NUMA_IMBALANCE_MIN 2
 
@@ -1700,23 +1717,6 @@ struct numa_stats {
 	int idle_cpu;
 };
 
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
-	int sibling;
-
-	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
-		if (cpu == sibling)
-			continue;
-
-		if (!idle_cpu(sibling))
-			return false;
-	}
-#endif
-
-	return true;
-}
-
 struct task_numa_env {
 	struct task_struct *p;
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Only do asym_packing load balancing from fully idle SMT cores
  2023-04-06 20:31 ` [PATCH v4 02/12] sched/fair: Only do asym_packing load balancing from fully idle SMT cores Ricardo Neri
@ 2023-05-10 13:49   ` tip-bot2 for Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Ricardo Neri @ 2023-05-10 13:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, Ricardo Neri, Peter Zijlstra (Intel),
	Zhang Rui, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     eefefa716c9fa6aa73159f09954b7eeba4cafd09
Gitweb:        https://git.kernel.org/tip/eefefa716c9fa6aa73159f09954b7eeba4cafd09
Author:        Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
AuthorDate:    Thu, 06 Apr 2023 13:31:38 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:34 +02:00

sched/fair: Only do asym_packing load balancing from fully idle SMT cores

When balancing load between cores, all the SMT siblings of the destination
CPU, if any, must be idle. Otherwise, pulling new tasks degrades the
throughput of the busy SMT siblings while the overall throughput of the
system remains the same.

When balancing load within an SMT core, this consideration is not relevant.
Follow the priorities that hardware indicates.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-3-ricardo.neri-calderon@linux.intel.com
---
 kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a47208d..713d03e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9331,6 +9331,25 @@ group_type group_classify(unsigned int imbalance_pct,
 }
 
 /**
+ * sched_use_asym_prio - Check whether asym_packing priority must be used
+ * @sd:		The scheduling domain of the load balancing
+ * @cpu:	A CPU
+ *
+ * Always use CPU priority when balancing load between SMT siblings. When
+ * balancing load between cores, it is not sufficient that @cpu is idle. Only
+ * use CPU priority if the whole core is idle.
+ *
+ * Returns: True if the priority of @cpu must be followed. False otherwise.
+ */
+static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
+{
+	if (!sched_smt_active())
+		return true;
+
+	return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+}
+
+/**
  * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
  * @dst_cpu:	Destination CPU of the load balancing
  * @sds:	Load-balancing data with statistics of the local group
@@ -9340,6 +9359,9 @@ group_type group_classify(unsigned int imbalance_pct,
  * Check the state of the SMT siblings of both @sds::local and @sg and decide
  * if @dst_cpu can pull tasks.
  *
+ * This function must be called only if all the SMT siblings of @dst_cpu are
+ * idle, if any.
+ *
  * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
  * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
  * only if @dst_cpu has higher priority.
@@ -9349,8 +9371,7 @@ group_type group_classify(unsigned int imbalance_pct,
  * Bigger imbalances in the number of busy CPUs will be dealt with in
  * update_sd_pick_busiest().
  *
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
+ * If @sg does not have SMT siblings, only pull tasks if @sg has lower priority.
  *
  * Return: true if @dst_cpu can pull tasks, false otherwise.
  */
@@ -9398,15 +9419,8 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 		return false;
 	}
 
-	/*
-	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
-	 * up with more than one busy SMT sibling and only pull tasks if there
-	 * are not busy CPUs (i.e., no CPU has running tasks).
-	 */
-	if (!sds->local_stat.sum_nr_running)
-		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-	return false;
+	/* If we are here, @dst_cpu has SMT siblings and they are also idle. */
+	return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 #else
 	/* Always return false so that callers deal with non-SMT cases. */
 	return false;
@@ -9417,7 +9431,11 @@ static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
 {
-	/* Only do SMT checks if either local or candidate have SMT siblings */
+	/* Ensure that the whole local core is idle, if applicable. */
+	if (!sched_use_asym_prio(env->sd, env->dst_cpu))
+		return false;
+
+	/* Only do SMT checks if either local or candidate have SMT siblings. */
 	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
 	    (group->flags & SD_SHARE_CPUCAPACITY))
 		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
@@ -10632,11 +10650,13 @@ static inline bool
 asym_active_balance(struct lb_env *env)
 {
 	/*
-	 * ASYM_PACKING needs to force migrate tasks from busy but
-	 * lower priority CPUs in order to pack all tasks in the
-	 * highest priority CPUs.
+	 * ASYM_PACKING needs to force migrate tasks from busy but lower
+	 * priority CPUs in order to pack all tasks in the highest priority
+	 * CPUs. When done between cores, do it only if the whole
+	 * core is idle.
 	 */
 	return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
+	       sched_use_asym_prio(env->sd, env->dst_cpu) &&
 	       sched_asym_prefer(env->dst_cpu, env->src_cpu);
 }
 
@@ -11371,9 +11391,13 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_PACKING; see if there's a more preferred CPU
 		 * currently idle; in which case, kick the ILB to move tasks
 		 * around.
+		 *
+		 * When balancing between cores, all the SMT siblings of the
+		 * preferred CPU must be idle.
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
-			if (sched_asym_prefer(i, cpu)) {
+			if (sched_use_asym_prio(sd, i) &&
+			    sched_asym_prefer(i, cpu)) {
 				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 				goto unlock;
 			}
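
[ In short, sched_use_asym_prio() makes CPU priority conditional on scope:
  within an SMT core (SD_SHARE_CPUCAPACITY) it always applies, while between
  cores it applies only when is_core_idle() holds for the CPU in question
  (and trivially whenever SMT is not active). For example, an MC-level
  destination CPU that is itself idle but has one busy SMT sibling now fails
  the check and will not start an asym_packing pull. ]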

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
  2023-05-01 18:30   ` Tim Chen
  2023-05-02  1:42   ` Ricardo Neri
@ 2023-05-12 18:23   ` Shrikanth Hegde
  2023-05-19  0:03     ` Ricardo Neri
  2 siblings, 1 reply; 34+ messages in thread
From: Shrikanth Hegde @ 2023-05-12 18:23 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Srikar Dronamraju, naveen.n.rao,
	Peter Zijlstra



On 4/29/23 9:02 PM, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
>> Hi,
>>
>> This is v4 of this series. Previous versions can be found here [1], [2],
>> and here [3]. To avoid duplication, I do not include the cover letter of
>> the original submission. You can read it in [1].
>>
>> This patchset applies cleanly on today's master branch of the tip tree.
>>
>> Changes since v3:
>>
>> Nobody liked the proposed changes to the setting of prefer_sibling.
>> Instead, I tweaked the solution that Dietmar proposed. Now the busiest
>> group, not the local group, determines the setting of prefer_sibling.
>>
>> Vincent suggested improvements to the logic to decide whether to follow
>> asym_packing priorities. Peter suggested to wrap that in a helper function.
>> I added sched_use_asym_prio().
>>
>> Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
>> rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
>> does not assume that all child domains have the requested flag.
>>
>> Tim found that asym_active_balance() needs to also check for the idle
>> states of the SMT siblings of lb_env::dst_cpu. I added such check.
>>
>> I wrongly assumed that asym_packing could only be used when the busiest
>> group had exactly one busy CPU. This broke asym_packing balancing at the
>> DIE domain. I limited this check to balances between cores at the MC
>> level.
>>
>> As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
>> and placed its logic in sched_asym(). Also, sched_asym() uses
>> sched_smt_active() to skip checks when not needed.
>>
>> I also added a patch from Chen Yu to enable asym_packing balancing in
>> Meteor Lake, which has CPUs of different maximum frequency in more than
>> one die.
> 
> Is the actual topology of Meteor Lake already public? This patch made me
> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
> but I can't remember (one of the raisins why the endless calls are such
> a frigging waste of time) and I can't seem to find the answer using
> Google either.
> 
>> Hopefully, these patches are in sufficiently good shape to be merged?
> 
> Changelogs are very sparse towards the end and I had to reverse engineer
> some of it which is a shame. But yeah, on a first reading the code looks
> mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
> start to make a little sense. Mostly they utterly fail to answer the
> very fundament "why did you do this" question.
> 
> Also, you seem to have forgotten to Cc our friends from IBM such that
> they might verify you didn't break their Power7 stuff -- or do you have
> a Power7 yourself to verify and forgot to mention that?

Very good patch series addressing asym packing. Interesting discussions as
well. It took me quite some time to get through it, understand it, and do a
little bit of testing.

Tested this patch series a bit on power7 with qemu, with SMT=4. The sched
domains show ASYM_PACKING present only for the SMT domain.

We don't see any regressions or gains due to the patch. SMT priorities are
honored when tasks are scheduled and load balanced.
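
One way to double-check the domain flags (assuming CONFIG_SCHED_DEBUG and a
mounted debugfs; the exact path may vary by kernel version) is:

  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/{name,flags}

which should list SD_ASYM_PACKING only for the SMT domain on this setup.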


> 
>> Chen Yu (1):
>>   x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
>>     processors
>>
>> Ricardo Neri (11):
>>   sched/fair: Move is_core_idle() out of CONFIG_NUMA
>>   sched/fair: Only do asym_packing load balancing from fully idle SMT
>>     cores
>>   sched/fair: Simplify asym_packing logic for SMT cores
>>   sched/fair: Let low-priority cores help high-priority busy SMT cores
>>   sched/fair: Keep a fully_busy SMT sched group as busiest
>>   sched/fair: Use the busiest group to set prefer_sibling
>>   sched/fair: Do not even the number of busy CPUs via asym_packing
>>   sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain()
>>   sched/topology: Remove SHARED_CHILD from ASYM_PACKING
>>   x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
>>   x86/sched/itmt: Give all SMT siblings of a core the same priority
>>
>>  arch/x86/kernel/itmt.c         |  23 +---
>>  arch/x86/kernel/smpboot.c      |   4 +-
>>  include/linux/sched/sd_flags.h |   5 +-
>>  kernel/sched/fair.c            | 216 +++++++++++++++++----------------
>>  kernel/sched/sched.h           |  22 +++-
>>  5 files changed, 138 insertions(+), 132 deletions(-)
> 
> I'm going to start to queue this and hopefully push out post -rc1 if
> nobody objects.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest
  2023-04-06 20:31 ` [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest Ricardo Neri
  2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
@ 2023-05-12 18:41   ` Shrikanth Hegde
  2023-05-19  0:01     ` Ricardo Neri
  1 sibling, 1 reply; 34+ messages in thread
From: Shrikanth Hegde @ 2023-05-12 18:41 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Tim C . Chen, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, shrikanth hegde, Srikar Dronamraju



On 4/7/23 2:01 AM, Ricardo Neri wrote:
> When comparing two fully_busy scheduling groups, keep the current busiest
> group if it represents an SMT core. Tasks in such scheduling group share
> CPU resources and need more help than tasks in a non-SMT fully_busy group.
>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: Zhang Rui <rui.zhang@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * None
>
> Changes since v2:
>  * Introduced this patch.
>
> Changes since v1:
>  * N/A
> ---
>  kernel/sched/fair.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b151e93ec316..ea23a5163bfa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9566,10 +9566,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		 * contention when accessing shared HW resources.
>  		 *
>  		 * XXX for now avg_load is not computed and always 0 so we
> -		 * select the 1st one.
> +		 * select the 1st one, except if @sg is composed of SMT
> +		 * siblings.
>  		 */
> -		if (sgs->avg_load <= busiest->avg_load)
> +
> +		if (sgs->avg_load < busiest->avg_load)
>  			return false;
> +
> +		if (sgs->avg_load == busiest->avg_load) {
> +			/*
> +			 * SMT sched groups need more help than non-SMT groups.
> +			 * If @sg happens to also be SMT, either choice is good.
> +			 */
> +			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> +				return false;
> +		}
> +
>  		break;

IIUC,

Earlier, we used to go to out_balanced if sgs->avg_load <= busiest->avg_load.
Now we go only if it is strictly less. Let's say sgs->avg_load == busiest->avg_load;
then we will return true in the MC and DIE domains. This might end up traversing
multiple such groups and picking the last one as the busiest instead of the
first. I guess any imbalance, if it exists, will eventually be fixed. But
this might cause slight overhead, wouldn't it?



nit: There is a typo in [2/12]: "if the whole core" is repeated.
+	 * CPUs. When done between cores, do it only if the whole core if the
+	 * whole core is idle.

Mentioning it in this reply instead, to avoid sending another mail just for this.


>  
>  	case group_has_spare:


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest
  2023-05-12 18:41   ` [PATCH v4 05/12] " Shrikanth Hegde
@ 2023-05-19  0:01     ` Ricardo Neri
  0 siblings, 0 replies; 34+ messages in thread
From: Ricardo Neri @ 2023-05-19  0:01 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Tim C . Chen, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Srikar Dronamraju

On Sat, May 13, 2023 at 12:11:45AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 4/7/23 2:01 AM, Ricardo Neri wrote:
> > When comparing two fully_busy scheduling groups, keep the current busiest
> > group if it represents an SMT core. Tasks in such scheduling group share
> > CPU resources and need more help than tasks in a non-SMT fully_busy group.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: x86@kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Tested-by: Zhang Rui <rui.zhang@intel.com>
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * None
> >
> > Changes since v2:
> >  * Introduced this patch.
> >
> > Changes since v1:
> >  * N/A
> > ---
> >  kernel/sched/fair.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b151e93ec316..ea23a5163bfa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9566,10 +9566,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		 * contention when accessing shared HW resources.
> >  		 *
> >  		 * XXX for now avg_load is not computed and always 0 so we
> > -		 * select the 1st one.
> > +		 * select the 1st one, except if @sg is composed of SMT
> > +		 * siblings.
> >  		 */
> > -		if (sgs->avg_load <= busiest->avg_load)
> > +
> > +		if (sgs->avg_load < busiest->avg_load)
> >  			return false;
> > +
> > +		if (sgs->avg_load == busiest->avg_load) {
> > +			/*
> > +			 * SMT sched groups need more help than non-SMT groups.
> > +			 * If @sg happens to also be SMT, either choice is good.
> > +			 */
> > +			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
> > +				return false;
> > +		}
> > +
> >  		break;
> 

Thank you very much for your review!

> IIUC,
> 
> Earlier, we used to go to out_balanced if sgs->avg_load <= busiest->avg_load.
> Now we go only if it is strictly less.

In this particular case we are comparing two fully_busy groups. Both
sgs->avg_load and busiest->avg_load are equal to 0.

> Let's say sgs->avg_load == busiest->avg_load;
> then we will return true in the MC and DIE domains. This might end up traversing
> multiple such groups and picking the last one as the busiest instead of the
> first.

Yes, that is correct. But we traverse all sched groups from
update_sd_lb_stats() anyway. We are here because both sgs and busiest are
of type fully_busy and we need to break a tie. Previously, on a tie, we
always kept the first group found as the busiest.
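
To make the tie-break concrete, here is a hypothetical stand-alone model of
the group_fully_busy branch above (the names and the is_smt flag are
illustrative, not the kernel's):

#include <stdbool.h>
#include <stdio.h>

struct group { const char *name; unsigned long avg_load; bool is_smt; };

/* Returns true if @sg should replace @busiest, as in the hunk above. */
static bool pick_new_busiest(const struct group *sg,
			     const struct group *busiest)
{
	if (sg->avg_load < busiest->avg_load)
		return false;		/* keep the current busiest */

	if (sg->avg_load == busiest->avg_load && busiest->is_smt)
		return false;		/* SMT groups need more help; keep it */

	return true;			/* @sg becomes the new busiest */
}

int main(void)
{
	struct group smt = { "smt-core",  0, true  };
	struct group a   = { "non-smt-A", 0, false };
	struct group b   = { "non-smt-B", 0, false };

	/* A tied non-SMT group never displaces an SMT busiest. */
	printf("%d\n", pick_new_busiest(&a, &smt));	/* prints 0 */

	/* Between tied non-SMT groups, the one visited last wins. */
	printf("%d\n", pick_new_busiest(&b, &a));	/* prints 1 */

	return 0;
}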

> I guess any imbalance, if it exists, will eventually be fixed. But
> this might cause slight overhead, wouldn't it?
> 
> 
> 
> nit: There is a typo in [2/12]: "if the whole core" is repeated.
> +	 * CPUs. When done between cores, do it only if the whole core if the
> +	 * whole core is idle.
> 
> Mentioning it in this reply instead, to avoid sending another mail just for this.

Ah! I read my patches dozens of times and I still missed this. Thank you
for noticing. I will post a trivial patch to fix it.

Thanks and BR,
Ricardo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-05-12 18:23   ` Shrikanth Hegde
@ 2023-05-19  0:03     ` Ricardo Neri
  2023-05-22  7:55       ` Shrikanth Hegde
  0 siblings, 1 reply; 34+ messages in thread
From: Ricardo Neri @ 2023-05-19  0:03 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Srikar Dronamraju, naveen.n.rao,
	Peter Zijlstra

On Fri, May 12, 2023 at 11:53:48PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 4/29/23 9:02 PM, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
> >> Hi,
> >>
> >> This is v4 of this series. Previous versions can be found here [1], [2],
> >> and here [3]. To avoid duplication, I do not include the cover letter of
> >> the original submission. You can read it in [1].
> >>
> >> This patchset applies cleanly on today's master branch of the tip tree.
> >>
> >> Changes since v3:
> >>
> >> Nobody liked the proposed changes to the setting of prefer_sibling.
> >> Instead, I tweaked the solution that Dietmar proposed. Now the busiest
> >> group, not the local group, determines the setting of prefer_sibling.
> >>
> >> Vincent suggested improvements to the logic to decide whether to follow
> >> asym_packing priorities. Peter suggested to wrap that in a helper function.
> >> I added sched_use_asym_prio().
> >>
> >> Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
> >> rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
> >> does not assume that all child domains have the requested flag.
> >>
> >> Tim found that asym_active_balance() needs to also check for the idle
> >> states of the SMT siblings of lb_env::dst_cpu. I added such check.
> >>
> >> I wrongly assumed that asym_packing could only be used when the busiest
> >> group had exactly one busy CPU. This broke asym_packing balancing at the
> >> DIE domain. I limited this check to balances between cores at the MC
> >> level.
> >>
> >> As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
> >> and placed its logic in sched_asym(). Also, sched_asym() uses
> >> sched_smt_active() to skip checks when not needed.
> >>
> >> I also added a patch from Chen Yu to enable asym_packing balancing in
> >> Meteor Lake, which has CPUs of different maximum frequency in more than
> >> one die.
> > 
> > Is the actual topology of Meteor Lake already public? This patch made me
> > wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
> > but I can't remember (one of the raisins why the endless calls are such
> > a frigging waste of time) and I can't seem to find the answer using
> > Google either.
> > 
> >> Hopefully, these patches are in sufficiently good shape to be merged?
> > 
> > Changelogs are very sparse towards the end and I had to reverse engineer
> > some of it which is a shame. But yeah, on a first reading the code looks
> > mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
> > start to make a little sense. Mostly they utterly fail to answer the
> > very fundament "why did you do this" question.
> > 
> > Also, you seem to have forgotten to Cc our friends from IBM such that
> > they might verify you didn't break their Power7 stuff -- or do you have
> > a Power7 yourself to verify and forgot to mention that?
> 
> Very good patch series addressing asym packing. Interesting discussions as
> well. It took me quite some time to get through it, understand it, and do a
> little bit of testing.
> 
> Tested this patch series a bit on power7 with qemu, with SMT=4. The sched
> domains show ASYM_PACKING present only for the SMT domain.
> 
> We don't see any regressions or gains due to the patch. SMT priorities are
> honored when tasks are scheduled and load balanced.

Thank you very much for your review and testing! Would you mind sharing the
qemu command you use? I would like to test my future patches on power7.

BR,
Ricardo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains
  2023-05-19  0:03     ` Ricardo Neri
@ 2023-05-22  7:55       ` Shrikanth Hegde
  0 siblings, 0 replies; 34+ messages in thread
From: Shrikanth Hegde @ 2023-05-22  7:55 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Srikar Dronamraju, naveen.n.rao,
	Peter Zijlstra



On 5/19/23 5:33 AM, Ricardo Neri wrote:
> On Fri, May 12, 2023 at 11:53:48PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 4/29/23 9:02 PM, Peter Zijlstra wrote:
>>> On Thu, Apr 06, 2023 at 01:31:36PM -0700, Ricardo Neri wrote:
>>>> Hi,
>>>>
>>>> This is v4 of this series. Previous versions can be found here [1], [2],
>>>> and here [3]. To avoid duplication, I do not include the cover letter of
>>>> the original submission. You can read it in [1].
>>>>
>>>> This patchset applies cleanly on today's master branch of the tip tree.
>>>>
>>>> Changes since v3:
>>>>
>>>> Nobody liked the proposed changes to the setting of prefer_sibling.
>>>> Instead, I tweaked the solution that Dietmar proposed. Now the busiest
>>>> group, not the local group, determines the setting of prefer_sibling.
>>>>
>>>> Vincent suggested improvements to the logic to decide whether to follow
>>>> asym_packing priorities. Peter suggested to wrap that in a helper function.
>>>> I added sched_use_asym_prio().
>>>>
>>>> Ionela found that removing SD_ASYM_PACKING from the SMT domain in x86
>>>> rendered sd_asym_packing NULL in SMT cores. Now highest_flag_domain()
>>>> does not assume that all child domains have the requested flag.
>>>>
>>>> Tim found that asym_active_balance() needs to also check for the idle
>>>> states of the SMT siblings of lb_env::dst_cpu. I added such check.
>>>>
>>>> I wrongly assumed that asym_packing could only be used when the busiest
>>>> group had exactly one busy CPU. This broke asym_packing balancing at the
>>>> DIE domain. I limited this check to balances between cores at the MC
>>>> level.
>>>>
>>>> As per suggestion from Dietmar, I removed sched_asym_smt_can_pull_tasks()
>>>> and placed its logic in sched_asym(). Also, sched_asym() uses
>>>> sched_smt_active() to skip checks when not needed.
>>>>
>>>> I also added a patch from Chen Yu to enable asym_packing balancing in
>>>> Meteor Lake, which has CPUs of different maximum frequency in more than
>>>> one die.
>>>
>>> Is the actual topology of Meteor Lake already public? This patch made me
>>> wonder if we need SCHED_CLUSTER topology in the hybrid_topology thing,
>>> but I can't remember (one of the raisins why the endless calls are such
>>> a frigging waste of time) and I can't seem to find the answer using
>>> Google either.
>>>
>>>> Hopefully, these patches are in sufficiently good shape to be merged?
>>>
>>> Changelogs are very sparse towards the end and I had to reverse engineer
>>> some of it which is a shame. But yeah, on a first reading the code looks
>>> mostly ok. Specifically 8-10 had me WTF a bit and only at 11 did it
>>> start to make a little sense. Mostly they utterly fail to answer the
>>> very fundament "why did you do this" question.
>>>
>>> Also, you seem to have forgotten to Cc our friends from IBM such that
>>> they might verify you didn't break their Power7 stuff -- or do you have
>>> a Power7 yourself to verify and forgot to mention that?
>>
>> Very good patch series addressing asym packing. Interesting discussions as
>> well. It took me quite some time to get through it, understand it, and do a
>> little bit of testing.
>>
>> Tested this patch series a bit on power7 with qemu, with SMT=4. The sched
>> domains show ASYM_PACKING present only for the SMT domain.
>>
>> We don't see any regressions or gains due to the patch. SMT priorities are
>> honored when tasks are scheduled and load balanced.
> 
> Thank you very much for your review and testing! Would you mind sharing the
> qemu command you use? I would like to test my future patches on power7.

Sure. I tried this qemu command on a Power9 system. It boots in POWER7
compat mode and simulates 4 cores with SMT4. You need to have
CONFIG_POWER7_CPU=y and CONFIG_CPU_BIG_ENDIAN=y, and the right qcow2 image.

qemu-system-ppc64 -M pseries,accel=kvm,max-cpu-compat=power7 -m 8192 -smp cores=4,threads=4 -enable-kvm  -drive file=<qcow2_image>,format=qcow2 -vga none -nographic -net nic -net user,hostfwd=tcp::2222-:22 -kernel ./<custom_vmlinux> --append "noautotest root=/dev/sda2"
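
(max-cpu-compat=power7 is what puts the POWER9 guest into POWER7
compatibility mode, -smp cores=4,threads=4 gives the 4-core/SMT4 topology,
and hostfwd=tcp::2222-:22 lets you ssh into the guest via host port 2222.)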


> 
> BR,
> Ricardo

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-05-22  7:58 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06 20:31 [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 01/12] sched/fair: Move is_core_idle() out of CONFIG_NUMA Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 02/12] sched/fair: Only do asym_packing load balancing from fully idle SMT cores Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 03/12] sched/fair: Simplify asym_packing logic for " Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 04/12] sched/fair: Let low-priority cores help high-priority busy " Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 05/12] sched/fair: Keep a fully_busy SMT sched group as busiest Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-05-12 18:41   ` [PATCH v4 05/12] " Shrikanth Hegde
2023-05-19  0:01     ` Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 06/12] sched/fair: Use the busiest group to set prefer_sibling Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 07/12] sched/fair: Do not even the number of busy CPUs via asym_packing Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 08/12] sched/topology: Check SDF_SHARED_CHILD in highest_flag_domain() Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 09/12] sched/topology: Remove SHARED_CHILD from ASYM_PACKING Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 10/12] x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 11/12] x86/sched/itmt: Give all SMT siblings of a core the same priority Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Ricardo Neri
2023-04-06 20:31 ` [PATCH v4 12/12] x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors Ricardo Neri
2023-05-10 13:49   ` [tip: sched/core] " tip-bot2 for Chen Yu
2023-04-29 15:32 ` [PATCH v4 00/12] sched: Avoid unnecessary migrations within SMT domains Peter Zijlstra
2023-05-01 18:30   ` Tim Chen
2023-05-02  1:42   ` Ricardo Neri
2023-05-02  1:52     ` Steven Rostedt
2023-05-12 18:23   ` Shrikanth Hegde
2023-05-19  0:03     ` Ricardo Neri
2023-05-22  7:55       ` Shrikanth Hegde
