* [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs
@ 2023-07-07 22:56 Tim Chen
  2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
                   ` (5 more replies)
  0 siblings, 6 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton

This is the third version of the patches that fix issues preventing cluster
scheduling on x86 hybrid CPUs.  They address the concerns Peter raised on
the second version.  Please refer to the cover letter of the first version
for the motivation behind this patch series.

Changes from v2:
1. Peter pointed out that biasing the sibling imbalance computation toward
asym packing is unnecessary, since concentrating tasks in the preferred
group would negate the extra turbo headroom advantage.  In v3, we simplify
the computation so the sibling imbalance is proportional only to the number
of cores, and remove the asym packing bias.  We do not lose any performance
and do a bit better than v2.

2. Peter asked whether it is better to round the
sibling_imbalance() computation or to floor it
as in the v2 implementation.  I did find the rounding to be
better for threaded tensor computation, hence v3 adopts rounding
in sibling_imbalance() (a worked example follows this list).  The
performance of both versions is listed in the performance data below.

3. Fix patch 1 to take SMT thread counts greater than 2 into consideration.

4. Various style clean ups suggested by Peter.
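
A worked example of the round vs floor difference (the numbers below are
made up for illustration, and the v2 variant is assumed here to simply
floor the same normalized ratio; see the v2 posting linked under "Past
Versions" for its exact code): suppose the busiest group has 4 cores and
6 running tasks, and the local group has 2 cores and 1 running task.
sibling_imbalance() first computes

	imbalance = ncores_local * busiest->sum_nr_running
		  - ncores_busiest * local->sum_nr_running
		  = 2 * 6 - 4 * 1 = 8

With rounding, the normalized result is (2 * 8 + 2 + 4) / (2 + 4) = 3,
while flooring 2 * 8 / (2 + 4) gives 2, so rounding may let the load
balancer move one more task in such a case.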

Past Versions:
[v1] https://lore.kernel.org/lkml/CAKfTPtD1W6vJQBsNKEt_4tn2EeAs=73CeH4LoCwENrh2JUDwnQ@mail.gmail.com/T/
[v2] https://lore.kernel.org/all/cover.1686263351.git.tim.c.chen@linux.intel.com/ 

v3 Performance numbers:

Single Threaded    6.3-rc5              This version          Improvement      Alternative           Improvement
Benchmark          Baseline             with cluster          in Performance   implementation        in Performance
                                        scheduling                             (floor imbalance)
                                        (round imbalance)
                   (run-run deviation)  (run-run deviation)                    (run-run deviation)
------------------------------------------------------------------------------------------------------------
tjbench            (+/- 0.08%)          (+/- 0.12%)            0.03%           (+/- 0.11%)            0.00%
PhPbench           (+/- 0.31%)          (+/- 0.50%)           +0.19%           (+/- 0.87%)           +0.21%
flac               (+/- 0.58%)          (+/- 0.41%)           +0.48%           (+/- 0.41%)           +1.02%
pybench            (+/- 3.16%)          (+/- 2.87%)           +2.04%           (+/- 2.22%)           +4.25%

Multi Threaded     6.3-rc5              This version          Improvement      Alternative           Improvement
Benchmark          Baseline             with cluster          in Performance   implementation        in Performance
(-#threads)                             scheduling                             (floor imbalance)
                                        (round imbalance)
                   (run-run deviation)  (run-run deviation)                    (run-run deviation)
------------------------------------------------------------------------------------------------------------
Kbuild-8           (+/- 2.90%)          (+/- 0.23%)           -1.10%           (+/- 0.40%)           -1.01%
Kbuild-10          (+/- 3.08%)          (+/- 0.51%)           -1.93%           (+/- 0.49%)           -1.57%
Kbuild-12          (+/- 3.28%)          (+/- 0.39%)           -1.10%           (+/- 0.23%)           -0.98%
Tensor Lite-8      (+/- 4.84%)          (+/- 0.86%)           -1.32%           (+/- 0.58%)           -0.78%
Tensor Lite-10     (+/- 0.87%)          (+/- 0.30%)           +0.68%           (+/- 1.24%)           -0.13%
Tensor Lite-12     (+/- 1.37%)          (+/- 0.82%)           +4.16%           (+/- 1.65%)           +1.19%

Tim


Peter Zijlstra (Intel) (1):
  sched/debug: Dump domains' sched group flags

Ricardo Neri (1):
  sched/fair: Consider the idle state of the whole core for load balance

Tim C Chen (4):
  sched/fair: Determine active load balance for SMT sched groups
  sched/topology: Record number of cores in sched group
  sched/fair: Implement prefer sibling imbalance calculation between
    asymmetric groups
  sched/x86: Add cluster topology to hybrid CPU

 arch/x86/kernel/smpboot.c |   3 +
 kernel/sched/debug.c      |   1 +
 kernel/sched/fair.c       | 137 +++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h      |   1 +
 kernel/sched/topology.c   |  10 ++-
 5 files changed, 143 insertions(+), 9 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-14 13:06   ` Shrikanth Hegde
  2023-07-14 14:53   ` Tobias Huschle
  2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton

From: Tim C Chen <tim.c.chen@linux.intel.com>

On hybrid CPUs with scheduling clusters enabled, we will need to
consider balancing between the SMT CPU clusters and the Atom core clusters.

Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
Each scheduling cluster spans an L2 cache.

          --L2-- --L2-- --L2-- --L2-- ----L2----- -----L2------
          [0, 1] [2, 3] [4, 5] [6, 7] [8 9 10 11] [12 13 14 15]
          Big    Big    Big    Big    Atom        Atom
          core   core   core   core   Module      Module

If the busiest group is a big core with both SMT CPUs busy, we should
do active load balancing if the destination group has idle CPU cores.  Such
a condition is considered by asym_active_balance() during load balancing, but
not when looking for the busiest group and computing the load imbalance.
Add this consideration in find_busiest_group() and calculate_imbalance().

In addition, update the logic that determines the busier group when one group
is SMT and the other group is non-SMT, but both groups are partially busy
with idle CPUs. The busier group should be the group with idle cores rather
than the group with one busy SMT CPU.  We do not want to make the SMT group
the busiest one, pull the only task off the SMT CPU, and cause the whole core
to go empty.

Otherwise, suppose in the search for the busiest group we first encounter
an SMT group with 1 task and set it as the busiest.  The destination
group is an atom cluster with 1 task, and we next encounter an atom
cluster group with 3 tasks.  We will not pick this atom cluster over the
SMT group, even though we should.  As a result, we do not load balance
the busier atom cluster (with 3 tasks) towards the local atom cluster
(with 1 task).  It also doesn't make sense to pick the 1-task SMT group
as the busier group, as we should not pull the task off the SMT core
towards the 1-task atom cluster and make the SMT core completely empty.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87317634fab2..f636d6c09dc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8279,6 +8279,11 @@ enum group_type {
 	 * more powerful CPU.
 	 */
 	group_misfit_task,
+	/*
+	 * Balance SMT group that's fully busy. Can benefit from migration
+	 * a task on SMT with busy sibling to another CPU on idle core.
+	 */
+	group_smt_balance,
 	/*
 	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
 	 * and the task should be migrated to it instead of running on the
@@ -8987,6 +8992,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+	unsigned int group_smt_balance;  /* Task on busy SMT be moved */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (sgs->group_asym_packing)
 		return group_asym_packing;
 
+	if (sgs->group_smt_balance)
+		return group_smt_balance;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
@@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
 
+/* One group has more than one SMT CPU while the other group does not */
+static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
+				    struct sched_group *sg2)
+{
+	if (!sg1 || !sg2)
+		return false;
+
+	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
+		(sg2->flags & SD_SHARE_CPUCAPACITY);
+}
+
+static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (env->idle == CPU_NOT_IDLE)
+		return false;
+
+	/*
+	 * For SMT source group, it is better to move a task
+	 * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
+	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
+	 * will not be on.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY &&
+	    sgs->sum_h_nr_running > 1)
+		return true;
+
+	return false;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_asym_packing = 1;
 	}
 
+	/* Check for loaded SMT group to be balanced to dst CPU */
+	if (!local_group && smt_balance(env, sgs, group))
+		sgs->group_smt_balance = 1;
+
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
@@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			return false;
 		break;
 
+	case group_smt_balance:
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_has_spare:
+		/*
+		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
+		 * as we do not want to pull task off SMT core with one task
+		 * and make the core idle.
+		 */
+		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
+			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
+				return false;
+			else
+				return true;
+		}
+
 		/*
 		 * Select not overloaded group with lowest number of idle cpus
 		 * and highest number of running tasks. We could also compare
@@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those types are not used in the slow wakeup path */
 		return false;
 
@@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those type are not used in the slow wakeup path */
 		return NULL;
 
@@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+	if (busiest->group_type == group_smt_balance) {
+		/* Reduce number of tasks sharing CPU capacity */
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-		if (env->idle == CPU_NOT_IDLE)
+		if (env->idle == CPU_NOT_IDLE) {
 			/*
 			 * If the busiest group is not overloaded (and as a
 			 * result the local one too) but this CPU is already
 			 * busy, let another idle CPU try to pull task.
 			 */
 			goto out_balanced;
+		}
+
+		if (busiest->group_type == group_smt_balance &&
+		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
+			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
+			goto force_balance;
+		}
 
 		if (busiest->group_weight > 1 &&
-		    local->idle_cpus <= (busiest->idle_cpus + 1))
+		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
 			/*
 			 * If the busiest group is not overloaded
 			 * and there is no imbalance between this and busiest
@@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 			 * there is more than 1 CPU per group.
 			 */
 			goto out_balanced;
+		}
 
-		if (busiest->sum_h_nr_running == 1)
+		if (busiest->sum_h_nr_running == 1) {
 			/*
 			 * busiest doesn't have any tasks waiting to run
 			 */
 			goto out_balanced;
+		}
 	}
 
 force_balance:
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
  2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-10 20:33   ` Valentin Schneider
  2023-07-10 22:40   ` Tim Chen
  2023-07-07 22:57 ` [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups Tim Chen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton

From: Tim C Chen <tim.c.chen@linux.intel.com>

When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the number
of cores in each domain. In preparation of implementing such a policy,
record the number of tasks in a scheduling group.
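
As an illustration, using the hybrid topology from patch 1: a sched group
built from one big core would have group_weight = 2 (two SMT siblings) but
cores = 1, while a group built from one Atom module would have
group_weight = cores = 4.  The next patch uses this distinction to balance
on a tasks-per-core basis rather than a tasks-per-CPU basis.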

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d0eb36350d2..5f7f36e45b87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1860,6 +1860,7 @@ struct sched_group {
 	atomic_t		ref;
 
 	unsigned int		group_weight;
+	unsigned int		cores;
 	struct sched_group_capacity *sgc;
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
 	int			flags;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6d5628fcebcf..6b099dbdfb39 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 {
 	struct sched_group *sg = sd->groups;
+	struct cpumask *mask = sched_domains_tmpmask2;
 
 	WARN_ON(!sg);
 
 	do {
-		int cpu, max_cpu = -1;
+		int cpu, cores = 0, max_cpu = -1;
 
 		sg->group_weight = cpumask_weight(sched_group_span(sg));
 
+		cpumask_copy(mask, sched_group_span(sg));
+		for_each_cpu(cpu, mask) {
+			cores++;
+			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
+		}
+		sg->cores = cores;
+
 		if (!(sd->flags & SD_ASYM_PACKING))
 			goto next;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
  2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
  2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-14 13:14   ` Shrikanth Hegde
  2023-07-07 22:57 ` [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance Tim Chen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton

From: Tim C Chen <tim.c.chen@linux.intel.com>

In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and the local sched group are
equivalent, hence the number of tasks to be moved is simply the difference
in the number of tasks between the two groups (i.e. the imbalance) divided
by two.

However, we may have a different number of cores between the cluster groups,
say when we take CPUs offline or when we have hybrid groups.  In that case,
we should balance between the two groups such that the #tasks/#cores ratio
is the same between both groups.  Hence the imbalance computed will need
to reflect this.

Adjust the sibling imbalance computation to take the above
considerations into account.
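
To make the target condition concrete (a sketch, using Nb/Nl for the
running tasks and Cb/Cl for the cores of the busiest and local groups):
moving x tasks from the busiest to the local group equalizes the ratios
when

	(Nb - x) / Cb = (Nl + x) / Cl,  i.e.  x = (Cl * Nb - Cb * Nl) / (Cb + Cl)

sibling_imbalance() below computes this numerator and then normalizes it
against (Cb + Cl) with rounding; when both groups have the same number of
cores it simply returns the task count difference, preserving the current
behavior.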

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f636d6c09dc6..f491b94908bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	return false;
 }
 
+static inline long sibling_imbalance(struct lb_env *env,
+				    struct sd_lb_stats *sds,
+				    struct sg_lb_stats *busiest,
+				    struct sg_lb_stats *local)
+{
+	int ncores_busiest, ncores_local;
+	long imbalance;
+
+	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+		return 0;
+
+	ncores_busiest = sds->busiest->cores;
+	ncores_local = sds->local->cores;
+
+	if (ncores_busiest == ncores_local) {
+		imbalance = busiest->sum_nr_running;
+		lsub_positive(&imbalance, local->sum_nr_running);
+		return imbalance;
+	}
+
+	/* Balance such that nr_running/ncores ratio are same on both groups */
+	imbalance = ncores_local * busiest->sum_nr_running;
+	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
+	/* Normalize imbalance and do rounding on normalization */
+	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
+	imbalance /= ncores_local + ncores_busiest;
+
+	/* Take advantage of resource in an empty sched group */
+	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	    busiest->sum_nr_running > 1)
+		imbalance = 2;
+
+	return imbalance;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
-			unsigned int nr_diff = busiest->sum_nr_running;
 			/*
 			 * When prefer sibling, evenly spread running tasks on
 			 * groups.
 			 */
 			env->migration_type = migrate_task;
-			lsub_positive(&nr_diff, local->sum_nr_running);
-			env->imbalance = nr_diff;
+			env->imbalance = sibling_imbalance(env, sds, busiest, local);
 		} else {
 
 			/*
@@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    sibling_imbalance(env, &sds, busiest, local) > 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
                   ` (2 preceding siblings ...)
  2023-07-07 22:57 ` [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-14 13:02   ` Shrikanth Hegde
  2023-07-07 22:57 ` [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU Tim Chen
  2023-07-07 22:57 ` [Patch v3 6/6] sched/debug: Dump domains' sched group flags Tim Chen
  5 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ricardo Neri, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Tim Chen,
	Valentin Schneider, Ionela Voinescu, x86, linux-kernel,
	Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton

From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
cpus) starting from lower numbered CPUs looking for the first idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers before
non-SMT cores:

	[0, 1] [2, 3] [4, 5] 6 7 8 9
         b  i   b  i   b  i  b i i i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to do idle
load balancing, as it has more CPU capacity than an idle SMT CPU whose
sibling is busy.  So for the figure above, if we are running
should_we_balance() for CPU 1, we should return false to let CPU 7, on an
idle core, have the first chance to do idle load balancing.

A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered SMT cores should not
be considered when finding the first idle CPU.

However, in should_we_balance(), when we encounter an idle SMT CPU first in a
partially busy core, we prematurely end the search for the first idle CPU.

Higher-numbered, non-SMT cores are not given the chance to have
idle balancing done on their behalf. Those CPUs will only be considered
for idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.
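
Walking through the figure above for dst_cpu = 1, when balancing at a
domain above the SMT level (no SD_SHARE_CPUCAPACITY):

  - CPUs 1, 3 and 5 are idle but sit on busy cores, so the first of them
    (CPU 1) is only remembered as idle_smt and the scan continues.
  - CPU 7 is the first idle CPU on a fully idle core, so the scan returns
    (cpu == env->dst_cpu), which is false for dst_cpu = 1.
  - When CPU 7 itself runs should_we_balance(), the same scan returns true,
    so the idle core gets the first chance to balance.
  - If there were no fully idle core, the trailing idle_smt == dst_cpu
    check would still let CPU 1 balance.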

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f491b94908bf..294a662c9410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
-	int cpu;
+	int cpu, idle_smt = -1;
 
 	/*
 	 * Ensure the balancing environment is consistent; can happen
@@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
 		if (!idle_cpu(cpu))
 			continue;
 
+		/*
+		 * Don't balance to idle SMT in busy core right away when
+		 * balancing cores, but remember the first idle SMT CPU for
+		 * later consideration.  Find CPU on an idle core first.
+		 */
+		if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+			if (idle_smt == -1)
+				idle_smt = cpu;
+			continue;
+		}
+
 		/* Are we the first idle CPU? */
 		return cpu == env->dst_cpu;
 	}
 
+	if (idle_smt == env->dst_cpu)
+		return true;
+
 	/* Are we the first CPU of this group ? */
 	return group_balance_cpu(sg) == env->dst_cpu;
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
                   ` (3 preceding siblings ...)
  2023-07-07 22:57 ` [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-08 12:31   ` Peter Zijlstra
  2023-07-07 22:57 ` [Patch v3 6/6] sched/debug: Dump domains' sched group flags Tim Chen
  5 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton, Ricardo Neri

From: Tim C Chen <tim.c.chen@linux.intel.com>

Cluster topology was not enabled on hybrid x86 CPU as load balance
was not properly working for cluster domain.  That has been fixed and
cluster domain can be enabled for hybrid CPU.
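
With this change, the sched domain hierarchy on such a hybrid part is
expected to be SMT -> CLS -> MC, with one CLS domain per L2 cluster
(a big core or an Atom module) and redundant levels degenerating as
usual.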

Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 arch/x86/kernel/smpboot.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index cea297d97034..2489d767c398 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -575,6 +575,9 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [Patch v3 6/6] sched/debug: Dump domains' sched group flags
  2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
                   ` (4 preceding siblings ...)
  2023-07-07 22:57 ` [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU Tim Chen
@ 2023-07-07 22:57 ` Tim Chen
  2023-07-10 20:33   ` Valentin Schneider
  5 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-07 22:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Ionela Voinescu,
	x86, linux-kernel, Shrikanth Hegde, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton

From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

There has been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain was not set and propagated properly when a degenerate
domain was removed.

Add a dump of a CPU's domain sched group flags to make future debugging
easier.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.
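
Since groups_flags reuses sd_flags_fops, its output should look like the
existing per-domain "flags" file, i.e. the names of the flags set on the
domain's groups.  For example, for a domain whose groups are built from
full SMT cores, one would expect SD_SHARE_CPUCAPACITY to show up in
groups_flags once the flag is propagated correctly (the exact list
depends on the system).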

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..55b50f940feb 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -389,6 +389,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
 #undef SDM
 
 	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
+	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
 }
 
 void update_sched_domain_debugfs(void)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU
  2023-07-07 22:57 ` [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU Tim Chen
@ 2023-07-08 12:31   ` Peter Zijlstra
  2023-07-10 16:13     ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-07-08 12:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton, Ricardo Neri

On Fri, Jul 07, 2023 at 03:57:04PM -0700, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 
> Cluster topology was not enabled on hybrid x86 CPU as load balance
> was not properly working for cluster domain.  That has been fixed and
> cluster domain can be enabled for hybrid CPU.
> 
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Yeah, you didn't actually try applying this to something recent did ya
:-)

You missed 8f2d6c41e5a6 ("x86/sched: Rewrite topology setup").

I'll replace this patch with the below.

---
 arch/x86/kernel/smpboot.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ed2d51960a7d..3b751d79cdfb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -632,14 +632,9 @@ static void __init build_sched_topology(void)
 	};
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
-	/*
-	 * For now, skip the cluster domain on Hybrid.
-	 */
-	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) {
-		x86_topology[i++] = (struct sched_domain_topology_level){
-			cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
-		};
-	}
+	x86_topology[i++] = (struct sched_domain_topology_level){
+		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
+	};
 #endif
 #ifdef CONFIG_SCHED_MC
 	x86_topology[i++] = (struct sched_domain_topology_level){

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU
  2023-07-08 12:31   ` Peter Zijlstra
@ 2023-07-10 16:13     ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-10 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton, Ricardo Neri

On Sat, 2023-07-08 at 14:31 +0200, Peter Zijlstra wrote:
> On Fri, Jul 07, 2023 at 03:57:04PM -0700, Tim Chen wrote:
> > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > 
> > Cluster topology was not enabled on hybrid x86 CPU as load balance
> > was not properly working for cluster domain.  That has been fixed and
> > cluster domain can be enabled for hybrid CPU.
> > 
> > Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Yeah, you didn't actually try applying this to something recent did ya
> :-)
> 
> You missed 8f2d6c41e5a6 ("x86/sched: Rewrite topology setup").
> 
> I'll replace this patch with the below.

Thanks for catching it.

Tim

> 
> ---
>  arch/x86/kernel/smpboot.c | 11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index ed2d51960a7d..3b751d79cdfb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -632,14 +632,9 @@ static void __init build_sched_topology(void)
>  	};
>  #endif
>  #ifdef CONFIG_SCHED_CLUSTER
> -	/*
> -	 * For now, skip the cluster domain on Hybrid.
> -	 */
> -	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) {
> -		x86_topology[i++] = (struct sched_domain_topology_level){
> -			cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
> -		};
> -	}
> +	x86_topology[i++] = (struct sched_domain_topology_level){
> +		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
> +	};
>  #endif
>  #ifdef CONFIG_SCHED_MC
>  	x86_topology[i++] = (struct sched_domain_topology_level){


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
@ 2023-07-10 20:33   ` Valentin Schneider
  2023-07-10 22:13     ` Tim Chen
  2023-07-10 22:40   ` Tim Chen
  1 sibling, 1 reply; 55+ messages in thread
From: Valentin Schneider @ 2023-07-10 20:33 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Tim C Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On 07/07/23 15:57, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
>
> When balancing sibling domains that have different number of cores,
> tasks in respective sibling domain should be proportional to the number
> of cores in each domain. In preparation of implementing such a policy,
> record the number of tasks in a scheduling group.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c | 10 +++++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d0eb36350d2..5f7f36e45b87 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1860,6 +1860,7 @@ struct sched_group {
>       atomic_t		ref;
>
>       unsigned int		group_weight;
> +	unsigned int		cores;
>       struct sched_group_capacity *sgc;
>       int			asym_prefer_cpu;	/* CPU of highest priority in group */
>       int			flags;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6d5628fcebcf..6b099dbdfb39 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
>  {
>       struct sched_group *sg = sd->groups;
> +	struct cpumask *mask = sched_domains_tmpmask2;
>
>       WARN_ON(!sg);
>
>       do {
> -		int cpu, max_cpu = -1;
> +		int cpu, cores = 0, max_cpu = -1;
>
>               sg->group_weight = cpumask_weight(sched_group_span(sg));
>
> +		cpumask_copy(mask, sched_group_span(sg));
> +		for_each_cpu(cpu, mask) {
> +			cores++;
> +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> +		}


This rekindled my desire for an SMT core cpumask/iterator. I played around
with a global mask but that's a headache: what if we end up with a core
whose SMT threads are split across two exclusive cpusets?

I ended up necro'ing a patch from Peter [1], but didn't get anywhere nice
(the LLC shared storage caused me issues).

All that to say, I couldn't think of a nicer way :(

[1]: https://lore.kernel.org/all/20180530143106.082002139@infradead.org/#t


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 6/6] sched/debug: Dump domains' sched group flags
  2023-07-07 22:57 ` [Patch v3 6/6] sched/debug: Dump domains' sched group flags Tim Chen
@ 2023-07-10 20:33   ` Valentin Schneider
  0 siblings, 0 replies; 55+ messages in thread
From: Valentin Schneider @ 2023-07-10 20:33 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Ionela Voinescu, x86, linux-kernel,
	Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton

On 07/07/23 15:57, Tim Chen wrote:
> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>
> There have been a case where the SD_SHARE_CPUCAPACITY sched group flag
> in a parent domain were not set and propagated properly when a degenerate
> domain is removed.
>
> Add dump of domain sched group flags of a CPU to make debug easier
> in the future.
>
> Usage:
> cat /debug/sched/domains/cpu0/domain1/groups_flags
> to dump cpu0 domain1's sched group flags.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-10 20:33   ` Valentin Schneider
@ 2023-07-10 22:13     ` Tim Chen
  2023-07-12  9:27       ` Valentin Schneider
  0 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-10 22:13 UTC (permalink / raw)
  To: Valentin Schneider, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Ionela Voinescu, x86, linux-kernel,
	Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton

On Mon, 2023-07-10 at 21:33 +0100, Valentin Schneider wrote:
> On 07/07/23 15:57, Tim Chen wrote:
> > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > 
> > When balancing sibling domains that have different number of cores,
> > tasks in respective sibling domain should be proportional to the number
> > of cores in each domain. In preparation of implementing such a policy,
> > record the number of tasks in a scheduling group.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/sched.h    |  1 +
> >  kernel/sched/topology.c | 10 +++++++++-
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 3d0eb36350d2..5f7f36e45b87 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1860,6 +1860,7 @@ struct sched_group {
> >       atomic_t		ref;
> > 
> >       unsigned int		group_weight;
> > +	unsigned int		cores;
> >       struct sched_group_capacity *sgc;
> >       int			asym_prefer_cpu;	/* CPU of highest priority in group */
> >       int			flags;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 6d5628fcebcf..6b099dbdfb39 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> >  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> >  {
> >       struct sched_group *sg = sd->groups;
> > +	struct cpumask *mask = sched_domains_tmpmask2;
> > 
> >       WARN_ON(!sg);
> > 
> >       do {
> > -		int cpu, max_cpu = -1;
> > +		int cpu, cores = 0, max_cpu = -1;
> > 
> >               sg->group_weight = cpumask_weight(sched_group_span(sg));
> > 
> > +		cpumask_copy(mask, sched_group_span(sg));
> > +		for_each_cpu(cpu, mask) {
> > +			cores++;
> > +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> > +		}
> 
> 
> This rekindled my desire for an SMT core cpumask/iterator. I played around
> with a global mask but that's a headache: what if we end up with a core
> whose SMT threads are split across two exclusive cpusets?

Peter and I pondered that for a while.  But it seems like partitioning
threads in a core between two different sched domains is not a very
reasonable thing to do. 

https://lore.kernel.org/all/20230612112945.GK4253@hirez.programming.kicks-ass.net/

Tim

> 
> I ended up necro'ing a patch from Peter [1], but didn't get anywhere nice
> (the LLC shared storage caused me issues).
> 
> All that to say, I couldn't think of a nicer way :(
> 
> [1]: https://lore.kernel.org/all/20180530143106.082002139@infradead.org/#t
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
  2023-07-10 20:33   ` Valentin Schneider
@ 2023-07-10 22:40   ` Tim Chen
  2023-07-11 11:31     ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-10 22:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Fri, 2023-07-07 at 15:57 -0700, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 
> When balancing sibling domains that have different number of cores,
> tasks in respective sibling domain should be proportional to the number
> of cores in each domain. In preparation of implementing such a policy,
> record the number of tasks in a scheduling group.

Caught a typo.  Should be "the number of cores" instead of
"the number of tasks" in a scheduling group.

Peter, should I send you another patch with the corrected commit log?

Tim

> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c | 10 +++++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d0eb36350d2..5f7f36e45b87 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1860,6 +1860,7 @@ struct sched_group {
>  	atomic_t		ref;
>  
>  	unsigned int		group_weight;
> +	unsigned int		cores;
>  	struct sched_group_capacity *sgc;
>  	int			asym_prefer_cpu;	/* CPU of highest priority in group */
>  	int			flags;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 6d5628fcebcf..6b099dbdfb39 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
>  {
>  	struct sched_group *sg = sd->groups;
> +	struct cpumask *mask = sched_domains_tmpmask2;
>  
>  	WARN_ON(!sg);
>  
>  	do {
> -		int cpu, max_cpu = -1;
> +		int cpu, cores = 0, max_cpu = -1;
>  
>  		sg->group_weight = cpumask_weight(sched_group_span(sg));
>  
> +		cpumask_copy(mask, sched_group_span(sg));
> +		for_each_cpu(cpu, mask) {
> +			cores++;
> +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> +		}
> +		sg->cores = cores;
> +
>  		if (!(sd->flags & SD_ASYM_PACKING))
>  			goto next;
>  


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-10 22:40   ` Tim Chen
@ 2023-07-11 11:31     ` Peter Zijlstra
  2023-07-11 16:32       ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-07-11 11:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Mon, Jul 10, 2023 at 03:40:34PM -0700, Tim Chen wrote:
> On Fri, 2023-07-07 at 15:57 -0700, Tim Chen wrote:
> > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > 
> > When balancing sibling domains that have different number of cores,
> > tasks in respective sibling domain should be proportional to the number
> > of cores in each domain. In preparation of implementing such a policy,
> > record the number of tasks in a scheduling group.
> 
> Caught a typo.  Should be "the number of cores" instead of
> "the number of tasks" in a scheduling group.
> 
> Peter, should I send you another patch with the corrected commit log?

I'll fix it up, already had to fix the patch because due to robot
finding a compile fail for SCHED_SMT=n builds.



> > @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> >  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> >  {
> >  	struct sched_group *sg = sd->groups;
> > +	struct cpumask *mask = sched_domains_tmpmask2;
> >  
> >  	WARN_ON(!sg);
> >  
> >  	do {
> > -		int cpu, max_cpu = -1;
> > +		int cpu, cores = 0, max_cpu = -1;
> >  
> >  		sg->group_weight = cpumask_weight(sched_group_span(sg));
> >  
> > +		cpumask_copy(mask, sched_group_span(sg));
> > +		for_each_cpu(cpu, mask) {
> > +			cores++;
#ifdef CONFIG_SCHED_SMT
> > +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
#else
			__cpumask_clear_cpu(cpu, mask);
#endif

or something along them lines -- should be in queue.git/sched/core
already.

> > +		}
> > +		sg->cores = cores;
> > +
> >  		if (!(sd->flags & SD_ASYM_PACKING))
> >  			goto next;
> >  
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-11 11:31     ` Peter Zijlstra
@ 2023-07-11 16:32       ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-11 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao,
	Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Tue, 2023-07-11 at 13:31 +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:40:34PM -0700, Tim Chen wrote:
> > On Fri, 2023-07-07 at 15:57 -0700, Tim Chen wrote:
> > > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > > 
> > > When balancing sibling domains that have different number of cores,
> > > tasks in respective sibling domain should be proportional to the number
> > > of cores in each domain. In preparation of implementing such a policy,
> > > record the number of tasks in a scheduling group.
> > 
> > Caught a typo.  Should be "the number of cores" instead of
> > "the number of tasks" in a scheduling group.
> > 
> > Peter, should I send you another patch with the corrected commit log?
> 
> I'll fix it up, already had to fix the patch because due to robot
> finding a compile fail for SCHED_SMT=n builds.
> 
> 
> 
> > > @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> > >  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> > >  {
> > >  	struct sched_group *sg = sd->groups;
> > > +	struct cpumask *mask = sched_domains_tmpmask2;
> > >  
> > >  	WARN_ON(!sg);
> > >  
> > >  	do {
> > > -		int cpu, max_cpu = -1;
> > > +		int cpu, cores = 0, max_cpu = -1;
> > >  
> > >  		sg->group_weight = cpumask_weight(sched_group_span(sg));
> > >  
> > > +		cpumask_copy(mask, sched_group_span(sg));
> > > +		for_each_cpu(cpu, mask) {
> > > +			cores++;
> #ifdef CONFIG_SCHED_SMT
> > > +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
> #else
> 			__cpumask_clear_cpu(cpu, mask);

Thanks for fixing up the non-SCHED_SMT case.

I think "__cpumask_clear_cpu(cpu, mask);" can be removed.

Since we have already considered the CPU in the iterator, clearing it
is unnecessary.  So effectively

for_each_cpu(cpu, mask) {
	cores++;
}

should be good enough for the non SCHED_SMT case.  

Or replace the patch with the patch below so we don't
have #ifdef in the middle of the code body.  Either way
is fine.

---

From 9f19714db69739a7985e46bc1f8334d70a69cf2e Mon Sep 17 00:00:00 2001
Message-Id: <9f19714db69739a7985e46bc1f8334d70a69cf2e.1689092923.git.tim.c.chen@linux.intel.com>
In-Reply-To: <cover.1689092923.git.tim.c.chen@linux.intel.com>
References: <cover.1689092923.git.tim.c.chen@linux.intel.com>
From: Tim C Chen <tim.c.chen@linux.intel.com>
Date: Wed, 17 May 2023 09:09:54 -0700
Subject: [Patch v3 2/6] sched/topology: Record number of cores in sched group
To: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Ricardo Neri <ricardo.neri@intel.com>, Ravi V. Shankar <ravi.v.shankar@intel.com>, Ben Segall
<bsegall@google.com>, Daniel Bristot de Oliveira <bristot@redhat.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>, Rafael J. Wysocki
<rafael.j.wysocki@intel.com>, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>, Steven Rostedt <rostedt@goodmis.org>, Tim Chen <tim.c.chen@linux.intel.com>, Valentin Schneider
<vschneid@redhat.com>, Ionela Voinescu <ionela.voinescu@arm.com>, x86@kernel.org, linux-kernel@vger.kernel.org, Shrikanth Hegde <sshegde@linux.vnet.ibm.com>, Srikar Dronamraju
<srikar@linux.vnet.ibm.com>, naveen.n.rao@linux.vnet.ibm.com, Yicong Yang <yangyicong@hisilicon.com>, Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>, Hillf Danton <hdanton@sina.com>

When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the number
of cores in each domain. In preparation of implementing such a policy,
record the number of cores in a scheduling group.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d0eb36350d2..5f7f36e45b87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1860,6 +1860,7 @@ struct sched_group {
 	atomic_t		ref;
 
 	unsigned int		group_weight;
+	unsigned int		cores;
 	struct sched_group_capacity *sgc;
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
 	int			flags;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6d5628fcebcf..4ecdaef3f8ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1262,6 +1262,26 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_SMT
+static inline int sched_group_cores(struct sched_group *sg)
+{
+	struct cpumask *mask = sched_domains_tmpmask2;
+	int cpu, cores = 0;
+
+	cpumask_copy(mask, sched_group_span(sg));
+	for_each_cpu(cpu, mask) {
+		cores++;
+		cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
+	}
+	return cores;
+}
+#else
+static inline int sched_group_cores(struct sched_group *sg)
+{
+	return sg->group_weight;
+}
+#endif
+
 /*
  * Initialize sched groups cpu_capacity.
  *
@@ -1282,6 +1302,7 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 		int cpu, max_cpu = -1;
 
 		sg->group_weight = cpumask_weight(sched_group_span(sg));
+		sg->cores = sched_group_cores(sg);
 
 		if (!(sd->flags & SD_ASYM_PACKING))
 			goto next;
-- 
2.32.0






^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 2/6] sched/topology: Record number of cores in sched group
  2023-07-10 22:13     ` Tim Chen
@ 2023-07-12  9:27       ` Valentin Schneider
  0 siblings, 0 replies; 55+ messages in thread
From: Valentin Schneider @ 2023-07-12  9:27 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Ionela Voinescu, x86, linux-kernel,
	Shrikanth Hegde, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton

On 10/07/23 15:13, Tim Chen wrote:
> On Mon, 2023-07-10 at 21:33 +0100, Valentin Schneider wrote:
>> On 07/07/23 15:57, Tim Chen wrote:
>> > From: Tim C Chen <tim.c.chen@linux.intel.com>
>> >
>> > When balancing sibling domains that have different number of cores,
>> > tasks in respective sibling domain should be proportional to the number
>> > of cores in each domain. In preparation of implementing such a policy,
>> > record the number of tasks in a scheduling group.
>> >
>> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> > ---
>> >  kernel/sched/sched.h    |  1 +
>> >  kernel/sched/topology.c | 10 +++++++++-
>> >  2 files changed, 10 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 3d0eb36350d2..5f7f36e45b87 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -1860,6 +1860,7 @@ struct sched_group {
>> >       atomic_t		ref;
>> >
>> >       unsigned int		group_weight;
>> > +	unsigned int		cores;
>> >       struct sched_group_capacity *sgc;
>> >       int			asym_prefer_cpu;	/* CPU of highest priority in group */
>> >       int			flags;
>> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> > index 6d5628fcebcf..6b099dbdfb39 100644
>> > --- a/kernel/sched/topology.c
>> > +++ b/kernel/sched/topology.c
>> > @@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>> >  static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
>> >  {
>> >       struct sched_group *sg = sd->groups;
>> > +	struct cpumask *mask = sched_domains_tmpmask2;
>> >
>> >       WARN_ON(!sg);
>> >
>> >       do {
>> > -		int cpu, max_cpu = -1;
>> > +		int cpu, cores = 0, max_cpu = -1;
>> >
>> >               sg->group_weight = cpumask_weight(sched_group_span(sg));
>> >
>> > +		cpumask_copy(mask, sched_group_span(sg));
>> > +		for_each_cpu(cpu, mask) {
>> > +			cores++;
>> > +			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
>> > +		}
>>
>>
>> This rekindled my desire for an SMT core cpumask/iterator. I played around
>> with a global mask but that's a headache: what if we end up with a core
>> whose SMT threads are split across two exclusive cpusets?
>
> Peter and I pondered that for a while.  But it seems like partitioning
> threads in a core between two different sched domains is not a very
> reasonable thing to do.
>
> https://lore.kernel.org/all/20230612112945.GK4253@hirez.programming.kicks-ass.net/
>

Thanks for the link. I'll poke at this a bit more, but regardless:

Reviewed-by: Valentin Schneider <vschneid@redhat.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
  2023-07-07 22:57 ` [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance Tim Chen
@ 2023-07-14 13:02   ` Shrikanth Hegde
  2023-07-14 22:16     ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-14 13:02 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ricardo Neri, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton,
	Peter Zijlstra, shrikanth hegde



On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> 
> should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
> cpus) starting from lower numbered CPUs looking for the first idle CPU.
> 
> In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
> non-SMT cores:
> 
> 	[0, 1] [2, 3] [4, 5] 6 7 8 9
>          b  i   b  i   b  i  b i i i
> 
> In the figure above, CPUs in brackets are siblings of an SMT core. The
> rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
> idle CPU.
> 
> We should let a CPU on a fully idle core get the first chance to idle
> load balance as it has more CPU capacity than a CPU on an idle SMT
> CPU with busy sibling.  So for the figure above, if we are running
> should_we_balance() to CPU 1, we should return false to let CPU 7 on
> idle core to have a chance first to idle load balance.
> 
> A partially busy (i.e., of type group_has_spare) local group with SMT 
> cores will often have only one SMT sibling busy. If the destination CPU
> is a non-SMT core, partially busy, lower-numbered, SMT cores should not
> be considered when finding the first idle CPU. 
> 
> However, in should_we_balance(), when we encounter idle SMT first in partially
> busy core, we prematurely break the search for the first idle CPU.
> 
> Higher-numbered, non-SMT cores is not given the chance to have
> idle balance done on their behalf. Those CPUs will only be considered
> for idle balancing by chance via CPU_NEWLY_IDLE.
> 
> Instead, consider the idle state of the whole SMT core.
> 
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f491b94908bf..294a662c9410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
>  static int should_we_balance(struct lb_env *env)
>  {
>  	struct sched_group *sg = env->sd->groups;
> -	int cpu;
> +	int cpu, idle_smt = -1;
> 
>  	/*
>  	 * Ensure the balancing environment is consistent; can happen
> @@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
>  		if (!idle_cpu(cpu))
>  			continue;
> 
> +		/*
> +		 * Don't balance to idle SMT in busy core right away when
> +		 * balancing cores, but remember the first idle SMT CPU for
> +		 * later consideration.  Find CPU on an idle core first.
> +		 */
> +		if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> +			if (idle_smt == -1)
> +				idle_smt = cpu;
> +			continue;
> +		}
> +
>  		/* Are we the first idle CPU? */
>  		return cpu == env->dst_cpu;
>  	}
> 
> +	if (idle_smt == env->dst_cpu)
> +		return true;


This is nice. It helps in reducing migrations and slightly improves the performance
of CPU-oriented benchmarks, which could be due to the fewer migrations.

Tested a bit on power10 with SMT=4. Offlined a CPU to make a few cores SMT=1. There is
no regression observed. Slight improvement in throughput-oriented workloads such as stress-ng.
Migrations are reduced by quite a bit, likely due to the patch. The numbers below were collected with
[4/6] sched/fair: Consider the idle state of the whole core for load balance applied.

Test results: 

# lscpu
Architecture:            ppc64le
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-17,24,25,32-49,56-89
  Off-line CPU(s) list:  18-23,26-31,50-55,90-95
Model name:              POWER10 (architected), altivec supported

--------------------------------------------------------------------------------------------------
baseline:

 Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        260,813.13 msec task-clock                #   33.390 CPUs utilized            ( +-  0.10% )
            42,535      context-switches          #  163.543 /sec                     ( +-  0.13% )
             9,060      cpu-migrations            #   34.835 /sec                     ( +-  1.07% )
            12,947      page-faults               #   49.780 /sec                     ( +-  1.76% )
   948,061,954,432      cycles                    #    3.645 GHz                      ( +-  0.09% )
   926,045,701,578      instructions              #    0.98  insn per cycle           ( +-  0.00% )
   146,418,075,496      branches                  #  562.964 M/sec                    ( +-  0.00% )
     1,197,661,965      branch-misses             #    0.82% of all branches          ( +-  0.17% )

            7.8111 +- 0.0162 seconds time elapsed  ( +-  0.21% )

 Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        253,351.70 msec task-clock                #   28.207 CPUs utilized            ( +-  0.21% )
            41,046      context-switches          #  162.828 /sec                     ( +-  0.16% )
             6,674      cpu-migrations            #   26.475 /sec                     ( +-  3.42% )
            10,879      page-faults               #   43.157 /sec                     ( +-  1.68% )
   931,014,218,983      cycles                    #    3.693 GHz                      ( +-  0.22% )
   919,717,564,454      instructions              #    0.99  insn per cycle           ( +-  0.00% )
   145,480,596,331      branches                  #  577.116 M/sec                    ( +-  0.00% )
     1,175,362,979      branch-misses             #    0.81% of all branches          ( +-  0.12% )

            8.9818 +- 0.0288 seconds time elapsed  ( +-  0.32% )


---------------------------------------------------------------------------------------------------
with patch:

 Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        254,652.01 msec task-clock                #   33.449 CPUs utilized            ( +-  0.11% )
            40,970      context-switches          #  160.974 /sec                     ( +-  0.10% )
             5,397      cpu-migrations            #   21.205 /sec                     ( +-  2.01% )
            11,705      page-faults               #   45.990 /sec                     ( +-  1.21% )
   911,115,537,080      cycles                    #    3.580 GHz                      ( +-  0.11% )
   925,635,958,489      instructions              #    1.02  insn per cycle           ( +-  0.00% )
   146,450,995,164      branches                  #  575.416 M/sec                    ( +-  0.00% )
     1,188,906,011      branch-misses             #    0.81% of all branches          ( +-  0.28% )

            7.6132 +- 0.0381 seconds time elapsed  ( +-  0.50% )

 Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        236,962.38 msec task-clock                #   27.948 CPUs utilized            ( +-  0.05% )
            40,030      context-switches          #  168.869 /sec                     ( +-  0.04% )
             3,156      cpu-migrations            #   13.314 /sec                     ( +-  1.37% )
             9,448      page-faults               #   39.857 /sec                     ( +-  1.72% )
   856,444,937,794      cycles                    #    3.613 GHz                      ( +-  0.06% )
   919,459,795,805      instructions              #    1.07  insn per cycle           ( +-  0.00% )
   145,654,799,033      branches                  #  614.452 M/sec                    ( +-  0.00% )
     1,177,464,719      branch-misses             #    0.81% of all branches          ( +-  0.23% )

            8.4788 +- 0.0323 seconds time elapsed  ( +-  0.38% )
--------------------------------------------------------------------------------------------------------

Tried on a symmetric system with all cores having SMT=4 as well. There was a reduction in migrations here as well.
Didn't observe any major regressions when microbenchmarks such as hackbench and stress-ng are run alone.

So, here is my Tested-by:
Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>



> +
>  	/* Are we the first CPU of this group ? */
>  	return group_balance_cpu(sg) == env->dst_cpu;
>  }

One doubt though: here a fully idle core would be chosen instead of the first idle CPU in the
group (if there is one). Since coming out of idle on an SMT sibling is faster compared to a fully idle core,
would latency increase? Or does that concern mainly the wakeup path?
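
To make that scenario concrete, below is a toy user-space walk of the new
selection logic on the changelog topology (SMT cores {0,1} {2,3} {4,5},
non-SMT CPUs 6-9, with CPUs 0, 2, 4 and 6 busy). The dst_cpu check and the
group-balance-CPU fallback are left out, so it just returns the CPU that
would get to balance:

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 10

static const int  sibling[NR_CPUS] = { 1, 0, 3, 2, 5, 4, 6, 7, 8, 9 };
static const bool busy[NR_CPUS]    = { 1, 0, 1, 0, 1, 0, 1, 0, 0, 0 };

static bool core_idle(int cpu)
{
	return !busy[cpu] && !busy[sibling[cpu]];
}

/* Toy walk mirroring the CPU loop in the patched should_we_balance(). */
static int pick_balance_cpu(void)
{
	int cpu, idle_smt = -1;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (busy[cpu])
			continue;
		if (!core_idle(cpu)) {
			/* idle SMT CPU on a busy core: remember it, keep looking */
			if (idle_smt == -1)
				idle_smt = cpu;
			continue;
		}
		/* first idle CPU on a fully idle core */
		return cpu;
	}
	return idle_smt;
}

int main(void)
{
	printf("balance CPU: %d\n", pick_balance_cpu());	/* prints 7, not 1 */
	return 0;
}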

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
@ 2023-07-14 13:06   ` Shrikanth Hegde
  2023-07-14 23:05     ` Tim Chen
  2023-07-14 14:53   ` Tobias Huschle
  1 sibling, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-14 13:06 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Peter Zijlstra,
	shrikanth hegde



On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 

Hi Tim.  Sorry for the delayed response.

> On hybrid CPUs with scheduling cluster enabled, we will need to
> consider balancing between SMT CPU cluster, and Atom core cluster.
> 
> Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
> Each scheduling cluster span a L2 cache.
> 
>           --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
>           [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
>           Big    Big    Big    Big    Atom       Atom
>           core   core   core   core   Module     Module
> 
> If the busiest group is a big core with both SMT CPUs busy, we should
> active load balance if destination group has idle CPU cores.  Such
> condition is considered by asym_active_balance() in load balancing but not
> considered when looking for busiest group and computing load imbalance.
> Add this consideration in find_busiest_group() and calculate_imbalance().
> 
> In addition, update the logic determining the busier group when one group
> is SMT and the other group is non SMT but both groups are partially busy
> with idle CPU. The busier group should be the group with idle cores rather
> than the group with one busy SMT CPU.  We do not want to make the SMT group
> the busiest one to pull the only task off SMT CPU and causing the whole core to
> go empty.
> 
> Otherwise suppose in the search for the busiest group, we first encounter
> an SMT group with 1 task and set it as the busiest.  The destination
> group is an atom cluster with 1 task and we next encounter an atom
> cluster group with 3 tasks, we will not pick this atom cluster over the
> SMT group, even though we should.  As a result, we do not load balance
> the busier Atom cluster (with 3 tasks) towards the local atom cluster
> (with 1 task).  And it doesn't make sense to pick the 1 task SMT group
> as the busier group as we also should not pull task off the SMT towards
> the 1 task atom cluster and make the SMT core completely empty.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 77 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87317634fab2..f636d6c09dc6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8279,6 +8279,11 @@ enum group_type {
>  	 * more powerful CPU.
>  	 */
>  	group_misfit_task,
> +	/*
> +	 * Balance SMT group that's fully busy. Can benefit from migration
> +	 * a task on SMT with busy sibling to another CPU on idle core.
> +	 */
> +	group_smt_balance,

Could you please explain what group_smt_balance does differently? AFAIU it is doing the same
thing as group_fully_busy, but for one domain above the SMT domains, right?


>  	/*
>  	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
>  	 * and the task should be migrated to it instead of running on the
> @@ -8987,6 +8992,7 @@ struct sg_lb_stats {
>  	unsigned int group_weight;
>  	enum group_type group_type;
>  	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> +	unsigned int group_smt_balance;  /* Task on busy SMT be moved */
>  	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int nr_numa_running;
> @@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
>  	if (sgs->group_asym_packing)
>  		return group_asym_packing;
> 
> +	if (sgs->group_smt_balance)
> +		return group_smt_balance;
> +
>  	if (sgs->group_misfit_task_load)
>  		return group_misfit_task;
> 
> @@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs
>  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }
> 
> +/* One group has more than one SMT CPU while the other group does not */
> +static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> +				    struct sched_group *sg2)
> +{
> +	if (!sg1 || !sg2)
> +		return false;
> +
> +	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
> +		(sg2->flags & SD_SHARE_CPUCAPACITY);
> +}
> +
> +static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	if (env->idle == CPU_NOT_IDLE)
> +		return false;
> +
> +	/*
> +	 * For SMT source group, it is better to move a task
> +	 * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
> +	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> +	 * will not be on.
> +	 */
> +	if (group->flags & SD_SHARE_CPUCAPACITY &&
> +	    sgs->sum_h_nr_running > 1)
> +		return true;
> +

If we consider symmetric platforms which have SMT4 such as power10. 
we have a topology like below. multiple such MC will form DIE(PKG)


[0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
[--SMT--][--SMT--][----SMT---][---SMT----]
[--sg1--][--sg1--][---sg1----][---sg1----]
[--------------MC------------------------]

In case of SMT4, if there is any group which has 2 or more tasks, that
group will be marked as group_smt_balance. Previously, if that group had 2
or 3 tasks, it would have been marked as group_has_spare. Since all the groups have
SMT, that means the behavior would be the same as fully busy, right? That can cause some
corner cases. No?

One example: let's say sg1 has 4 tasks, sg2 has 0 tasks, and sg2 is trying to do
load balance. Previously the imbalance would have been 2, whereas now the imbalance would be 1.
But in a subsequent load balance it would be balanced.



> +	return false;
> +}
> +
>  static inline bool
>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>  {
> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->group_asym_packing = 1;
>  	}
> 
> +	/* Check for loaded SMT group to be balanced to dst CPU */
> +	if (!local_group && smt_balance(env, sgs, group))
> +		sgs->group_smt_balance = 1;
> +
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
> 
>  	/* Computing avg_load makes sense only when group is overloaded */
> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			return false;
>  		break;
> 
> +	case group_smt_balance:
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
> 
>  	case group_has_spare:
> +		/*
> +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
> +		 * as we do not want to pull task off SMT core with one task
> +		 * and make the core idle.
> +		 */
> +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> +				return false;
> +			else
> +				return true;> +		}
> +
>  		/*
>  		 * Select not overloaded group with lowest number of idle cpus
>  		 * and highest number of running tasks. We could also compare
> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
> 
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those types are not used in the slow wakeup path */
>  		return false;
> 
> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> 
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those type are not used in the slow wakeup path */
>  		return NULL;
> 
> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		return;
>  	}
> 
> +	if (busiest->group_type == group_smt_balance) {
> +		/* Reduce number of tasks sharing CPU capacity */
> +		env->migration_type = migrate_task;
> +		env->imbalance = 1;
> +		return;
> +	}
> +
>  	if (busiest->group_type == group_imbalanced) {
>  		/*
>  		 * In the group_imb case we cannot rely on group-wide averages
> @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto force_balance;
> 
>  	if (busiest->group_type != group_overloaded) {
> -		if (env->idle == CPU_NOT_IDLE)
> +		if (env->idle == CPU_NOT_IDLE) {
>  			/*
>  			 * If the busiest group is not overloaded (and as a
>  			 * result the local one too) but this CPU is already
>  			 * busy, let another idle CPU try to pull task.
>  			 */
>  			goto out_balanced;
> +		}
> +
> +		if (busiest->group_type == group_smt_balance &&
> +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
> +			goto force_balance;
> +		}
> 
>  		if (busiest->group_weight > 1 &&
> -		    local->idle_cpus <= (busiest->idle_cpus + 1))
> +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
>  			/*
>  			 * If the busiest group is not overloaded
>  			 * and there is no imbalance between this and busiest
> @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  			 * there is more than 1 CPU per group.
>  			 */
>  			goto out_balanced;
> +		}
> 
> -		if (busiest->sum_h_nr_running == 1)
> +		if (busiest->sum_h_nr_running == 1) {
>  			/*
>  			 * busiest doesn't have any tasks waiting to run
>  			 */
>  			goto out_balanced;
> +		}
>  	}
> 
>  force_balance:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-07 22:57 ` [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups Tim Chen
@ 2023-07-14 13:14   ` Shrikanth Hegde
  2023-07-14 14:22     ` Tobias Huschle
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-14 13:14 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, shrikanth hegde,
	Tobias Huschle



On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 
> In the current prefer sibling load balancing code, there is an implicit
> assumption that the busiest sched group and local sched group are
> equivalent, hence the tasks to be moved is simply the difference in
> number of tasks between the two groups (i.e. imbalance) divided by two.
> 
> However, we may have different number of cores between the cluster groups,
> say when we take CPU offline or we have hybrid groups.  In that case,
> we should balance between the two groups such that #tasks/#cores ratio
> is the same between the same between both groups.  Hence the imbalance

nit: typo here. "the same between" is repeated.

> computed will need to reflect this.
> 
> Adjust the sibling imbalance computation to take into account of the
> above considerations.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
>  1 file changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f636d6c09dc6..f491b94908bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>  	return false;
>  }
> 
> +static inline long sibling_imbalance(struct lb_env *env,
> +				    struct sd_lb_stats *sds,
> +				    struct sg_lb_stats *busiest,
> +				    struct sg_lb_stats *local)
> +{
> +	int ncores_busiest, ncores_local;
> +	long imbalance;

Can imbalance be unsigned int or unsigned long, given that sum_nr_running is unsigned int?

> +
> +	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> +		return 0;
> +
> +	ncores_busiest = sds->busiest->cores;
> +	ncores_local = sds->local->cores;
> +
> +	if (ncores_busiest == ncores_local) {
> +		imbalance = busiest->sum_nr_running;
> +		lsub_positive(&imbalance, local->sum_nr_running);
> +		return imbalance;
> +	}
> +
> +	/* Balance such that nr_running/ncores ratio are same on both groups */
> +	imbalance = ncores_local * busiest->sum_nr_running;
> +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
> +	/* Normalize imbalance and do rounding on normalization */
> +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
> +	imbalance /= ncores_local + ncores_busiest;
> +

Could this work for the case where the number of CPUs/cores differs
between two sched groups in a sched domain? Such as the problem pointed
out by Tobias on s390. It would be nice if this patch could work for that case
as well. Ran numbers for a few cases. It looks to work.
https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901



> +	/* Take advantage of resource in an empty sched group */
> +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	    busiest->sum_nr_running > 1)
> +		imbalance = 2;
> +

I don't see how this case would be true. When there are unequal numbers of cores, local->sum_nr_running
is 0, and busiest->sum_nr_running is at least 2, the imbalance will be at least 1.
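
As a quick sanity check of both points above, here is a stand-alone user-space
rerun of the quoted arithmetic (lsub_positive() is open-coded as a clamp at
zero, and the example core counts are made up for illustration):

#include <stdio.h>

/* Mirrors the arithmetic of the quoted sibling_imbalance(). */
static long sibling_imbalance_sketch(int ncores_busiest, int ncores_local,
				     int busiest_nr, int local_nr)
{
	long imbalance;

	if (ncores_busiest == ncores_local) {
		imbalance = busiest_nr - local_nr;
		return imbalance > 0 ? imbalance : 0;
	}

	/* Balance such that nr_running/ncores is the same on both groups */
	imbalance = (long)ncores_local * busiest_nr -
		    (long)ncores_busiest * local_nr;
	if (imbalance < 0)
		imbalance = 0;
	/* Normalize imbalance and do rounding on normalization */
	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
	imbalance /= ncores_local + ncores_busiest;

	/* Take advantage of resource in an empty sched group */
	if (imbalance == 0 && local_nr == 0 && busiest_nr > 1)
		imbalance = 2;

	return imbalance;
}

int main(void)
{
	/* unequal cores, local empty, busiest has 2 tasks: already >= 1 */
	printf("%ld\n", sibling_imbalance_sketch(8, 1, 2, 0));	/* prints 1 */
	/* equal cores: plain task difference */
	printf("%ld\n", sibling_imbalance_sketch(4, 4, 6, 2));	/* prints 4 */
	/* unequal cores: roughly even out nr_running per core */
	printf("%ld\n", sibling_imbalance_sketch(4, 2, 6, 1));	/* prints 3 */
	return 0;
}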


Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

> +	return imbalance;
> +}
> +
>  static inline bool
>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>  {
> @@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		}
> 
>  		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> -			unsigned int nr_diff = busiest->sum_nr_running;
>  			/*
>  			 * When prefer sibling, evenly spread running tasks on
>  			 * groups.
>  			 */
>  			env->migration_type = migrate_task;
> -			lsub_positive(&nr_diff, local->sum_nr_running);
> -			env->imbalance = nr_diff;
> +			env->imbalance = sibling_imbalance(env, sds, busiest, local);
>  		} else {
> 
>  			/*
> @@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 * group's child domain.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> +	    sibling_imbalance(env, &sds, busiest, local) > 1)
>  		goto force_balance;
> 
>  	if (busiest->group_type != group_overloaded) {


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-14 13:14   ` Shrikanth Hegde
@ 2023-07-14 14:22     ` Tobias Huschle
  2023-07-14 23:35       ` Tim Chen
  2023-07-14 20:44     ` Tim Chen
  2023-07-15  0:11     ` Tim Chen
  2 siblings, 1 reply; 55+ messages in thread
From: Tobias Huschle @ 2023-07-14 14:22 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Tim Chen, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Ricardo Neri, Ravi V . Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton

On 2023-07-14 15:14, Shrikanth Hegde wrote:
> On 7/8/23 4:27 AM, Tim Chen wrote:
>> From: Tim C Chen <tim.c.chen@linux.intel.com>
>> 
>> In the current prefer sibling load balancing code, there is an 
>> implicit
>> assumption that the busiest sched group and local sched group are
>> equivalent, hence the tasks to be moved is simply the difference in
>> number of tasks between the two groups (i.e. imbalance) divided by 
>> two.
>> 
>> However, we may have different number of cores between the cluster 
>> groups,
>> say when we take CPU offline or we have hybrid groups.  In that case,
>> we should balance between the two groups such that #tasks/#cores ratio
>> is the same between the same between both groups.  Hence the imbalance
> 
> nit: type here. the same between is repeated.
> 
>> computed will need to reflect this.
>> 
>> Adjust the sibling imbalance computation to take into account of the
>> above considerations.
>> 
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 37 insertions(+), 4 deletions(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f636d6c09dc6..f491b94908bf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env 
>> *env, struct sg_lb_stats *sgs,
>>  	return false;
>>  }
>> 
>> +static inline long sibling_imbalance(struct lb_env *env,
>> +				    struct sd_lb_stats *sds,
>> +				    struct sg_lb_stats *busiest,
>> +				    struct sg_lb_stats *local)
>> +{
>> +	int ncores_busiest, ncores_local;
>> +	long imbalance;
> 
> can imbalance be unsigned int or unsigned long? as sum_nr_running is
> unsigned int.
> 
>> +
>> +	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
>> +		return 0;
>> +
>> +	ncores_busiest = sds->busiest->cores;
>> +	ncores_local = sds->local->cores;
>> +
>> +	if (ncores_busiest == ncores_local) {
>> +		imbalance = busiest->sum_nr_running;
>> +		lsub_positive(&imbalance, local->sum_nr_running);
>> +		return imbalance;
>> +	}
>> +
>> +	/* Balance such that nr_running/ncores ratio are same on both groups 
>> */
>> +	imbalance = ncores_local * busiest->sum_nr_running;
>> +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
>> +	/* Normalize imbalance and do rounding on normalization */
>> +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
>> +	imbalance /= ncores_local + ncores_busiest;
>> +
> 
> Could this work for case where number of CPU/cores would differ
> between two sched groups in a sched domain? Such as problem pointed
> by tobias on S390. It would be nice if this patch can work for that 
> case
> as well. Ran numbers for a few cases. It looks to work.
> https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901
> 


Just stumbled upon this patch series as well. In this version it looks
similar to the prototypes I played around with, but more complete.
So I'm happy that my understanding of the load balancer was kinda 
correct :)

From a functional perspective this appears to address the issues we saw 
on s390.

> 
> 
>> +	/* Take advantage of resource in an empty sched group */
>> +	if (imbalance == 0 && local->sum_nr_running == 0 &&
>> +	    busiest->sum_nr_running > 1)
>> +		imbalance = 2;
>> +
> 
> I don't see how this case would be true. When there are unequal number
> of cores and local->sum_nr_ruuning
> is 0, and busiest->sum_nr_running is atleast 2, imbalance will be 
> atleast 1.
> 
> 
> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> 
>> +	return imbalance;
>> +}
>> +
>>  static inline bool
>>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>  {
>> @@ -10230,14 +10265,12 @@ static inline void 
>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  		}
>> 
>>  		if (busiest->group_weight == 1 || sds->prefer_sibling) {
>> -			unsigned int nr_diff = busiest->sum_nr_running;
>>  			/*
>>  			 * When prefer sibling, evenly spread running tasks on
>>  			 * groups.
>>  			 */
>>  			env->migration_type = migrate_task;
>> -			lsub_positive(&nr_diff, local->sum_nr_running);
>> -			env->imbalance = nr_diff;
>> +			env->imbalance = sibling_imbalance(env, sds, busiest, local);
>>  		} else {
>> 
>>  			/*
>> @@ -10424,7 +10457,7 @@ static struct sched_group 
>> *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    sibling_imbalance(env, &sds, busiest, local) > 1)
>>  		goto force_balance;
>> 
>>  	if (busiest->group_type != group_overloaded) {

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
  2023-07-14 13:06   ` Shrikanth Hegde
@ 2023-07-14 14:53   ` Tobias Huschle
  2023-07-14 23:29     ` Tim Chen
  1 sibling, 1 reply; 55+ messages in thread
From: Tobias Huschle @ 2023-07-14 14:53 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton

On 2023-07-08 00:57, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 
> On hybrid CPUs with scheduling cluster enabled, we will need to
> consider balancing between SMT CPU cluster, and Atom core cluster.
> 
> Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
> Each scheduling cluster span a L2 cache.
> 
>           --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
>           [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
>           Big    Big    Big    Big    Atom       Atom
>           core   core   core   core   Module     Module
> 
> If the busiest group is a big core with both SMT CPUs busy, we should
> active load balance if destination group has idle CPU cores.  Such
> condition is considered by asym_active_balance() in load balancing but 
> not
> considered when looking for busiest group and computing load imbalance.
> Add this consideration in find_busiest_group() and 
> calculate_imbalance().
> 
> In addition, update the logic determining the busier group when one 
> group
> is SMT and the other group is non SMT but both groups are partially 
> busy
> with idle CPU. The busier group should be the group with idle cores 
> rather
> than the group with one busy SMT CPU.  We do not want to make the SMT 
> group
> the busiest one to pull the only task off SMT CPU and causing the whole 
> core to
> go empty.
> 
> Otherwise suppose in the search for the busiest group, we first 
> encounter
> an SMT group with 1 task and set it as the busiest.  The destination
> group is an atom cluster with 1 task and we next encounter an atom
> cluster group with 3 tasks, we will not pick this atom cluster over the
> SMT group, even though we should.  As a result, we do not load balance
> the busier Atom cluster (with 3 tasks) towards the local atom cluster
> (with 1 task).  And it doesn't make sense to pick the 1 task SMT group
> as the busier group as we also should not pull task off the SMT towards
> the 1 task atom cluster and make the SMT core completely empty.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 77 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87317634fab2..f636d6c09dc6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8279,6 +8279,11 @@ enum group_type {
>  	 * more powerful CPU.
>  	 */
>  	group_misfit_task,
> +	/*
> +	 * Balance SMT group that's fully busy. Can benefit from migration
> +	 * a task on SMT with busy sibling to another CPU on idle core.
> +	 */
> +	group_smt_balance,

Would it make sense to move smt_balance, s.t. we get:

group_has_spare < group_smt_balance < group_fully_busy

Conceptually this would be more intuitive to me, as the
smt_balance groups are busier than the has_spare ones, but less busy
than the fully_busy ones.

From a functional perspective I could also see some impact when
update_sd_pick_busiest() compares the group types. In that case we
would remove tasks from fully busy groups before moving them
from smt_balance groups. Not sure which way would be preferable
to increase overall throughput.

Since smt_balance is only selected if the group has SMT, this
should still not pull the last task off of a non-SMT CPU.
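
Roughly, what I have in mind would be something like the sketch below (only
the slot of group_smt_balance changes; everything else keeps the ordering of
the posted patch):

enum group_type_sketch {
	group_has_spare = 0,
	/*
	 * Proposed slot: such groups are busier than has_spare ones,
	 * but not as busy as fully_busy ones.
	 */
	group_smt_balance,
	group_fully_busy,
	/* group_misfit_task, group_asym_packing, etc. follow unchanged */
};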

>  	/*
>  	 * SD_ASYM_PACKING only: One local CPU with higher capacity is 
> available,
>  	 * and the task should be migrated to it instead of running on the
> @@ -8987,6 +8992,7 @@ struct sg_lb_stats {
>  	unsigned int group_weight;
>  	enum group_type group_type;
>  	unsigned int group_asym_packing; /* Tasks should be moved to 
> preferred CPU */
> +	unsigned int group_smt_balance;  /* Task on busy SMT be moved */
>  	unsigned long group_misfit_task_load; /* A CPU has a task too big
> for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int nr_numa_running;
> @@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int 
> imbalance_pct,
>  	if (sgs->group_asym_packing)
>  		return group_asym_packing;
> 
> +	if (sgs->group_smt_balance)
> +		return group_smt_balance;
> +
>  	if (sgs->group_misfit_task_load)
>  		return group_misfit_task;
> 
> @@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct
> sd_lb_stats *sds,  struct sg_lb_stats *sgs
>  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }
> 
> +/* One group has more than one SMT CPU while the other group does not 
> */
> +static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> +				    struct sched_group *sg2)
> +{
> +	if (!sg1 || !sg2)
> +		return false;
> +
> +	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
> +		(sg2->flags & SD_SHARE_CPUCAPACITY);
> +}
> +
> +static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats 
> *sgs,
> +			       struct sched_group *group)
> +{
> +	if (env->idle == CPU_NOT_IDLE)
> +		return false;
> +
> +	/*
> +	 * For SMT source group, it is better to move a task
> +	 * to a CPU that doesn't have multiple tasks sharing its CPU 
> capacity.
> +	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> +	 * will not be on.
> +	 */
> +	if (group->flags & SD_SHARE_CPUCAPACITY &&
> +	    sgs->sum_h_nr_running > 1)
> +		return true;
> +
> +	return false;
> +}
> +
>  static inline bool
>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>  {
> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct 
> lb_env *env,
>  		sgs->group_asym_packing = 1;
>  	}
> 
> +	/* Check for loaded SMT group to be balanced to dst CPU */
> +	if (!local_group && smt_balance(env, sgs, group))
> +		sgs->group_smt_balance = 1;
> +
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
> 
>  	/* Computing avg_load makes sense only when group is overloaded */
> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env 
> *env,
>  			return false;
>  		break;
> 
> +	case group_smt_balance:
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env 
> *env,
>  		break;
> 
>  	case group_has_spare:
> +		/*
> +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
> +		 * as we do not want to pull task off SMT core with one task
> +		 * and make the core idle.
> +		 */
> +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> +				return false;
> +			else
> +				return true;
> +		}
> +
>  		/*
>  		 * Select not overloaded group with lowest number of idle cpus
>  		 * and highest number of running tasks. We could also compare
> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group 
> *idlest,
> 
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those types are not used in the slow wakeup path */
>  		return false;
> 
> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd,
> struct task_struct *p, int this_cpu)
> 
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those type are not used in the slow wakeup path */
>  		return NULL;
> 
> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
>  		return;
>  	}
> 
> +	if (busiest->group_type == group_smt_balance) {
> +		/* Reduce number of tasks sharing CPU capacity */
> +		env->migration_type = migrate_task;
> +		env->imbalance = 1;
> +		return;
> +	}
> +
>  	if (busiest->group_type == group_imbalanced) {
>  		/*
>  		 * In the group_imb case we cannot rely on group-wide averages
> @@ -10363,16 +10428,23 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
>  		goto force_balance;
> 
>  	if (busiest->group_type != group_overloaded) {
> -		if (env->idle == CPU_NOT_IDLE)
> +		if (env->idle == CPU_NOT_IDLE) {
>  			/*
>  			 * If the busiest group is not overloaded (and as a
>  			 * result the local one too) but this CPU is already
>  			 * busy, let another idle CPU try to pull task.
>  			 */
>  			goto out_balanced;
> +		}
> +
> +		if (busiest->group_type == group_smt_balance &&
> +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
> +			goto force_balance;
> +		}
> 
>  		if (busiest->group_weight > 1 &&
> -		    local->idle_cpus <= (busiest->idle_cpus + 1))
> +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
>  			/*
>  			 * If the busiest group is not overloaded
>  			 * and there is no imbalance between this and busiest
> @@ -10383,12 +10455,14 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
>  			 * there is more than 1 CPU per group.
>  			 */
>  			goto out_balanced;
> +		}
> 
> -		if (busiest->sum_h_nr_running == 1)
> +		if (busiest->sum_h_nr_running == 1) {
>  			/*
>  			 * busiest doesn't have any tasks waiting to run
>  			 */
>  			goto out_balanced;
> +		}
>  	}
> 
>  force_balance:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-14 13:14   ` Shrikanth Hegde
  2023-07-14 14:22     ` Tobias Huschle
@ 2023-07-14 20:44     ` Tim Chen
  2023-07-14 23:23       ` Tim Chen
  2023-07-15  0:11     ` Tim Chen
  2 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-14 20:44 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Tobias Huschle

On Fri, 2023-07-14 at 18:44 +0530, Shrikanth Hegde wrote:
> 
> On 7/8/23 4:27 AM, Tim Chen wrote:
> > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > 
> > In the current prefer sibling load balancing code, there is an implicit
> > assumption that the busiest sched group and local sched group are
> > equivalent, hence the tasks to be moved is simply the difference in
> > number of tasks between the two groups (i.e. imbalance) divided by two.
> > 
> > However, we may have different number of cores between the cluster groups,
> > say when we take CPU offline or we have hybrid groups.  In that case,
> > we should balance between the two groups such that #tasks/#cores ratio
> > is the same between the same between both groups.  Hence the imbalance
> 
> nit: type here. the same between is repeated.

Thanks for catching.

> 
> > computed will need to reflect this.
> > 
> > Adjust the sibling imbalance computation to take into account of the
> > above considerations.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 37 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f636d6c09dc6..f491b94908bf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> >  	return false;
> >  }
> > 
> > +static inline long sibling_imbalance(struct lb_env *env,
> > +				    struct sd_lb_stats *sds,
> > +				    struct sg_lb_stats *busiest,
> > +				    struct sg_lb_stats *local)
> > +{
> > +	int ncores_busiest, ncores_local;
> > +	long imbalance;
> 
> can imbalance be unsigned int or unsigned long? as sum_nr_running is unsigned int.

It could be made unsigned long. 

> 
> > +
> > +	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> > +		return 0;
> > +
> > +	ncores_busiest = sds->busiest->cores;
> > +	ncores_local = sds->local->cores;
> > +
> > +	if (ncores_busiest == ncores_local) {
> > +		imbalance = busiest->sum_nr_running;
> > +		lsub_positive(&imbalance, local->sum_nr_running);
> > +		return imbalance;
> > +	}
> > +
> > +	/* Balance such that nr_running/ncores ratio are same on both groups */
> > +	imbalance = ncores_local * busiest->sum_nr_running;
> > +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
> > +	/* Normalize imbalance and do rounding on normalization */
> > +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
> > +	imbalance /= ncores_local + ncores_busiest;
> > +
> 
> Could this work for case where number of CPU/cores would differ
> between two sched groups in a sched domain? 
> 

Yes.  That's the problem I was targeting. 

> Such as problem pointed
> by tobias on S390. It would be nice if this patch can work for that case 
> as well. Ran numbers for a few cases. It looks to work.
> https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901
> 
> 
> 
> > +	/* Take advantage of resource in an empty sched group */
> > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	    busiest->sum_nr_running > 1)
> > +		imbalance = 2;
> > +
> 
> I don't see how this case would be true. When there are unequal number of cores and local->sum_nr_ruuning 
> is 0, and busiest->sum_nr_running is atleast 2, imbalance will be atleast 1. 
> 
> 
> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Thanks for the review.

> 
> > +	return imbalance;
> > +}
> > +
> >  static inline bool
> >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> >  {
> > @@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		}
> > 
> >  		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > -			unsigned int nr_diff = busiest->sum_nr_running;
> >  			/*
> >  			 * When prefer sibling, evenly spread running tasks on
> >  			 * groups.
> >  			 */
> >  			env->migration_type = migrate_task;
> > -			lsub_positive(&nr_diff, local->sum_nr_running);
> > -			env->imbalance = nr_diff;
> > +			env->imbalance = sibling_imbalance(env, sds, busiest, local);
> >  		} else {
> > 
> >  			/*
> > @@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >  	 * group's child domain.
> >  	 */
> >  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> > +	    sibling_imbalance(env, &sds, busiest, local) > 1)
> >  		goto force_balance;
> > 
> >  	if (busiest->group_type != group_overloaded) {
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
  2023-07-14 13:02   ` Shrikanth Hegde
@ 2023-07-14 22:16     ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-14 22:16 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ricardo Neri, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton,
	Peter Zijlstra

On Fri, 2023-07-14 at 18:32 +0530, Shrikanth Hegde wrote:
> 
> 
> 
> Tried on a symmetric system with all cores having SMT=4 as well. There was reduction in migrations here as well.
> Didnt observe any major regressions when microbenchmarks run alone. Such as hackbench, stress-ng. 
> 
> So. Here is tested-by. 
> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Thanks for testing.  

> 
> 
> > +
> >  	/* Are we the first CPU of this group ? */
> >  	return group_balance_cpu(sg) == env->dst_cpu;
> >  }
> 
> One doubt though, Here a fully idle core would be chosen instead of first idle cpu in the 
> group (if there is one). Since coming out of idle of SMT is faster compared to a fully idle core,
> would latency increase? Or that concerns mainly wakeup path?

Yeah, I think that concern is for the wakeup path and not for the balance path.

Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-14 13:06   ` Shrikanth Hegde
@ 2023-07-14 23:05     ` Tim Chen
  2023-07-15 18:25       ` Tim Chen
                         ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-14 23:05 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Peter Zijlstra

On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:

> 
> 
> If we consider symmetric platforms which have SMT4 such as power10. 
> we have a topology like below. multiple such MC will form DIE(PKG)
> 
> 
> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
> [--SMT--][--SMT--][----SMT---][---SMT----]
> [--sg1--][--sg1--][---sg1----][---sg1----]
> [--------------MC------------------------]
> 
> In case of SMT4, if there is any group which has 2 or more tasks, that 
> group will be marked as group_smt_balance. previously, if that group had 2
> or 3 tasks, it would have been marked as group_has_spare. Since all the groups have 
> SMT that means behavior would be same fully busy right? That can cause some 
> corner cases. No?

You raised a good point. I was looking at this from the SMT2
perspective, where group_smt_balance implies group_fully_busy.
That is no longer true for SMT4.

I am thinking of the following fix on top of the current patch
to take care of SMT4. Do you think this addresses
the concerns from you and Tobias?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 294a662c9410..3fc8d3a3bd22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
                break;
 
        case group_smt_balance:
+               /* no idle cpus on both groups handled by group_fully_busy below */
+               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+                       if (sgs->idle_cpus > busiest->idle_cpus)
+                               return false;
+                       if (sgs->idle_cpus < busiest->idle_cpus)
+                               return true;
+                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
+                               return false;
+                       else
+                               return true;
+               }


I will be on vacation for the next three weeks, so my response will be slow.

Tim

> 
> One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is trying to do 
> load balance. Previously imbalance would have been 2, instead now imbalance would be 1.
> But in subsequent lb it would be balanced.
> 
> 
> 
> > +	return false;
> > +}
> > +
> >  static inline bool
> >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> >  {
> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  		sgs->group_asym_packing = 1;
> >  	}
> > 
> > +	/* Check for loaded SMT group to be balanced to dst CPU */
> > +	if (!local_group && smt_balance(env, sgs, group))
> > +		sgs->group_smt_balance = 1;
> > +
> >  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
> > 
> >  	/* Computing avg_load makes sense only when group is overloaded */
> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  			return false;
> >  		break;
> > 
> > +	case group_smt_balance:
> >  	case group_fully_busy:
> >  		/*
> >  		 * Select the fully busy group with highest avg_load. In
> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		break;
> > 
> >  	case group_has_spare:
> > +		/*
> > +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
> > +		 * as we do not want to pull task off SMT core with one task
> > +		 * and make the core idle.
> > +		 */
> > +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> > +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> > +				return false;
> > +			else
> > +				return true;> +		}
> > +
> >  		/*
> >  		 * Select not overloaded group with lowest number of idle cpus
> >  		 * and highest number of running tasks. We could also compare
> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > 
> >  	case group_imbalanced:
> >  	case group_asym_packing:
> > +	case group_smt_balance:
> >  		/* Those types are not used in the slow wakeup path */
> >  		return false;
> > 
> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > 
> >  	case group_imbalanced:
> >  	case group_asym_packing:
> > +	case group_smt_balance:
> >  		/* Those type are not used in the slow wakeup path */
> >  		return NULL;
> > 
> > @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		return;
> >  	}
> > 
> > +	if (busiest->group_type == group_smt_balance) {
> > +		/* Reduce number of tasks sharing CPU capacity */
> > +		env->migration_type = migrate_task;
> > +		env->imbalance = 1;
> > +		return;
> > +	}
> > +
> >  	if (busiest->group_type == group_imbalanced) {
> >  		/*
> >  		 * In the group_imb case we cannot rely on group-wide averages
> > @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >  		goto force_balance;
> > 
> >  	if (busiest->group_type != group_overloaded) {
> > -		if (env->idle == CPU_NOT_IDLE)
> > +		if (env->idle == CPU_NOT_IDLE) {
> >  			/*
> >  			 * If the busiest group is not overloaded (and as a
> >  			 * result the local one too) but this CPU is already
> >  			 * busy, let another idle CPU try to pull task.
> >  			 */
> >  			goto out_balanced;
> > +		}
> > +
> > +		if (busiest->group_type == group_smt_balance &&
> > +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> > +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
> > +			goto force_balance;
> > +		}
> > 
> >  		if (busiest->group_weight > 1 &&
> > -		    local->idle_cpus <= (busiest->idle_cpus + 1))
> > +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
> >  			/*
> >  			 * If the busiest group is not overloaded
> >  			 * and there is no imbalance between this and busiest
> > @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >  			 * there is more than 1 CPU per group.
> >  			 */
> >  			goto out_balanced;
> > +		}
> > 
> > -		if (busiest->sum_h_nr_running == 1)
> > +		if (busiest->sum_h_nr_running == 1) {
> >  			/*
> >  			 * busiest doesn't have any tasks waiting to run
> >  			 */
> >  			goto out_balanced;
> > +		}
> >  	}
> > 
> >  force_balance:


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-14 20:44     ` Tim Chen
@ 2023-07-14 23:23       ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-14 23:23 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Tobias Huschle

On Fri, 2023-07-14 at 13:44 -0700, Tim Chen wrote:
> > > 
> > > +static inline long sibling_imbalance(struct lb_env *env,
> > > +				    struct sd_lb_stats *sds,
> > > +				    struct sg_lb_stats *busiest,
> > > +				    struct sg_lb_stats *local)
> > > +{
> > > +	int ncores_busiest, ncores_local;
> > > +	long imbalance;
> > 
> > can imbalance be unsigned int or unsigned long? as sum_nr_running is unsigned int.
> 
> It could be made unsigned long. 
> 
> 
Though in theory the imbalance can be either positive or negative, we are
considering only positive imbalance here, as we only pull tasks to the local
group and do not push tasks from the local group.
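
For completeness, the clamping this relies on boils down to something like
the following user-space approximation of lsub_positive() (the kernel macro
operates in place on the referenced variable rather than returning a value):

static inline unsigned long sub_positive_sketch(unsigned long x, unsigned long y)
{
	/* subtract, but never go below zero */
	return x > y ? x - y : 0;
}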

Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-14 14:53   ` Tobias Huschle
@ 2023-07-14 23:29     ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-14 23:29 UTC (permalink / raw)
  To: Tobias Huschle
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Shrikanth Hegde,
	Srikar Dronamraju, naveen.n.rao, Yicong Yang, Barry Song,
	Chen Yu, Hillf Danton


> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 87317634fab2..f636d6c09dc6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8279,6 +8279,11 @@ enum group_type {
> >  	 * more powerful CPU.
> >  	 */
> >  	group_misfit_task,
> > +	/*
> > +	 * Balance SMT group that's fully busy. Can benefit from migration
> > +	 * a task on SMT with busy sibling to another CPU on idle core.
> > +	 */
> > +	group_smt_balance,
> 
> Would it make sense to move smt_balance?, s.t. we get:
> 
> group_has_spare < group_smt_balance < group_fully_busy
> 
> Conceptually I would be more intuitive to me like this as the
> smt_balance groups are more busy than has_spare ones, but less busy
> then fully_busy ones.
> 
>  From a functional perspective I could also see some impact when
> update_sd_pick_busiest compares the group types. In that case we
> would remove tasks from fully busy groups before moving them
> from smt_balance groups. Not sure which way would be to prefer
> to increase overall throughput.
> 
> Since smt_balance is only selected if the group has SMT, this
> should still not pull the last task off of a non-SMT CPU.
> 
> 

I think you have similar concerns to Shrikanth's on this patch.
Can you see if my fix to update_sd_pick_busiest() in my reply
to Shrikanth addresses what you have in mind?

Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-14 14:22     ` Tobias Huschle
@ 2023-07-14 23:35       ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-14 23:35 UTC (permalink / raw)
  To: Tobias Huschle, Shrikanth Hegde
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Fri, 2023-07-14 at 16:22 +0200, Tobias Huschle wrote:
> 
> > 
> > Could this work for the case where the number of CPUs/cores differs
> > between two sched groups in a sched domain, such as the problem pointed
> > out by Tobias on s390? It would be nice if this patch could work for that
> > case as well. Ran numbers for a few cases; it looks to work.
> > https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901
> > 
> 
> 
> Just stumbled upon this patch series as well. In this version it looks
> similar to the prototypes I played around with, but more complete.
> So I'm happy that my understanding of the load balancer was kinda 
> correct :)
> 
>  From a functional perspective this appears to address the issues we saw 
> on s390.

Glad that this patch addresses this common issue across architectures.
I did aim to address asymmetric group balancing in general.
Peter pointed out this inherent problem when he reviewed the first
version of my patchset.

Tim

> 
> > 
> > 
> > > +	/* Take advantage of resource in an empty sched group */
> > > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > +	    busiest->sum_nr_running > 1)
> > > +		imbalance = 2;
> > > +
> > 
> > I don't see how this case would be true. When there are unequal numbers
> > of cores and local->sum_nr_running is 0, and busiest->sum_nr_running is
> > at least 2, the imbalance will be at least 1.
> > 
> > 
> > Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> > 
> > > +	return imbalance;
> > > +}
> > > +
> > >  static inline bool
> > >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> > >  {
> > > @@ -10230,14 +10265,12 @@ static inline void 
> > > calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		}
> > > 
> > >  		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > > -			unsigned int nr_diff = busiest->sum_nr_running;
> > >  			/*
> > >  			 * When prefer sibling, evenly spread running tasks on
> > >  			 * groups.
> > >  			 */
> > >  			env->migration_type = migrate_task;
> > > -			lsub_positive(&nr_diff, local->sum_nr_running);
> > > -			env->imbalance = nr_diff;
> > > +			env->imbalance = sibling_imbalance(env, sds, busiest, local);
> > >  		} else {
> > > 
> > >  			/*
> > > @@ -10424,7 +10457,7 @@ static struct sched_group 
> > > *find_busiest_group(struct lb_env *env)
> > >  	 * group's child domain.
> > >  	 */
> > >  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +	    sibling_imbalance(env, &sds, busiest, local) > 1)
> > >  		goto force_balance;
> > > 
> > >  	if (busiest->group_type != group_overloaded) {


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
  2023-07-14 13:14   ` Shrikanth Hegde
  2023-07-14 14:22     ` Tobias Huschle
  2023-07-14 20:44     ` Tim Chen
@ 2023-07-15  0:11     ` Tim Chen
  2 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-15  0:11 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Tobias Huschle

On Fri, 2023-07-14 at 18:44 +0530, Shrikanth Hegde wrote:
> 
> > +	/* Take advantage of resource in an empty sched group */
> > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	    busiest->sum_nr_running > 1)
> > +		imbalance = 2;
> > +
> 
> I don't see how this case would be true. When there are unequal numbers of cores and local->sum_nr_running
> is 0, and busiest->sum_nr_running is at least 2, the imbalance will be at least 1.

I think you are correct.  With at least 2 tasks in the busiest group,
the imbalance will be at least 1.  This is the effect of doing rounding
when adding the (ncores_local + ncores_busiest) rounding factor.

Returning an imbalance value of 1 will not be correct, as we
will be dividing the imbalance by 2 and we will still not move a task
to the empty group as intended.
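
As a stand-alone illustration of why the cut-off has to be "<= 1" (toy code,
not the kernel function):

/*
 * Toy restatement of the fix below: if the normalized value floors to 1,
 * the "sibling_imbalance(...) > 1" check in find_busiest_group() (quoted
 * earlier in the thread) would not force a balance, so an empty local
 * group facing a busiest group with 2 or more tasks is bumped to 2.
 */
static long toy_empty_group_fixup(long imbalance, unsigned int local_nr_running,
				  unsigned int busiest_nr_running)
{
	if (imbalance <= 1 && local_nr_running == 0 && busiest_nr_running > 1)
		imbalance = 2;

	return imbalance;
}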

So this code should be updated as below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3fc8d3a3bd22..16bf75e6a775 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9400,7 +9400,7 @@ static inline long sibling_imbalance(struct lb_env *env,
        imbalance /= ncores_local + ncores_busiest;
 
        /* Take advantage of resource in an empty sched group */
-       if (imbalance == 0 && local->sum_nr_running == 0 &&
+       if (imbalance <= 1 && local->sum_nr_running == 0 &&
            busiest->sum_nr_running > 1)
                imbalance = 2;
 

Tim



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-14 23:05     ` Tim Chen
@ 2023-07-15 18:25       ` Tim Chen
  2023-07-16 19:36       ` Shrikanth Hegde
  2023-07-18  6:07       ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tobias Huschle
  2 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-15 18:25 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Peter Zijlstra


> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 294a662c9410..3fc8d3a3bd22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                 break;
>  
>         case group_smt_balance:
> +               /* no idle cpus on both groups handled by group_fully_busy below */
> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +                       if (sgs->idle_cpus > busiest->idle_cpus)
> +                               return false;
> +                       if (sgs->idle_cpus < busiest->idle_cpus)
> +                               return true;
> +                       if (sgs->sum_nr_running <= busiest_sum_nr_running)

typo: should be busiest->sum_nr_running

> +                               return false;
> +                       else
> +                               return true;
> +               }
> 

Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-14 23:05     ` Tim Chen
  2023-07-15 18:25       ` Tim Chen
@ 2023-07-16 19:36       ` Shrikanth Hegde
  2023-07-17 11:10         ` Peter Zijlstra
  2023-07-18  6:07       ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tobias Huschle
  2 siblings, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-16 19:36 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Peter Zijlstra



On 7/15/23 4:35 AM, Tim Chen wrote:
> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
> 
>>
>>
>> If we consider symmetric platforms which have SMT4 such as power10. 
>> we have a topology like below. multiple such MC will form DIE(PKG)
>>
>>
>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>> [--SMT--][--SMT--][----SMT---][---SMT----]
>> [--sg1--][--sg1--][---sg1----][---sg1----]
>> [--------------MC------------------------]
>>
>> In case of SMT4, if there is any group which has 2 or more tasks, that
>> group will be marked as group_smt_balance. Previously, if that group had 2
>> or 3 tasks, it would have been marked as group_has_spare. Since all the groups
>> have SMT, that means the behavior would be the same as fully busy, right? That
>> can cause some corner cases, no?
> 
> You raised a good point. I was looking at it from the SMT2
> perspective, where group_smt_balance implies group_fully_busy.
> That is no longer true for SMT4.
> 
> I am thinking of the following fix on the current patch
> to take care of SMT4. Do you think this addresses

Thanks Tim for taking a look at it again. 

Yes, I think this would address some of the corner cases.
Any SMT4 group having 2, 3 or 4 tasks will have smt_balance as the group type, and the busiest one
is the one which has the least number of idle CPUs (same conditions as group_has_spare).




> concerns from you and Tobias?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 294a662c9410..3fc8d3a3bd22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                 break;
>  
>         case group_smt_balance:
> +               /* no idle cpus on both groups handled by group_fully_busy below */
> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +                       if (sgs->idle_cpus > busiest->idle_cpus)
> +                               return false;
> +                       if (sgs->idle_cpus < busiest->idle_cpus)
> +                               return true;
> +                       if (sgs->sum_nr_running <= busiest_sum_nr_running)
> +                               return false;
> +                       else
> +                               return true;
> +               }
> 
> 
> I will be on vacation next three weeks so my response will be slow.
> 
> Tim
> 
>>

A small suggestion to the above code to avoid a compiler warning about the switch case falling
through; also, the else case can be removed, since update_sd_pick_busiest() returns true by default.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5a75c76bcaa..ae364ac6f22e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9728,9 +9728,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
                                return true;
                        if (sgs->sum_nr_running <= busiest->sum_nr_running)
                                return false;
-                       else
-                               return true;
                }
+               break;
+
        case group_fully_busy:
                /*
                 * Select the fully busy group with highest avg_load. In
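
For reference, a tiny stand-alone example of the warning this avoids (assuming a
compiler option like -Wimplicit-fallthrough, as the kernel build uses; this is an
illustration, not kernel code):

static int toy_classify(int v)
{
	int r = 0;

	switch (v) {
	case 1:
		r = 10;
		break;		/* dropping this break would trigger the fallthrough warning */
	case 2:
		r = 20;
		break;
	}

	return r;
}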



>> One example: let's say sg1 has 4 tasks and sg2 has 0 tasks and is trying to do
>> load balance. Previously the imbalance would have been 2; now the imbalance would be 1.
>> But in a subsequent lb it would be balanced.
>>
>>
>>
>>> +	return false;
>>> +}
>>> +
>>>  static inline bool
>>>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>>  {
>>> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>  		sgs->group_asym_packing = 1;
>>>  	}
>>>
>>> +	/* Check for loaded SMT group to be balanced to dst CPU */
>>> +	if (!local_group && smt_balance(env, sgs, group))
>>> +		sgs->group_smt_balance = 1;
>>> +
>>>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>>>
>>>  	/* Computing avg_load makes sense only when group is overloaded */
>>> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  			return false;
>>>  		break;
>>>
>>> +	case group_smt_balance:
>>>  	case group_fully_busy:
>>>  		/*
>>>  		 * Select the fully busy group with highest avg_load. In
>>> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  		break;
>>>
>>>  	case group_has_spare:
>>> +		/*
>>> +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
>>> +		 * as we do not want to pull task off SMT core with one task
>>> +		 * and make the core idle.
>>> +		 */
>>> +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>>> +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
>>> +				return false;
>>> +			else
>>> +				return true;> +		}
>>> +
>>>  		/*
>>>  		 * Select not overloaded group with lowest number of idle cpus
>>>  		 * and highest number of running tasks. We could also compare
>>> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
>>>
>>>  	case group_imbalanced:
>>>  	case group_asym_packing:
>>> +	case group_smt_balance:
>>>  		/* Those types are not used in the slow wakeup path */
>>>  		return false;
>>>
>>> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>
>>>  	case group_imbalanced:
>>>  	case group_asym_packing:
>>> +	case group_smt_balance:
>>>  		/* Those type are not used in the slow wakeup path */
>>>  		return NULL;
>>>
>>> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>  		return;
>>>  	}
>>>
>>> +	if (busiest->group_type == group_smt_balance) {
>>> +		/* Reduce number of tasks sharing CPU capacity */
>>> +		env->migration_type = migrate_task;
>>> +		env->imbalance = 1;
>>> +		return;
>>> +	}
>>> +
>>>  	if (busiest->group_type == group_imbalanced) {
>>>  		/*
>>>  		 * In the group_imb case we cannot rely on group-wide averages
>>> @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>>  		goto force_balance;
>>>
>>>  	if (busiest->group_type != group_overloaded) {
>>> -		if (env->idle == CPU_NOT_IDLE)
>>> +		if (env->idle == CPU_NOT_IDLE) {
>>>  			/*
>>>  			 * If the busiest group is not overloaded (and as a
>>>  			 * result the local one too) but this CPU is already
>>>  			 * busy, let another idle CPU try to pull task.
>>>  			 */
>>>  			goto out_balanced;
>>> +		}
>>> +
>>> +		if (busiest->group_type == group_smt_balance &&
>>> +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>>> +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
>>> +			goto force_balance;
>>> +		}
>>>
>>>  		if (busiest->group_weight > 1 &&
>>> -		    local->idle_cpus <= (busiest->idle_cpus + 1))
>>> +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
>>>  			/*
>>>  			 * If the busiest group is not overloaded
>>>  			 * and there is no imbalance between this and busiest
>>> @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>>  			 * there is more than 1 CPU per group.
>>>  			 */
>>>  			goto out_balanced;
>>> +		}
>>>
>>> -		if (busiest->sum_h_nr_running == 1)
>>> +		if (busiest->sum_h_nr_running == 1) {
>>>  			/*
>>>  			 * busiest doesn't have any tasks waiting to run
>>>  			 */
>>>  			goto out_balanced;
>>> +		}
>>>  	}
>>>
>>>  force_balance:
> 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-16 19:36       ` Shrikanth Hegde
@ 2023-07-17 11:10         ` Peter Zijlstra
  2023-07-17 12:18           ` Shrikanth Hegde
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-07-17 11:10 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Tim Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Mon, Jul 17, 2023 at 01:06:59AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 7/15/23 4:35 AM, Tim Chen wrote:
> > On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
> > 
> >>
> >>
> >> If we consider symmetric platforms which have SMT4 such as power10. 
> >> we have a topology like below. multiple such MC will form DIE(PKG)
> >>
> >>
> >> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
> >> [--SMT--][--SMT--][----SMT---][---SMT----]
> >> [--sg1--][--sg1--][---sg1----][---sg1----]
> >> [--------------MC------------------------]
> >>
> >> In case of SMT4, if there is any group which has 2 or more tasks, that 
> >> group will be marked as group_smt_balance. previously, if that group had 2
> >> or 3 tasks, it would have been marked as group_has_spare. Since all the groups have 
> >> SMT that means behavior would be same fully busy right? That can cause some 
> >> corner cases. No?
> > 
> > You raised a good point. I was looking from SMT2
> > perspective so group_smt_balance implies group_fully_busy.
> > That is no longer true for SMT4.
> > 
> > I am thinking of the following fix on the current patch
> > to take care of SMT4. Do you think this addresses
> 
> Thanks Tim for taking a look at it again. 
> 
> Yes. I think this would address some of the corner cases. 
> Any SMT4 group having 2,3,4 will have smt_balance as the group type, and busiest one 
> is the one which has least number of idle cpu's. (same conditions as group_has_spare)
> 
> 
> 
> 
> > concerns from you and Tobias?
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 294a662c9410..3fc8d3a3bd22 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >                 break;
> >  
> >         case group_smt_balance:
> > +               /* no idle cpus on both groups handled by group_fully_busy below */
> > +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > +                       if (sgs->idle_cpus > busiest->idle_cpus)
> > +                               return false;
> > +                       if (sgs->idle_cpus < busiest->idle_cpus)
> > +                               return true;
> > +                       if (sgs->sum_nr_running <= busiest_sum_nr_running)
> > +                               return false;
> > +                       else
> > +                               return true;
> > +               }
> > 
> > 
> > I will be on vacation next three weeks so my response will be slow.
> > 
> > Tim
> > 
> >>
> 
> Small suggestion to above code to avoid compiler warning of switch case falling
> through and else case can be removed, since update_sd_pick_busiest by default returns true.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e5a75c76bcaa..ae364ac6f22e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9728,9 +9728,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                                 return true;
>                         if (sgs->sum_nr_running <= busiest->sum_nr_running)
>                                 return false;
> -                       else
> -                               return true;
>                 }
> +               break;
> +
>         case group_fully_busy:
>                 /*
>                  * Select the fully busy group with highest avg_load. In
> 
> 

Can someone please send a full patch for this? I've already queued Tim's
patches in tip/sched/core (tip-bot seems to have died somewhere last
week, it's being worked on).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-17 11:10         ` Peter Zijlstra
@ 2023-07-17 12:18           ` Shrikanth Hegde
  2023-07-17 13:37             ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-17 12:18 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton



On 7/17/23 4:40 PM, Peter Zijlstra wrote:
> On Mon, Jul 17, 2023 at 01:06:59AM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 7/15/23 4:35 AM, Tim Chen wrote:
>>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>>
>>>>
>>>>
>>>> If we consider symmetric platforms which have SMT4 such as power10. 
>>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>>
>>>>
>>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>>> [--------------MC------------------------]
>>>>
>>>> In case of SMT4, if there is any group which has 2 or more tasks, that 
>>>> group will be marked as group_smt_balance. previously, if that group had 2
>>>> or 3 tasks, it would have been marked as group_has_spare. Since all the groups have 
>>>> SMT that means behavior would be same fully busy right? That can cause some 
>>>> corner cases. No?
>>>
>>> You raised a good point. I was looking from SMT2
>>> perspective so group_smt_balance implies group_fully_busy.
>>> That is no longer true for SMT4.
>>>
>>> I am thinking of the following fix on the current patch
>>> to take care of SMT4. Do you think this addresses
>>
>> Thanks Tim for taking a look at it again. 
>>
>> Yes. I think this would address some of the corner cases. 
>> Any SMT4 group having 2,3,4 will have smt_balance as the group type, and busiest one 
>> is the one which has least number of idle cpu's. (same conditions as group_has_spare)
>>
>>
>>
>>
>>> concerns from you and Tobias?
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 294a662c9410..3fc8d3a3bd22 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>                 break;
>>>  
>>>         case group_smt_balance:
>>> +               /* no idle cpus on both groups handled by group_fully_busy below */
>>> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> +                       if (sgs->idle_cpus > busiest->idle_cpus)
>>> +                               return false;
>>> +                       if (sgs->idle_cpus < busiest->idle_cpus)
>>> +                               return true;
>>> +                       if (sgs->sum_nr_running <= busiest_sum_nr_running)
>>> +                               return false;
>>> +                       else
>>> +                               return true;
>>> +               }
>>>
>>>
>>> I will be on vacation next three weeks so my response will be slow.
>>>
>>> Tim
>>>
>>>>
>>
>> Small suggestion to above code to avoid compiler warning of switch case falling
>> through and else case can be removed, since update_sd_pick_busiest by default returns true.
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e5a75c76bcaa..ae364ac6f22e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9728,9 +9728,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                                 return true;
>>                         if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>                                 return false;
>> -                       else
>> -                               return true;
>>                 }
>> +               break;
>> +
>>         case group_fully_busy:
>>                 /*
>>                  * Select the fully busy group with highest avg_load. In
>>
>>
> 
> Can someone please send a full patch for this? I've already queued Tim's
> patches in tip/sched/core (tip-bot seems to have died somewhere last
> week, it's being worked on).

Hi Peter. 

Sending on behalf of Tim. I have included my suggestion as well; hope that's OK.
Please find below the patch as of now. It includes the couple of changes that were discussed (in 1/6 and in 3/6).


---
 kernel/sched/fair.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 932e7b78894a..9502013abe33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/* no idle cpus on both groups handled by group_fully_busy below */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+			if (sgs->idle_cpus > busiest->idle_cpus)
+				return false;
+			if (sgs->idle_cpus < busiest->idle_cpus)
+				return true;
+			if (sgs->sum_nr_running <= busiest->sum_nr_running)
+				return false;
+		}
+		break;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-17 12:18           ` Shrikanth Hegde
@ 2023-07-17 13:37             ` Peter Zijlstra
  2023-07-17 14:58               ` [PATCH] sched/fair: Add SMT4 group_smt_balance handling Shrikanth Hegde
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-07-17 13:37 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Tim Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton

On Mon, Jul 17, 2023 at 05:48:02PM +0530, Shrikanth Hegde wrote:

> Hi Peter. 
> 
> Sending on behalf of tim. I have included my suggestion as well. Hope
> that's ok.  Please find below the patch as of now. it includes the
> couple of changes that are discussed. (in 1/6 and in 3/6)

Could you please add a Changelog and SoB thingies such that I can apply
the thing?

Given Tim is on holidays, perhaps do something like:

Originally-by: Tim Chen <...>

After all, you did some changes and verified it actually works etc..


> ---
>  kernel/sched/fair.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 932e7b78894a..9502013abe33 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>  	imbalance /= ncores_local + ncores_busiest;
>  
>  	/* Take advantage of resource in an empty sched group */
> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>  	    busiest->sum_nr_running > 1)
>  		imbalance = 2;
>  
> @@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
>  
>  	case group_smt_balance:
> +		/* no idle cpus on both groups handled by group_fully_busy below */
> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +			if (sgs->idle_cpus > busiest->idle_cpus)
> +				return false;
> +			if (sgs->idle_cpus < busiest->idle_cpus)
> +				return true;
> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> +				return false;
> +		}
> +		break;
> +
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-07-17 13:37             ` Peter Zijlstra
@ 2023-07-17 14:58               ` Shrikanth Hegde
  2023-07-27  3:11                 ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-17 14:58 UTC (permalink / raw)
  To: peterz, tim.c.chen
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, sshegde, v-songbaohua, vincent.guittot,
	vschneid, x86, yangyicong, yu.c.chen

From: Tim Chen <tim.c.chen@linux.intel.com>

For SMT4, any group with 2 or more tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle the rounding effect of adding (ncores_local + ncores_busiest)
when the local group is fully idle and the busiest group has 2 or more tasks.
The local group should try to pull at least 1 task in this case.

Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 932e7b78894a..9502013abe33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;

 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;

@@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;

 	case group_smt_balance:
+		/* no idle cpus on both groups handled by group_fully_busy below */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+			if (sgs->idle_cpus > busiest->idle_cpus)
+				return false;
+			if (sgs->idle_cpus < busiest->idle_cpus)
+				return true;
+			if (sgs->sum_nr_running <= busiest->sum_nr_running)
+				return false;
+		}
+		break;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
--
2.31.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-14 23:05     ` Tim Chen
  2023-07-15 18:25       ` Tim Chen
  2023-07-16 19:36       ` Shrikanth Hegde
@ 2023-07-18  6:07       ` Tobias Huschle
  2023-07-18 14:52         ` Shrikanth Hegde
  2 siblings, 1 reply; 55+ messages in thread
From: Tobias Huschle @ 2023-07-18  6:07 UTC (permalink / raw)
  To: Tim Chen
  Cc: Shrikanth Hegde, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton,
	Peter Zijlstra

On 2023-07-15 01:05, Tim Chen wrote:
> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
> 
>> 
>> 
>> If we consider symmetric platforms which have SMT4 such as power10.
>> we have a topology like below. multiple such MC will form DIE(PKG)
>> 
>> 
>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>> [--SMT--][--SMT--][----SMT---][---SMT----]
>> [--sg1--][--sg1--][---sg1----][---sg1----]
>> [--------------MC------------------------]
>> 
>> In case of SMT4, if there is any group which has 2 or more tasks, that
>> group will be marked as group_smt_balance. previously, if that group 
>> had 2
>> or 3 tasks, it would have been marked as group_has_spare. Since all 
>> the groups have
>> SMT that means behavior would be same fully busy right? That can cause 
>> some
>> corner cases. No?
> 
> You raised a good point. I was looking from SMT2
> perspective so group_smt_balance implies group_fully_busy.
> That is no longer true for SMT4.
> 
> I am thinking of the following fix on the current patch
> to take care of SMT4. Do you think this addresses
> concerns from you and Tobias?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 294a662c9410..3fc8d3a3bd22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env 
> *env,
>                 break;
> 
>         case group_smt_balance:
> +               /* no idle cpus on both groups handled by
> group_fully_busy below */
> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +                       if (sgs->idle_cpus > busiest->idle_cpus)
> +                               return false;
> +                       if (sgs->idle_cpus < busiest->idle_cpus)
> +                               return true;
> +                       if (sgs->sum_nr_running <= 
> busiest_sum_nr_running)
> +                               return false;
> +                       else
> +                               return true;
> +               }
> 
> 
> I will be on vacation next three weeks so my response will be slow.
> 
> Tim
> 

What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.

[0 1][2 3 4 5]
[SMT][--SMT--]

If now CPUs 0, 2 and 3 have a running task, both groups would be classified as
smt_balance. But when it comes to the selection of the busiest group, the smaller
group would be selected, as it has fewer idle CPUs, right? Which could lead to the
smaller group being left with no tasks. Using the absolute number of tasks is what
made the prefer_sibling path problematic; I would assume that the same holds true
here. Therefore, I would prefer avg_load or, similar to prefer_sibling, a ratio
over the number of cores.
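
Purely to make the ratio idea concrete, a hypothetical helper (illustrative only,
not part of the patch) could compare the two groups by tasks per core, or per CPU,
instead of by absolute idle counts:

#include <stdbool.h>

/*
 * Toy comparison: group A counts as busier than group B only if it runs
 * more tasks per unit of width (cores or CPUs), using cross-multiplication
 * to stay in integer arithmetic.  Names and signature are made up.
 */
static bool toy_busier_by_ratio(unsigned int a_nr_running, unsigned int a_width,
				unsigned int b_nr_running, unsigned int b_width)
{
	return (unsigned long)a_nr_running * b_width >
	       (unsigned long)b_nr_running * a_width;
}

In the [0 1][2 3 4 5] example above, with CPUs 0, 2 and 3 busy, that comparison
gives 1/2 vs. 2/4, so the smaller group is no longer automatically treated as the
busier one.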

I can't really test that on s390 as we always have SMT2. But we can have these
asymmetries on higher levels, e.g.

[0 1][2 3][4 5][6 7][8 9]
[SMT][SMT][SMT][SMT][SMT]
[-----core----][--core--]

For large configurations this can be true for even higher levels.
Therefore, the idea was to move the smt_balance state around and adapt its
conditions to something like this (which would require reordering the commits):

@@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
 enum group_type {
        /* The group has spare capacity that can be used to run more tasks.  */
        group_has_spare = 0,
+       /*
+        * Balance SMT group that's fully busy. Can benefit from migration
+        * a task on SMT with busy sibling to another CPU on idle core.
+        */
+       group_smt_balance,
        /*
         * The group is fully used and the tasks don't compete for more CPU
         * cycles. Nevertheless, some tasks might wait before running.
@@ -8340,11 +8345,6 @@ enum group_type {
         * more powerful CPU.
         */
        group_misfit_task,
-       /*
-        * Balance SMT group that's fully busy. Can benefit from migration
-        * a task on SMT with busy sibling to another CPU on idle core.
-        */
-       group_smt_balance,
        /*
         * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
         * and the task should be migrated to it instead of running on the
@@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int imbalance_pct,
        if (sgs->group_asym_packing)
                return group_asym_packing;

-       if (sgs->group_smt_balance)
-               return group_smt_balance;
-
        if (sgs->group_misfit_task_load)
                return group_misfit_task;

        if (!group_has_capacity(imbalance_pct, sgs))
                return group_fully_busy;

+       if (sgs->group_smt_balance)
+               return group_smt_balance;
+
        return group_has_spare;
 }

@@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
         * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
         * will not be on.
         */
-       if (group->flags & SD_SHARE_CPUCAPACITY &&
-           sgs->sum_h_nr_running > 1)
+       if (sgs->sum_h_nr_running > group->cores)
                return true;

        return false;

The s390 problem is currently solved by changing the prefer_sibling path. When
disabling that flag, we might have an issue; will have to verify that though.

>> 
>> One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is 
>> trying to do
>> load balance. Previously imbalance would have been 2, instead now 
>> imbalance would be 1.
>> But in subsequent lb it would be balanced.
>> 
>> 
>> 
>> > +	return false;
>> > +}
>> > +
>> >  static inline bool
>> >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>> >  {
>> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> >  		sgs->group_asym_packing = 1;
>> >  	}
>> >
>> > +	/* Check for loaded SMT group to be balanced to dst CPU */
>> > +	if (!local_group && smt_balance(env, sgs, group))
>> > +		sgs->group_smt_balance = 1;
>> > +
>> >  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>> >
>> >  	/* Computing avg_load makes sense only when group is overloaded */
>> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> >  			return false;
>> >  		break;
>> >
>> > +	case group_smt_balance:
>> >  	case group_fully_busy:
>> >  		/*
>> >  		 * Select the fully busy group with highest avg_load. In
>> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> >  		break;
>> >
>> >  	case group_has_spare:
>> > +		/*
>> > +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
>> > +		 * as we do not want to pull task off SMT core with one task
>> > +		 * and make the core idle.
>> > +		 */
>> > +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>> > +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
>> > +				return false;
>> > +			else
>> > +				return true;> +		}
>> > +
>> >  		/*
>> >  		 * Select not overloaded group with lowest number of idle cpus
>> >  		 * and highest number of running tasks. We could also compare
>> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
>> >
>> >  	case group_imbalanced:
>> >  	case group_asym_packing:
>> > +	case group_smt_balance:
>> >  		/* Those types are not used in the slow wakeup path */
>> >  		return false;
>> >
>> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> >
>> >  	case group_imbalanced:
>> >  	case group_asym_packing:
>> > +	case group_smt_balance:
>> >  		/* Those type are not used in the slow wakeup path */
>> >  		return NULL;
>> >
>> > @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >  		return;
>> >  	}
>> >
>> > +	if (busiest->group_type == group_smt_balance) {
>> > +		/* Reduce number of tasks sharing CPU capacity */
>> > +		env->migration_type = migrate_task;
>> > +		env->imbalance = 1;
>> > +		return;
>> > +	}
>> > +
>> >  	if (busiest->group_type == group_imbalanced) {
>> >  		/*
>> >  		 * In the group_imb case we cannot rely on group-wide averages
>> > @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> >  		goto force_balance;
>> >
>> >  	if (busiest->group_type != group_overloaded) {
>> > -		if (env->idle == CPU_NOT_IDLE)
>> > +		if (env->idle == CPU_NOT_IDLE) {
>> >  			/*
>> >  			 * If the busiest group is not overloaded (and as a
>> >  			 * result the local one too) but this CPU is already
>> >  			 * busy, let another idle CPU try to pull task.
>> >  			 */
>> >  			goto out_balanced;
>> > +		}
>> > +
>> > +		if (busiest->group_type == group_smt_balance &&
>> > +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>> > +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
>> > +			goto force_balance;
>> > +		}
>> >
>> >  		if (busiest->group_weight > 1 &&
>> > -		    local->idle_cpus <= (busiest->idle_cpus + 1))
>> > +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
>> >  			/*
>> >  			 * If the busiest group is not overloaded
>> >  			 * and there is no imbalance between this and busiest
>> > @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> >  			 * there is more than 1 CPU per group.
>> >  			 */
>> >  			goto out_balanced;
>> > +		}
>> >
>> > -		if (busiest->sum_h_nr_running == 1)
>> > +		if (busiest->sum_h_nr_running == 1) {
>> >  			/*
>> >  			 * busiest doesn't have any tasks waiting to run
>> >  			 */
>> >  			goto out_balanced;
>> > +		}
>> >  	}
>> >
>> >  force_balance:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-18  6:07       ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tobias Huschle
@ 2023-07-18 14:52         ` Shrikanth Hegde
  2023-07-19  8:14           ` Tobias Huschle
  0 siblings, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-07-18 14:52 UTC (permalink / raw)
  To: Tobias Huschle, Tim Chen
  Cc: Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V . Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J . Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Ionela Voinescu, x86,
	linux-kernel, Srikar Dronamraju, naveen.n.rao, Yicong Yang,
	Barry Song, Chen Yu, Hillf Danton, Peter Zijlstra



On 7/18/23 11:37 AM, Tobias Huschle wrote:
> On 2023-07-15 01:05, Tim Chen wrote:
>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>
>>>
>>>
>>> If we consider symmetric platforms which have SMT4 such as power10.
>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>
>>>
>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>> [--------------MC------------------------]
>>>
>>> In case of SMT4, if there is any group which has 2 or more tasks, that
>>> group will be marked as group_smt_balance. previously, if that group
>>> had 2
>>> or 3 tasks, it would have been marked as group_has_spare. Since all
>>> the groups have
>>> SMT that means behavior would be same fully busy right? That can
>>> cause some
>>> corner cases. No?
>>
>> You raised a good point. I was looking from SMT2
>> perspective so group_smt_balance implies group_fully_busy.
>> That is no longer true for SMT4.
>>
>> I am thinking of the following fix on the current patch
>> to take care of SMT4. Do you think this addresses
>> concerns from you and Tobias?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 294a662c9410..3fc8d3a3bd22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct
>> lb_env *env,
>>                 break;
>>
>>         case group_smt_balance:
>> +               /* no idle cpus on both groups handled by
>> group_fully_busy below */
>> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>> +                       if (sgs->idle_cpus > busiest->idle_cpus)
>> +                               return false;
>> +                       if (sgs->idle_cpus < busiest->idle_cpus)
>> +                               return true;
>> +                       if (sgs->sum_nr_running <=
>> busiest_sum_nr_running)
>> +                               return false;
>> +                       else
>> +                               return true;
>> +               }
>>
>>
>> I will be on vacation next three weeks so my response will be slow.
>>
>> Tim
>>
> 
> What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.
> 
> [0 1][2 3 4 5]
> [SMT][--SMT--]
> 
> If now CPUs 0,2,3 have a running task, both groups would be classified as
> smt_balance. But if it comes to the selection of the busiest group, the
> smaller
> group would be selected, as it has less idle CPUs, right? Which could lead
> to the smaller group being left with no tasks.
> Using the absolute numbers of task is what made the prefer_sibling path
> problematic,


Yes. But I'm not sure how realistic that configuration is. On power10, we typically
have all cores in either SMT1, SMT2 or SMT4, but not mixed configs.
One can offline CPUs to get into that case with SMT4.

> I would assume that the same holds true here. Therefore, I would prefer
> avg_load,
> or, similar to prefer_siblings, a ratio over the number of cores.
> 
> I can't really test that on s390 as we always have SMT2. But, we can
> have these
> asymmetries on higher levels, e.g


IIUC, on higher levels the group will not have SD_SHARE_CPUCAPACITY, so it shouldn't
run into group_smt_balance.

> 
> [0 1][2 3][4 5][6 7][8 9]
> [SMT][SMT][SMT][SMT][SMT]
> [-----core----][--core--]
> 
> For large configurations this can be true for even higher levels.
> Therefore, the idea was to move the smt_balance state around and adapt its
> conditions to something like this (which would require to reorder the
> commits):
> 
> @@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
>  enum group_type {
>         /* The group has spare capacity that can be used to run more
> tasks.  */
>         group_has_spare = 0,
> +       /*
> +        * Balance SMT group that's fully busy. Can benefit from migration
> +        * a task on SMT with busy sibling to another CPU on idle core.
> +        */
> +       group_smt_balance,
>         /*
>          * The group is fully used and the tasks don't compete for more CPU
>          * cycles. Nevertheless, some tasks might wait before running.
> @@ -8340,11 +8345,6 @@ enum group_type {
>          * more powerful CPU.
>          */
>         group_misfit_task,
> -       /*
> -        * Balance SMT group that's fully busy. Can benefit from migration
> -        * a task on SMT with busy sibling to another CPU on idle core.
> -        */
> -       group_smt_balance,
>         /*
>          * SD_ASYM_PACKING only: One local CPU with higher capacity is
> available,


IIUC, for the cluster topology of this patch, the busiest group should be the SMT core if it has 2
threads running, compared to an Atom cluster having 4 threads. The Atom cluster will be group_fully_busy,
whereas the SMT group will be group_smt_balance. For that to happen, group_smt_balance should have
the higher group_type.
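
A toy paraphrase of why the position in the enum matters (assumed ordering and
names, not the kernel code; roughly, update_sd_pick_busiest() compares the group
types before any per-type tie-breaking):

#include <stdbool.h>

/* Assumed relative ordering, mirroring the point above. */
enum toy_group_type {
	toy_group_has_spare = 0,
	toy_group_fully_busy,
	toy_group_smt_balance,	/* must rank above fully_busy to win over the Atom cluster */
};

/*
 * Toy rule: the candidate with the numerically higher classification is
 * treated as busier; equal classifications fall through to the per-type
 * comparisons discussed in this thread.
 */
static bool toy_picked_over(enum toy_group_type candidate, enum toy_group_type busiest)
{
	return candidate > busiest;
}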

>          * and the task should be migrated to it instead of running on the
> @@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int
> imbalance_pct,
>         if (sgs->group_asym_packing)
>                 return group_asym_packing;
> 
> -       if (sgs->group_smt_balance)
> -               return group_smt_balance;
> -
>         if (sgs->group_misfit_task_load)
>                 return group_misfit_task;
> 
>         if (!group_has_capacity(imbalance_pct, sgs))
>                 return group_fully_busy;
> 
> +       if (sgs->group_smt_balance)
> +               return group_smt_balance;
> +
>         return group_has_spare;
>  }
> 
> @@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env *env,
> struct sg_lb_stats *sgs,
>          * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
>          * will not be on.
>          */
> -       if (group->flags & SD_SHARE_CPUCAPACITY &&
> -           sgs->sum_h_nr_running > 1)
> +       if (sgs->sum_h_nr_running > group->cores)

In the case of Power10, where we have SMT4, group->cores will be 1. I don't see
a difference here.

>                 return true;
> 
>         return false;
> 
> The s390 problem is currently solved by changing the prefer_sibling
> path. When
> disabling that flag, we might have an issue, will have to verify that
> though.
> 
>>>
>>> One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is
>>> trying to do
>>> load balance. Previously imbalance would have been 2, instead now
>>> imbalance would be 1.
>>> But in subsequent lb it would be balanced.
>>>
>>>
>>>
>>> > +    return false;
>>> > +}
>>> > +
>>> >  static inline bool
>>> >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>> >  {
>>> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct
>>> lb_env *env,
>>> >          sgs->group_asym_packing = 1;
>>> >      }
>>> >
>>> > +    /* Check for loaded SMT group to be balanced to dst CPU */
>>> > +    if (!local_group && smt_balance(env, sgs, group))
>>> > +        sgs->group_smt_balance = 1;
>>> > +
>>> >      sgs->group_type = group_classify(env->sd->imbalance_pct,
>>> group, sgs);
>>> >
>>> >      /* Computing avg_load makes sense only when group is
>>> overloaded */
>>> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> >              return false;
>>> >          break;
>>> >
>>> > +    case group_smt_balance:
>>> >      case group_fully_busy:
>>> >          /*
>>> >           * Select the fully busy group with highest avg_load. In
>>> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> >          break;
>>> >
>>> >      case group_has_spare:
>>> > +        /*
>>> > +         * Do not pick sg with SMT CPUs over sg with pure CPUs,
>>> > +         * as we do not want to pull task off SMT core with one task
>>> > +         * and make the core idle.
>>> > +         */
>>> > +        if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>>> > +            if (sg->flags & SD_SHARE_CPUCAPACITY &&
>>> sgs->sum_h_nr_running <= 1)
>>> > +                return false;
>>> > +            else
>>> > +                return true;
>>> > +        }
>>> > +
>>> >          /*
>>> >           * Select not overloaded group with lowest number of idle
>>> cpus
>>> >           * and highest number of running tasks. We could also compare
>>> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct
>>> sched_group *idlest,
>>> >
>>> >      case group_imbalanced:
>>> >      case group_asym_packing:
>>> > +    case group_smt_balance:
>>> >          /* Those types are not used in the slow wakeup path */
>>> >          return false;
>>> >
>>> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd,
>>> struct task_struct *p, int this_cpu)
>>> >
>>> >      case group_imbalanced:
>>> >      case group_asym_packing:
>>> > +    case group_smt_balance:
>>> >          /* Those type are not used in the slow wakeup path */
>>> >          return NULL;
>>> >
>>> > @@ -10118,6 +10176,13 @@ static inline void
>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>> >          return;
>>> >      }
>>> >
>>> > +    if (busiest->group_type == group_smt_balance) {
>>> > +        /* Reduce number of tasks sharing CPU capacity */
>>> > +        env->migration_type = migrate_task;
>>> > +        env->imbalance = 1;
>>> > +        return;
>>> > +    }
>>> > +
>>> >      if (busiest->group_type == group_imbalanced) {
>>> >          /*
>>> >           * In the group_imb case we cannot rely on group-wide
>>> averages
>>> > @@ -10363,16 +10428,23 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> >          goto force_balance;
>>> >
>>> >      if (busiest->group_type != group_overloaded) {
>>> > -        if (env->idle == CPU_NOT_IDLE)
>>> > +        if (env->idle == CPU_NOT_IDLE) {
>>> >              /*
>>> >               * If the busiest group is not overloaded (and as a
>>> >               * result the local one too) but this CPU is already
>>> >               * busy, let another idle CPU try to pull task.
>>> >               */
>>> >              goto out_balanced;
>>> > +        }
>>> > +
>>> > +        if (busiest->group_type == group_smt_balance &&
>>> > +            smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>>> > +            /* Let non SMT CPU pull from SMT CPU sharing with
>>> sibling */
>>> > +            goto force_balance;
>>> > +        }
>>> >
>>> >          if (busiest->group_weight > 1 &&
>>> > -            local->idle_cpus <= (busiest->idle_cpus + 1))
>>> > +            local->idle_cpus <= (busiest->idle_cpus + 1)) {
>>> >              /*
>>> >               * If the busiest group is not overloaded
>>> >               * and there is no imbalance between this and busiest
>>> > @@ -10383,12 +10455,14 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> >               * there is more than 1 CPU per group.
>>> >               */
>>> >              goto out_balanced;
>>> > +        }
>>> >
>>> > -        if (busiest->sum_h_nr_running == 1)
>>> > +        if (busiest->sum_h_nr_running == 1) {
>>> >              /*
>>> >               * busiest doesn't have any tasks waiting to run
>>> >               */
>>> >              goto out_balanced;
>>> > +        }
>>> >      }
>>> >
>>> >  force_balance:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
  2023-07-18 14:52         ` Shrikanth Hegde
@ 2023-07-19  8:14           ` Tobias Huschle
  0 siblings, 0 replies; 55+ messages in thread
From: Tobias Huschle @ 2023-07-19  8:14 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Tim Chen, Juri Lelli, Vincent Guittot, Ricardo Neri,
	Ravi V . Shankar, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Len Brown, Mel Gorman, Rafael J . Wysocki,
	Srinivas Pandruvada, Steven Rostedt, Valentin Schneider,
	Ionela Voinescu, x86, linux-kernel, Srikar Dronamraju,
	naveen.n.rao, Yicong Yang, Barry Song, Chen Yu, Hillf Danton,
	Peter Zijlstra

On 2023-07-18 16:52, Shrikanth Hegde wrote:
> On 7/18/23 11:37 AM, Tobias Huschle wrote:
>> On 2023-07-15 01:05, Tim Chen wrote:
>>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>> 
>>>> 
>>>> 
>>>> If we consider symmetric platforms which have SMT4 such as power10.
>>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>> 
>>>> 
>>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>>> [--------------MC------------------------]
>>>> 
>>>> In case of SMT4, if there is any group which has 2 or more tasks, 
>>>> that
>>>> group will be marked as group_smt_balance. previously, if that group
>>>> had 2
>>>> or 3 tasks, it would have been marked as group_has_spare. Since all
>>>> the groups have
>>>> SMT that means behavior would be same fully busy right? That can
>>>> cause some
>>>> corner cases. No?
>>> 
>>> You raised a good point. I was looking from SMT2
>>> perspective so group_smt_balance implies group_fully_busy.
>>> That is no longer true for SMT4.
>>> 
>>> I am thinking of the following fix on the current patch
>>> to take care of SMT4. Do you think this addresses
>>> concerns from you and Tobias?
>>> 
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 294a662c9410..3fc8d3a3bd22 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>>                 break;
>>> 
>>>         case group_smt_balance:
>>> +               /* no idle cpus on both groups handled by
>>> group_fully_busy below */
>>> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> +                       if (sgs->idle_cpus > busiest->idle_cpus)
>>> +                               return false;
>>> +                       if (sgs->idle_cpus < busiest->idle_cpus)
>>> +                               return true;
>>> +                       if (sgs->sum_nr_running <=
>>> busiest_sum_nr_running)
>>> +                               return false;
>>> +                       else
>>> +                               return true;
>>> +               }
>>> 
>>> 
>>> I will be on vacation next three weeks so my response will be slow.
>>> 
>>> Tim
>>> 
>> 
>> What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.
>> 
>> [0 1][2 3 4 5]
>> [SMT][--SMT--]
>> 
>> If now CPUs 0, 2 and 3 have a running task, both groups would be
>> classified as smt_balance. But when it comes to the selection of the
>> busiest group, the smaller group would be selected, as it has fewer
>> idle CPUs, right? That could lead to the smaller group being left with
>> no tasks. Using the absolute number of tasks is what made the
>> prefer_sibling path problematic,
> 
> 
> Yes. But not sure how realistic that configuration is. On Power10, we
> typically have all cores in either SMT1, SMT2 or SMT4, but not mixed
> configs. One can offline CPUs to get into such cases with SMT4.

I'm also not sure if there is a real case for that. The assumption that
two groups are always of equal size was the reason why the prefer_sibling
path did not work as expected. I just wanted to point out that we might
introduce a similar assumption here again. It might be a valid assumption
if there are no use cases for having two cores with a different number of
SMT threads.
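
To make the width concern above concrete, here is a toy, self-contained
sketch (plain userspace C, not kernel code); the group sizes and task
counts simply mirror the [0 1][2 3 4 5] example, and all names are made
up for illustration:

#include <stdio.h>

/*
 * Toy illustration only: with groups of unequal width, comparing
 * absolute idle_cpus counts favours the narrower group as "busiest".
 * CPUs 0, 2 and 3 are assumed busy, as in the example above.
 */
struct toy_group { int nr_cpus; int nr_running; };

static int toy_idle_cpus(const struct toy_group *g)
{
	return g->nr_cpus - g->nr_running;
}

int main(void)
{
	struct toy_group smt2 = { .nr_cpus = 2, .nr_running = 1 };
	struct toy_group smt4 = { .nr_cpus = 4, .nr_running = 2 };

	/* 1 idle vs 2 idle: the smaller SMT2 group looks "busier" */
	printf("smt2 idle=%d, smt4 idle=%d -> pull from %s\n",
	       toy_idle_cpus(&smt2), toy_idle_cpus(&smt4),
	       toy_idle_cpus(&smt2) < toy_idle_cpus(&smt4) ? "smt2" : "smt4");
	return 0;
}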

> 
>> I would assume that the same holds true here. Therefore, I would prefer
>> avg_load, or, similar to prefer_siblings, a ratio over the number of cores.
>>
>> I can't really test that on s390 as we always have SMT2. But we can have
>> these asymmetries on higher levels, e.g.
> 
> 
> IIUC, on higher levels, the group will not have SD_SHARE_CPUCAPACITY, so
> it shouldn't run into group_smt_balance.
> 
>> 
>> [0 1][2 3][4 5][6 7][8 9]
>> [SMT][SMT][SMT][SMT][SMT]
>> [-----core----][--core--]
>> 
>> For large configurations this can be true for even higher levels.
>> Therefore, the idea was to move the smt_balance state around and adapt
>> its conditions to something like this (which would require reordering
>> the commits):
>> 
>> @@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
>>  enum group_type {
>>         /* The group has spare capacity that can be used to run more
>> tasks.  */
>>         group_has_spare = 0,
>> +       /*
>> +        * Balance SMT group that's fully busy. Can benefit from 
>> migration
>> +        * a task on SMT with busy sibling to another CPU on idle 
>> core.
>> +        */
>> +       group_smt_balance,
>>         /*
>>          * The group is fully used and the tasks don't compete for 
>> more CPU
>>          * cycles. Nevertheless, some tasks might wait before running.
>> @@ -8340,11 +8345,6 @@ enum group_type {
>>          * more powerful CPU.
>>          */
>>         group_misfit_task,
>> -       /*
>> -        * Balance SMT group that's fully busy. Can benefit from 
>> migration
>> -        * a task on SMT with busy sibling to another CPU on idle 
>> core.
>> -        */
>> -       group_smt_balance,
>>         /*
>>          * SD_ASYM_PACKING only: One local CPU with higher capacity is
>> available,
> 
> 
> IIUC, for the cluster topology of this patch, the busiest group should be
> an SMT group if it has 2 threads, compared to an Atom cluster having 4
> threads. The Atom cluster will be group_fully_busy, whereas the SMT group
> will be group_smt_balance. For that to happen, group_smt_balance should
> have a higher group_type.

Makes sense.
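
As a rough sketch of that ordering argument (not kernel code; the enum
names and the two example groups are made up for illustration):
busiest-group selection prefers the higher group_type, so having
group_smt_balance rank above group_fully_busy lets the two-thread SMT
core win over the fully busy Atom cluster.

#include <stdio.h>

/*
 * Toy sketch only: a higher group_type value wins the busiest-group
 * comparison, so group_smt_balance must rank above group_fully_busy.
 */
enum toy_group_type {
	toy_group_has_spare = 0,
	toy_group_fully_busy,
	toy_group_smt_balance,
};

int main(void)
{
	enum toy_group_type smt_core = toy_group_smt_balance;     /* 2 threads, both busy */
	enum toy_group_type atom_cluster = toy_group_fully_busy;  /* 4 CPUs, all busy     */

	printf("busiest: %s\n",
	       smt_core > atom_cluster ? "SMT core" : "Atom cluster");
	return 0;
}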

> 
>>          * and the task should be migrated to it instead of running on 
>> the
>> @@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int
>> imbalance_pct,
>>         if (sgs->group_asym_packing)
>>                 return group_asym_packing;
>> 
>> -       if (sgs->group_smt_balance)
>> -               return group_smt_balance;
>> -
>>         if (sgs->group_misfit_task_load)
>>                 return group_misfit_task;
>> 
>>         if (!group_has_capacity(imbalance_pct, sgs))
>>                 return group_fully_busy;
>> 
>> +       if (sgs->group_smt_balance)
>> +               return group_smt_balance;
>> +
>>         return group_has_spare;
>>  }
>> 
>> @@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env 
>> *env,
>> struct sg_lb_stats *sgs,
>>          * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
>>          * will not be on.
>>          */
>> -       if (group->flags & SD_SHARE_CPUCAPACITY &&
>> -           sgs->sum_h_nr_running > 1)
>> +       if (sgs->sum_h_nr_running > group->cores)
> 
> In case of Power10, where we have SMT4, group->cores will be 1. I don't
> see a difference here.

The aim of this change was to also make use of this further up in the
hierarchy, where SD_SHARE_CPUCAPACITY is not set. Up there, it would be
possible to have more than one core, and potentially different numbers
(at least on s390).

It appears to work fine without these changes though, so I think there is
nothing to do for now.

> 
>>                 return true;
>> 
>>         return false;
>> 
>> The s390 problem is currently solved by changing the prefer_sibling
>> path. When disabling that flag, we might have an issue; we will have to
>> verify that though.
>> 
>>>> 
>>>> One example: let's say sg1 has 4 tasks and sg2 has 0 tasks and is
>>>> trying to do load balance. Previously the imbalance would have been 2,
>>>> whereas now the imbalance would be 1. But in a subsequent lb it would
>>>> be balanced.
>>>> 
>>>> 
>>>> 
>>>> > +    return false;
>>>> > +}
>>>> > +
>>>> >  static inline bool
>>>> >  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>>> >  {
>>>> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct
>>>> lb_env *env,
>>>> >          sgs->group_asym_packing = 1;
>>>> >      }
>>>> >
>>>> > +    /* Check for loaded SMT group to be balanced to dst CPU */
>>>> > +    if (!local_group && smt_balance(env, sgs, group))
>>>> > +        sgs->group_smt_balance = 1;
>>>> > +
>>>> >      sgs->group_type = group_classify(env->sd->imbalance_pct,
>>>> group, sgs);
>>>> >
>>>> >      /* Computing avg_load makes sense only when group is
>>>> overloaded */
>>>> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct
>>>> lb_env *env,
>>>> >              return false;
>>>> >          break;
>>>> >
>>>> > +    case group_smt_balance:
>>>> >      case group_fully_busy:
>>>> >          /*
>>>> >           * Select the fully busy group with highest avg_load. In
>>>> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct
>>>> lb_env *env,
>>>> >          break;
>>>> >
>>>> >      case group_has_spare:
>>>> > +        /*
>>>> > +         * Do not pick sg with SMT CPUs over sg with pure CPUs,
>>>> > +         * as we do not want to pull task off SMT core with one task
>>>> > +         * and make the core idle.
>>>> > +         */
>>>> > +        if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>>>> > +            if (sg->flags & SD_SHARE_CPUCAPACITY &&
>>>> sgs->sum_h_nr_running <= 1)
>>>> > +                return false;
>>>> > +            else
>>>> > +                return true;> +        }
>>>> > +
>>>> >          /*
>>>> >           * Select not overloaded group with lowest number of idle
>>>> cpus
>>>> >           * and highest number of running tasks. We could also compare
>>>> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct
>>>> sched_group *idlest,
>>>> >
>>>> >      case group_imbalanced:
>>>> >      case group_asym_packing:
>>>> > +    case group_smt_balance:
>>>> >          /* Those types are not used in the slow wakeup path */
>>>> >          return false;
>>>> >
>>>> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd,
>>>> struct task_struct *p, int this_cpu)
>>>> >
>>>> >      case group_imbalanced:
>>>> >      case group_asym_packing:
>>>> > +    case group_smt_balance:
>>>> >          /* Those type are not used in the slow wakeup path */
>>>> >          return NULL;
>>>> >
>>>> > @@ -10118,6 +10176,13 @@ static inline void
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> >          return;
>>>> >      }
>>>> >
>>>> > +    if (busiest->group_type == group_smt_balance) {
>>>> > +        /* Reduce number of tasks sharing CPU capacity */
>>>> > +        env->migration_type = migrate_task;
>>>> > +        env->imbalance = 1;
>>>> > +        return;
>>>> > +    }
>>>> > +
>>>> >      if (busiest->group_type == group_imbalanced) {
>>>> >          /*
>>>> >           * In the group_imb case we cannot rely on group-wide
>>>> averages
>>>> > @@ -10363,16 +10428,23 @@ static struct sched_group
>>>> *find_busiest_group(struct lb_env *env)
>>>> >          goto force_balance;
>>>> >
>>>> >      if (busiest->group_type != group_overloaded) {
>>>> > -        if (env->idle == CPU_NOT_IDLE)
>>>> > +        if (env->idle == CPU_NOT_IDLE) {
>>>> >              /*
>>>> >               * If the busiest group is not overloaded (and as a
>>>> >               * result the local one too) but this CPU is already
>>>> >               * busy, let another idle CPU try to pull task.
>>>> >               */
>>>> >              goto out_balanced;
>>>> > +        }
>>>> > +
>>>> > +        if (busiest->group_type == group_smt_balance &&
>>>> > +            smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>>>> > +            /* Let non SMT CPU pull from SMT CPU sharing with
>>>> sibling */
>>>> > +            goto force_balance;
>>>> > +        }
>>>> >
>>>> >          if (busiest->group_weight > 1 &&
>>>> > -            local->idle_cpus <= (busiest->idle_cpus + 1))
>>>> > +            local->idle_cpus <= (busiest->idle_cpus + 1)) {
>>>> >              /*
>>>> >               * If the busiest group is not overloaded
>>>> >               * and there is no imbalance between this and busiest
>>>> > @@ -10383,12 +10455,14 @@ static struct sched_group
>>>> *find_busiest_group(struct lb_env *env)
>>>> >               * there is more than 1 CPU per group.
>>>> >               */
>>>> >              goto out_balanced;
>>>> > +        }
>>>> >
>>>> > -        if (busiest->sum_h_nr_running == 1)
>>>> > +        if (busiest->sum_h_nr_running == 1) {
>>>> >              /*
>>>> >               * busiest doesn't have any tasks waiting to run
>>>> >               */
>>>> >              goto out_balanced;
>>>> > +        }
>>>> >      }
>>>> >
>>>> >  force_balance:

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-07-17 14:58               ` [PATCH] sched/fair: Add SMT4 group_smt_balance handling Shrikanth Hegde
@ 2023-07-27  3:11                 ` Tim Chen
  2023-07-27 13:32                   ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-07-27  3:11 UTC (permalink / raw)
  To: Shrikanth Hegde, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen

On Mon, 2023-07-17 at 20:28 +0530, Shrikanth Hegde wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> For SMT4, any group with more than 2 tasks will be marked as
> group_smt_balance. Retain the behaviour of group_has_spare by marking
> the busiest group as the group which has the least number of idle_cpus.
> 
> Also, handle rounding effect of adding (ncores_local + ncores_busy)
> when the local is fully idle and busy group has more than 2 tasks.
> Local group should try to pull at least 1 task in this case.
> 
> Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 932e7b78894a..9502013abe33 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>  	imbalance /= ncores_local + ncores_busiest;
> 
>  	/* Take advantage of resource in an empty sched group */
> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>  	    busiest->sum_nr_running > 1)
>  		imbalance = 2;
> 
> @@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
> 
>  	case group_smt_balance:
> +		/* no idle cpus on both groups handled by group_fully_busy below */
> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +			if (sgs->idle_cpus > busiest->idle_cpus)
> +				return false;
> +			if (sgs->idle_cpus < busiest->idle_cpus)
> +				return true;
> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> +				return false;
> +		}
> +		break;
> +
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In

Thanks for the fix up patch.

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-07-27  3:11                 ` Tim Chen
@ 2023-07-27 13:32                   ` Tim Chen
  2023-08-07  9:36                     ` Shrikanth Hegde
  2023-09-05 10:41                     ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-07-27 13:32 UTC (permalink / raw)
  To: Shrikanth Hegde, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen

On Wed, 2023-07-26 at 20:11 -0700, Tim Chen wrote:
> On Mon, 2023-07-17 at 20:28 +0530, Shrikanth Hegde wrote:
> > From: Tim Chen <tim.c.chen@linux.intel.com>
> > 
> > For SMT4, any group with more than 2 tasks will be marked as
> > group_smt_balance. Retain the behaviour of group_has_spare by marking
> > the busiest group as the group which has the least number of idle_cpus.
> > 
> > Also, handle rounding effect of adding (ncores_local + ncores_busy)
> > when the local is fully idle and busy group has more than 2 tasks.
> > Local group should try to pull at least 1 task in this case.
> > 
> > Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> > ---
> >  kernel/sched/fair.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 932e7b78894a..9502013abe33 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> >  	imbalance /= ncores_local + ncores_busiest;
> > 
> >  	/* Take advantage of resource in an empty sched group */
> > -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
> >  	    busiest->sum_nr_running > 1)
> >  		imbalance = 2;
> > 
> > @@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		break;
> > 
> >  	case group_smt_balance:
> > +		/* no idle cpus on both groups handled by group_fully_busy below */
> > +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > +			if (sgs->idle_cpus > busiest->idle_cpus)
> > +				return false;
> > +			if (sgs->idle_cpus < busiest->idle_cpus)
> > +				return true;
> > +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > +				return false;
> > +		}
> > +		break;

Shrikanth and Peter,

Sorry, I acked Shrikanth's fixup patch too quickly without seeing that Shrikanth added
a "break" in the patch above.  My original code did not have that break statement, as
I intended the code to fall through to the "group_fully_busy" code path when
there are no idle CPUs in either group.  To make the compiler happy and put
in the correct logic, I have refreshed the patch below.

Thanks.

Tim

From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Fri, 14 Jul 2023 16:09:30 -0700
Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling

For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy)
when the local is fully idle and busy group has more than 2 tasks.
Local group should try to pull at least 1 task in this case.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a87988327f88..566686c5f2bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/* no idle cpus on both groups handled by group_fully_busy below */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+			if (sgs->idle_cpus > busiest->idle_cpus)
+				return false;
+			if (sgs->idle_cpus < busiest->idle_cpus)
+				return true;
+			if (sgs->sum_nr_running <= busiest->sum_nr_running)
+				return false;
+			else
+				return true;
+		}
+		goto fully_busy;
+		break;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * select the 1st one, except if @sg is composed of SMT
 		 * siblings.
 		 */
-
+fully_busy:
 		if (sgs->avg_load < busiest->avg_load)
 			return false;
 
-- 
2.32.0
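
As a rough illustration of why the check moves from "imbalance == 0" to
"imbalance <= 1", here is a toy, userspace-only walk-through. It assumes
the normalization in sibling_imbalance() is
imbalance = (2 * diff + ncores_local + ncores_busiest) / (ncores_local + ncores_busiest),
which is pieced together from the quoted hunk and the commit message rather
than reproduced from the kernel source, so treat the exact numbers as
illustrative:

#include <stdio.h>

/*
 * Toy model only, not kernel code.  Assumed normalization:
 *   imbalance = (2 * diff + ncores_local + ncores_busiest)
 *               / (ncores_local + ncores_busiest)
 * where diff = ncores_local * busiest_nr - ncores_busiest * local_nr.
 */
static long toy_sibling_imbalance(int ncores_local, int ncores_busiest,
				  int local_nr, int busiest_nr,
				  int use_le_one_check)
{
	long diff = (long)ncores_local * busiest_nr -
		    (long)ncores_busiest * local_nr;
	long imbalance;

	if (diff < 0)
		diff = 0;

	imbalance = (2 * diff + ncores_local + ncores_busiest) /
		    (ncores_local + ncores_busiest);

	/* old check: == 0, new check: <= 1 */
	if ((use_le_one_check ? imbalance <= 1 : imbalance == 0) &&
	    local_nr == 0 && busiest_nr > 1)
		imbalance = 2;

	return imbalance;
}

int main(void)
{
	/* fully idle 1-core local group vs a 4-core group running 2 tasks */
	printf("old check: %ld\n", toy_sibling_imbalance(1, 4, 0, 2, 0)); /* prints 1 */
	printf("new check: %ld\n", toy_sibling_imbalance(1, 4, 0, 2, 1)); /* prints 2 */
	return 0;
}

Under that assumed formula, adding ncores_local + ncores_busiest before the
division means the result is never 0, so the original "== 0" special case
for an empty local group could no longer trigger; "<= 1" restores it.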



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-07-27 13:32                   ` Tim Chen
@ 2023-08-07  9:36                     ` Shrikanth Hegde
  2023-08-21 19:19                       ` Tim Chen
  2023-09-05 10:41                     ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-08-07  9:36 UTC (permalink / raw)
  To: Tim Chen, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen



On 7/27/23 7:02 PM, Tim Chen wrote:
> On Wed, 2023-07-26 at 20:11 -0700, Tim Chen wrote:
>> On Mon, 2023-07-17 at 20:28 +0530, Shrikanth Hegde wrote:
>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>
>>> For SMT4, any group with more than 2 tasks will be marked as
>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>> the busiest group as the group which has the least number of idle_cpus.
>>>
>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>> when the local is fully idle and busy group has more than 2 tasks.
>>> Local group should try to pull at least 1 task in this case.
>>>
>>> Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>>> ---
>>>  kernel/sched/fair.c | 13 ++++++++++++-
>>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 932e7b78894a..9502013abe33 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>>  	imbalance /= ncores_local + ncores_busiest;
>>>
>>>  	/* Take advantage of resource in an empty sched group */
>>> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
>>> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>>  	    busiest->sum_nr_running > 1)
>>>  		imbalance = 2;
>>>
>>> @@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  		break;
>>>
>>>  	case group_smt_balance:
>>> +		/* no idle cpus on both groups handled by group_fully_busy below */
>>> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> +			if (sgs->idle_cpus > busiest->idle_cpus)
>>> +				return false;
>>> +			if (sgs->idle_cpus < busiest->idle_cpus)
>>> +				return true;
>>> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>> +				return false;
>>> +		}
>>> +		break;
> 
> Shrikanth and Peter,
> 
> Sorry, I acked Shrikanth's fixup patch too quickly without seeing that Shrikanth added
> a "break" in the patch above.  My original code did not have that break statement, as
> I intended the code to fall through to the "group_fully_busy" code path when
> there are no idle CPUs in either group.  To make the compiler happy and put
> in the correct logic, I have refreshed the patch below.
> 
> Thanks.
> 
> Tim
> 
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Fri, 14 Jul 2023 16:09:30 -0700
> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
> 
> For SMT4, any group with more than 2 tasks will be marked as
> group_smt_balance. Retain the behaviour of group_has_spare by marking
> the busiest group as the group which has the least number of idle_cpus.
> 
> Also, handle rounding effect of adding (ncores_local + ncores_busy)
> when the local is fully idle and busy group has more than 2 tasks.
> Local group should try to pull at least 1 task in this case.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a87988327f88..566686c5f2bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>  	imbalance /= ncores_local + ncores_busiest;
>  
>  	/* Take advantage of resource in an empty sched group */
> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>  	    busiest->sum_nr_running > 1)
>  		imbalance = 2;
>  
> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
>  
>  	case group_smt_balance:
> +		/* no idle cpus on both groups handled by group_fully_busy below */
> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +			if (sgs->idle_cpus > busiest->idle_cpus)
> +				return false;
> +			if (sgs->idle_cpus < busiest->idle_cpus)
> +				return true;
> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> +				return false;
> +			else
> +				return true;
> +		}
> +		goto fully_busy;
> +		break;
> +
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		 * select the 1st one, except if @sg is composed of SMT
>  		 * siblings.
>  		 */
> -
> +fully_busy:
>  		if (sgs->avg_load < busiest->avg_load)
>  			return false;
>  

Hi Tim, Peter. 

The group_smt_balance (cluster scheduling) patches are in tip/sched/core. I don't
see the above patch there yet. As is, this can cause a functional difference
on SMT4 systems (such as Power10).

Can we please have the above patch as well in tip/sched/core?

Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-08-07  9:36                     ` Shrikanth Hegde
@ 2023-08-21 19:19                       ` Tim Chen
  2023-09-05  8:03                         ` Shrikanth Hegde
  2023-09-05 10:38                         ` [PATCH] sched/fair: Add " Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-08-21 19:19 UTC (permalink / raw)
  To: Shrikanth Hegde, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen

On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
> > 
> > From: Tim Chen <tim.c.chen@linux.intel.com>
> > Date: Fri, 14 Jul 2023 16:09:30 -0700
> > Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
> > 
> > For SMT4, any group with more than 2 tasks will be marked as
> > group_smt_balance. Retain the behaviour of group_has_spare by marking
> > the busiest group as the group which has the least number of idle_cpus.
> > 
> > Also, handle rounding effect of adding (ncores_local + ncores_busy)
> > when the local is fully idle and busy group has more than 2 tasks.
> > Local group should try to pull at least 1 task in this case.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c | 18 ++++++++++++++++--
> >  1 file changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a87988327f88..566686c5f2bd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> >  	imbalance /= ncores_local + ncores_busiest;
> >  
> >  	/* Take advantage of resource in an empty sched group */
> > -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
> >  	    busiest->sum_nr_running > 1)
> >  		imbalance = 2;
> >  
> > @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		break;
> >  
> >  	case group_smt_balance:
> > +		/* no idle cpus on both groups handled by group_fully_busy below */
> > +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > +			if (sgs->idle_cpus > busiest->idle_cpus)
> > +				return false;
> > +			if (sgs->idle_cpus < busiest->idle_cpus)
> > +				return true;
> > +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > +				return false;
> > +			else
> > +				return true;
> > +		}
> > +		goto fully_busy;
> > +		break;
> > +
> >  	case group_fully_busy:
> >  		/*
> >  		 * Select the fully busy group with highest avg_load. In
> > @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		 * select the 1st one, except if @sg is composed of SMT
> >  		 * siblings.
> >  		 */
> > -
> > +fully_busy:
> >  		if (sgs->avg_load < busiest->avg_load)
> >  			return false;
> >  
> 
> Hi Tim, Peter. 
> 
> group_smt_balance(cluster scheduling), patches are in tip/sched/core. I dont 
> see this above patch there yet. Currently as is, this can cause function difference 
> in SMT4 systems( such as Power10). 
> 
> Can we please have the above patch as well in tip/sched/core?
> 
> Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Hi Peter,

Just back from my long vacation.  Wonder if you have any comments on the above patch
for fixing the SMT4 case?

Tim


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-08-21 19:19                       ` Tim Chen
@ 2023-09-05  8:03                         ` Shrikanth Hegde
  2023-09-05  9:49                           ` Peter Zijlstra
  2023-09-05 18:37                           ` Tim Chen
  2023-09-05 10:38                         ` [PATCH] sched/fair: Add " Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Shrikanth Hegde @ 2023-09-05  8:03 UTC (permalink / raw)
  To: Tim Chen, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen



On 8/22/23 12:49 AM, Tim Chen wrote:
> On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
>>>
>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>> Date: Fri, 14 Jul 2023 16:09:30 -0700
>>> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
>>>
>>> For SMT4, any group with more than 2 tasks will be marked as
>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>> the busiest group as the group which has the least number of idle_cpus.
>>>
>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>> when the local is fully idle and busy group has more than 2 tasks.
>>> Local group should try to pull at least 1 task in this case.
>>>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>  kernel/sched/fair.c | 18 ++++++++++++++++--
>>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a87988327f88..566686c5f2bd 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>>  	imbalance /= ncores_local + ncores_busiest;
>>>  
>>>  	/* Take advantage of resource in an empty sched group */
>>> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
>>> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>>  	    busiest->sum_nr_running > 1)
>>>  		imbalance = 2;
>>>  
>>> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  		break;
>>>  
>>>  	case group_smt_balance:
>>> +		/* no idle cpus on both groups handled by group_fully_busy below */
>>> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> +			if (sgs->idle_cpus > busiest->idle_cpus)
>>> +				return false;
>>> +			if (sgs->idle_cpus < busiest->idle_cpus)
>>> +				return true;
>>> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>> +				return false;
>>> +			else
>>> +				return true;
>>> +		}
>>> +		goto fully_busy;
>>> +		break;
>>> +
>>>  	case group_fully_busy:
>>>  		/*
>>>  		 * Select the fully busy group with highest avg_load. In
>>> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  		 * select the 1st one, except if @sg is composed of SMT
>>>  		 * siblings.
>>>  		 */
>>> -
>>> +fully_busy:
>>>  		if (sgs->avg_load < busiest->avg_load)
>>>  			return false;
>>>  
>>
>> Hi Tim, Peter. 
>>
>> group_smt_balance(cluster scheduling), patches are in tip/sched/core. I dont 
>> see this above patch there yet. Currently as is, this can cause function difference 
>> in SMT4 systems( such as Power10). 
>>
>> Can we please have the above patch as well in tip/sched/core?
>>
>> Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> 
> Hi Peter,
> 
> Just back from my long vacation.  Wonder if you have any comments on the above patch
> for fixing the SMT4 case?
> 
> Tim

Hi Tim, Peter. 

Are there any concerns with the above patch for fixing the SMT4 case?
Currently the behavior is that group_smt_balance is set even for a group with
just 2 tasks in SMT4; ideally it should behave the same as group_has_spare.

The above patch copies the same behavior to group_smt_balance. 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05  8:03                         ` Shrikanth Hegde
@ 2023-09-05  9:49                           ` Peter Zijlstra
  2023-09-05 18:37                           ` Tim Chen
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-09-05  9:49 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Tim Chen, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Tue, Sep 05, 2023 at 01:33:57PM +0530, Shrikanth Hegde wrote:
> Hi Tim, Peter. 

Back from PTO; mailbox is a disaster area, but I'll try and have a look
soon.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-08-21 19:19                       ` Tim Chen
  2023-09-05  8:03                         ` Shrikanth Hegde
@ 2023-09-05 10:38                         ` Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-09-05 10:38 UTC (permalink / raw)
  To: Tim Chen
  Cc: Shrikanth Hegde, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Mon, Aug 21, 2023 at 12:19:40PM -0700, Tim Chen wrote:
> Just back from my long vacation.  Wonder if you have any comments on the above patch
> for fixing the SMT4 case?

This should have:

Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")

Right?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-07-27 13:32                   ` Tim Chen
  2023-08-07  9:36                     ` Shrikanth Hegde
@ 2023-09-05 10:41                     ` Peter Zijlstra
  2023-09-05 17:54                       ` Tim Chen
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-09-05 10:41 UTC (permalink / raw)
  To: Tim Chen
  Cc: Shrikanth Hegde, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Thu, Jul 27, 2023 at 06:32:44AM -0700, Tim Chen wrote:

>  kernel/sched/fair.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a87988327f88..566686c5f2bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>  	imbalance /= ncores_local + ncores_busiest;
>  
>  	/* Take advantage of resource in an empty sched group */
> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>  	    busiest->sum_nr_running > 1)
>  		imbalance = 2;
>  
> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
>  
>  	case group_smt_balance:
> +		/* no idle cpus on both groups handled by group_fully_busy below */
> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> +			if (sgs->idle_cpus > busiest->idle_cpus)
> +				return false;
> +			if (sgs->idle_cpus < busiest->idle_cpus)
> +				return true;
> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> +				return false;
> +			else
> +				return true;
> +		}
> +		goto fully_busy;
> +		break;

This is really daft; why can't this simply be: fallthrough; ? At the
very least that break must go.

> +
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		 * select the 1st one, except if @sg is composed of SMT
>  		 * siblings.
>  		 */
> -
> +fully_busy:
>  		if (sgs->avg_load < busiest->avg_load)
>  			return false;
>  
> -- 
> 2.32.0
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05 10:41                     ` Peter Zijlstra
@ 2023-09-05 17:54                       ` Tim Chen
  2023-09-06  8:23                         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Tim Chen @ 2023-09-05 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shrikanth Hegde, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Tue, 2023-09-05 at 12:41 +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 06:32:44AM -0700, Tim Chen wrote:
> 
> >  kernel/sched/fair.c | 18 ++++++++++++++++--
> >  1 file changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a87988327f88..566686c5f2bd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> >  	imbalance /= ncores_local + ncores_busiest;
> >  
> >  	/* Take advantage of resource in an empty sched group */
> > -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
> >  	    busiest->sum_nr_running > 1)
> >  		imbalance = 2;
> >  
> > @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  		break;
> >  
> >  	case group_smt_balance:
> > +		/* no idle cpus on both groups handled by group_fully_busy below */
> > +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > +			if (sgs->idle_cpus > busiest->idle_cpus)
> > +				return false;
> > +			if (sgs->idle_cpus < busiest->idle_cpus)
> > +				return true;
> > +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > +				return false;
> > +			else
> > +				return true;
> > +		}
> > +		goto fully_busy;
> > +		break;
> 
> This is really daft; why can't this simply be: fallthrough; ? At the
> very least that break must go.
> 
> 

Yes, the break should go.  I was adding the goto to prevent the compiler
from complaining about fall-through code.  The break is no longer needed.

Tim

From 81971a0b1eb64059756f00d8497b1865af2c0792 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Fri, 14 Jul 2023 16:09:30 -0700
Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling

For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy)
when the local is fully idle and busy group has more than 2 tasks.
Local group should try to pull at least 1 task in this case.

Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..6e7ee2efc1ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9575,7 +9575,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9763,6 +9763,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/* no idle cpus on both groups handled by group_fully_busy below */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+			if (sgs->idle_cpus > busiest->idle_cpus)
+				return false;
+			if (sgs->idle_cpus < busiest->idle_cpus)
+				return true;
+			if (sgs->sum_nr_running <= busiest->sum_nr_running)
+				return false;
+			else
+				return true;
+		}
+		goto fully_busy;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9775,7 +9788,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		 * select the 1st one, except if @sg is composed of SMT
 		 * siblings.
 		 */
-
+fully_busy:
 		if (sgs->avg_load < busiest->avg_load)
 			return false;
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05  8:03                         ` Shrikanth Hegde
  2023-09-05  9:49                           ` Peter Zijlstra
@ 2023-09-05 18:37                           ` Tim Chen
  2023-09-06  9:29                             ` Shrikanth Hegde
  2023-09-07  8:58                             ` Shrikanth Hegde
  1 sibling, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-09-05 18:37 UTC (permalink / raw)
  To: Shrikanth Hegde, peterz
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen

On Tue, 2023-09-05 at 13:33 +0530, Shrikanth Hegde wrote:
> 
> On 8/22/23 12:49 AM, Tim Chen wrote:
> > On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
> > > > 
> > > > From: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Date: Fri, 14 Jul 2023 16:09:30 -0700
> > > > Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
> > > > 
> > > > For SMT4, any group with more than 2 tasks will be marked as
> > > > group_smt_balance. Retain the behaviour of group_has_spare by marking
> > > > the busiest group as the group which has the least number of idle_cpus.
> > > > 
> > > > Also, handle rounding effect of adding (ncores_local + ncores_busy)
> > > > when the local is fully idle and busy group has more than 2 tasks.
> > > > Local group should try to pull at least 1 task in this case.
> > > > 
> > > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > ---
> > > >  kernel/sched/fair.c | 18 ++++++++++++++++--
> > > >  1 file changed, 16 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a87988327f88..566686c5f2bd 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> > > >  	imbalance /= ncores_local + ncores_busiest;
> > > >  
> > > >  	/* Take advantage of resource in an empty sched group */
> > > > -	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > > +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
> > > >  	    busiest->sum_nr_running > 1)
> > > >  		imbalance = 2;
> > > >  
> > > > @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > > >  		break;
> > > >  
> > > >  	case group_smt_balance:
> > > > +		/* no idle cpus on both groups handled by group_fully_busy below */
> > > > +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > > > +			if (sgs->idle_cpus > busiest->idle_cpus)
> > > > +				return false;
> > > > +			if (sgs->idle_cpus < busiest->idle_cpus)
> > > > +				return true;
> > > > +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > > > +				return false;
> > > > +			else
> > > > +				return true;
> > > > +		}
> > > > +		goto fully_busy;
> > > > +		break;
> > > > +
> > > >  	case group_fully_busy:
> > > >  		/*
> > > >  		 * Select the fully busy group with highest avg_load. In
> > > > @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > > >  		 * select the 1st one, except if @sg is composed of SMT
> > > >  		 * siblings.
> > > >  		 */
> > > > -
> > > > +fully_busy:
> > > >  		if (sgs->avg_load < busiest->avg_load)
> > > >  			return false;
> > > >  
> > > 
> > > Hi Tim, Peter. 
> > > 
> > > group_smt_balance(cluster scheduling), patches are in tip/sched/core. I dont 
> > > see this above patch there yet. Currently as is, this can cause function difference 
> > > in SMT4 systems( such as Power10). 
> > > 
> > > Can we please have the above patch as well in tip/sched/core?
> > > 
> > > Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> > 
> > Hi Peter,
> > 
> > Just back from my long vacation.  Wonder if you have any comments on the above patch
> > for fixing the SMT4 case?
> > 
> > Tim
> 
> Hi Tim, Peter. 
> 
> are there any concerns with the above patch for fixing the SMT4 case. 
> Currently the behavior is group_smt_balance is set for having even 2 tasks in 
> SMT4, ideally it should be same as the group_has_spare. 
> 
> The above patch copies the same behavior to group_smt_balance. 
> > 

You mean simplify the patch as below?  I think that should be fine.  Can you
make sure it works for SMT4?  And I can update the patch once you confirm it
works properly.

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e7ee2efc1ba..48e9ab7f8a87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 
        case group_smt_balance:
                /* no idle cpus on both groups handled by group_fully_busy below */
-               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
-                       if (sgs->idle_cpus > busiest->idle_cpus)
-                               return false;
-                       if (sgs->idle_cpus < busiest->idle_cpus)
-                               return true;
-                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
-                               return false;
-                       else
-                               return true;
-               }
+               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
+                       goto has_spare;
+
                goto fully_busy;
 
        case group_fully_busy:
@@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
                 * as we do not want to pull task off SMT core with one task
                 * and make the core idle.
                 */
+has_spare:
                if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
                        if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
                                return false;




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05 17:54                       ` Tim Chen
@ 2023-09-06  8:23                         ` Peter Zijlstra
  2023-09-06 15:45                           ` Tim Chen
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-09-06  8:23 UTC (permalink / raw)
  To: Tim Chen
  Cc: Shrikanth Hegde, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Tue, Sep 05, 2023 at 10:54:09AM -0700, Tim Chen wrote:

> > > +		goto fully_busy;
> > > +		break;
> > 
> > This is really daft; why can't this simply be: fallthrough; ? At the
> > very least that break must go.
> > 
> > 
> 
> Yes, the break should go.  I was adding the goto to prevent compiler
> from complaining about fall through code. 

But that's what we have the fallthrough keyword for, no?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05 18:37                           ` Tim Chen
@ 2023-09-06  9:29                             ` Shrikanth Hegde
  2023-09-06 15:42                               ` Tim Chen
  2023-09-07  8:58                             ` Shrikanth Hegde
  1 sibling, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-09-06  9:29 UTC (permalink / raw)
  To: Tim Chen
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen, peterz



On 9/6/23 12:07 AM, Tim Chen wrote:
> On Tue, 2023-09-05 at 13:33 +0530, Shrikanth Hegde wrote:
>>
>> On 8/22/23 12:49 AM, Tim Chen wrote:
>>> On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
>>>>>
>>>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> Date: Fri, 14 Jul 2023 16:09:30 -0700
>>>>> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
>>>>>
>>>>> For SMT4, any group with more than 2 tasks will be marked as
>>>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>>>> the busiest group as the group which has the least number of idle_cpus.
>>>>>
>>>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>>>> when the local is fully idle and busy group has more than 2 tasks.
>>>>> Local group should try to pull at least 1 task in this case.
>>>>>
>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 18 ++++++++++++++++--
>>>>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index a87988327f88..566686c5f2bd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>>>>  	imbalance /= ncores_local + ncores_busiest;
>>>>>  
>>>>>  	/* Take advantage of resource in an empty sched group */
>>>>> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
>>>>> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>>>>  	    busiest->sum_nr_running > 1)
>>>>>  		imbalance = 2;
>>>>>  
>>>>> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>  		break;
>>>>>  
>>>>>  	case group_smt_balance:
>>>>> +		/* no idle cpus on both groups handled by group_fully_busy below */
>>>>> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>>>> +			if (sgs->idle_cpus > busiest->idle_cpus)
>>>>> +				return false;
>>>>> +			if (sgs->idle_cpus < busiest->idle_cpus)
>>>>> +				return true;
>>>>> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>>>> +				return false;
>>>>> +			else
>>>>> +				return true;
>>>>> +		}
>>>>> +		goto fully_busy;
>>>>> +		break;
>>>>> +
>>>>>  	case group_fully_busy:
>>>>>  		/*
>>>>>  		 * Select the fully busy group with highest avg_load. In
>>>>> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>  		 * select the 1st one, except if @sg is composed of SMT
>>>>>  		 * siblings.
>>>>>  		 */
>>>>> -
>>>>> +fully_busy:
>>>>>  		if (sgs->avg_load < busiest->avg_load)
>>>>>  			return false;
>>>>>  
>>>>
>>>> Hi Tim, Peter. 
>>>>
>>>> group_smt_balance(cluster scheduling), patches are in tip/sched/core. I dont 
>>>> see this above patch there yet. Currently as is, this can cause function difference 
>>>> in SMT4 systems( such as Power10). 
>>>>
>>>> Can we please have the above patch as well in tip/sched/core?
>>>>
>>>> Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>>>
>>> Hi Peter,
>>>
>>> Just back from my long vacation.  Wonder if you have any comments on the above patch
>>> for fixing the SMT4 case?
>>>
>>> Tim
>>
>> Hi Tim, Peter. 
>>
>> are there any concerns with the above patch for fixing the SMT4 case. 
>> Currently the behavior is group_smt_balance is set for having even 2 tasks in 
>> SMT4, ideally it should be same as the group_has_spare. 
>>
>> The above patch copies the same behavior to group_smt_balance. 
>>>
> 
> You mean simplify the patch as below?  I think that should be fine.  Can you
> make sure it works for SMT4?  And I can update the patch once you confirm it
> works properly.
> 

This looks fine; likely better as it would avoid duplication. A few nits below.


> Tim
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e7ee2efc1ba..48e9ab7f8a87 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  
>         case group_smt_balance:
>                 /* no idle cpus on both groups handled by group_fully_busy below */

Please add a comment here explaining the fall-through and spare logic.

> -               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> -                       if (sgs->idle_cpus > busiest->idle_cpus)
> -                               return false;
> -                       if (sgs->idle_cpus < busiest->idle_cpus)
> -                               return true;
> -                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
> -                               return false;
> -                       else
> -                               return true;
> -               }
> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
> +                       goto has_spare;
> +
>                 goto fully_busy;

This can fall through without the additional goto statement no?

>  
>         case group_fully_busy:
> @@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                  * as we do not want to pull task off SMT core with one task
>                  * and make the core idle.
>                  */
> +has_spare:
>                 if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>                         if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
>                                 return false;
> 
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-06  9:29                             ` Shrikanth Hegde
@ 2023-09-06 15:42                               ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-09-06 15:42 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen, peterz

On Wed, 2023-09-06 at 14:59 +0530, Shrikanth Hegde wrote:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e7ee2efc1ba..48e9ab7f8a87 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  
> >         case group_smt_balance:
> >                 /* no idle cpus on both groups handled by group_fully_busy below */
> 
> Please add a comment here explaining the fall-through and spare logic.
> 

Sure.


> > -               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > -                       if (sgs->idle_cpus > busiest->idle_cpus)
> > -                               return false;
> > -                       if (sgs->idle_cpus < busiest->idle_cpus)
> > -                               return true;
> > -                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > -                               return false;
> > -                       else
> > -                               return true;
> > -               }
> > +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
> > +                       goto has_spare;
> > +
> >                 goto fully_busy;
> 
> This can fall through without the additional goto statement no?
> 

There is an unconditional goto fully_busy, so it won't fall through and
the compiler won't complain.

> >  
> >         case group_fully_busy:
> > @@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >                  * as we do not want to pull task off SMT core with one task
> >                  * and make the core idle.
> >                  */
> > +has_spare:
> >                 if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> >                         if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> >                                 return false;
> > 
> > 
> > 
Tim



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-06  8:23                         ` Peter Zijlstra
@ 2023-09-06 15:45                           ` Tim Chen
  0 siblings, 0 replies; 55+ messages in thread
From: Tim Chen @ 2023-09-06 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shrikanth Hegde, bristot, bsegall, dietmar.eggemann, hdanton,
	ionela.voinescu, juri.lelli, len.brown, linux-kernel, mgorman,
	naveen.n.rao, rafael.j.wysocki, ravi.v.shankar, ricardo.neri,
	rostedt, srikar, srinivas.pandruvada, v-songbaohua,
	vincent.guittot, vschneid, x86, yangyicong, yu.c.chen

On Wed, 2023-09-06 at 10:23 +0200, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 10:54:09AM -0700, Tim Chen wrote:
> 
> > > > +		goto fully_busy;
> > > > +		break;
> > > 
> > > This is really daft; why can't this simply be: fallthrough; ? At the
> > > very least that break must go.
> > > 
> > > 
> > 
> > Yes, the break should go.  I was adding the goto to prevent the compiler
> > from complaining about fall-through code.
> 
> But that's what we have the fallthrough keyword for, no?

Okay.  I will update the patch to use fallthrough once Shrikanth has
had a chance to test the change that uses the has_spare path for SMT4.
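
For reference, a minimal standalone sketch of the fallthrough pseudo-keyword
being discussed; in the kernel it comes from <linux/compiler_attributes.h>,
and the local #define below is an assumption added only so the example builds
on its own (it mirrors the usual attribute form, requiring GCC >= 7 or a
recent Clang):

#include <stdio.h>

#ifndef fallthrough
#define fallthrough __attribute__((__fallthrough__))
#endif

static int classify(int n)
{
        int score = 0;

        switch (n) {
        case 2:
                score += 1;     /* deliberately continue into case 1 */
                fallthrough;
        case 1:
                score += 1;
                break;
        default:
                break;
        }

        return score;
}

int main(void)
{
        printf("%d %d\n", classify(2), classify(1));    /* prints "2 1" */
        return 0;
}

With -Wimplicit-fallthrough enabled, the annotation documents the deliberate
fall-through from case 2 into case 1 and silences the warning.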

Tim



* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-05 18:37                           ` Tim Chen
  2023-09-06  9:29                             ` Shrikanth Hegde
@ 2023-09-07  8:58                             ` Shrikanth Hegde
  2023-09-07 17:42                               ` Tim Chen
  1 sibling, 1 reply; 55+ messages in thread
From: Shrikanth Hegde @ 2023-09-07  8:58 UTC (permalink / raw)
  To: Tim Chen
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen, Peter Zijlstra



On 9/6/23 12:07 AM, Tim Chen wrote:
> On Tue, 2023-09-05 at 13:33 +0530, Shrikanth Hegde wrote:
>>
>> On 8/22/23 12:49 AM, Tim Chen wrote:
>>> On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
>>>>>
>>>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> Date: Fri, 14 Jul 2023 16:09:30 -0700
>>>>> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
>>>>>
>>>>> For SMT4, any group with more than 2 tasks will be marked as
>>>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>>>> the busiest group as the group which has the least number of idle_cpus.
>>>>>
>>>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>>>> when the local is fully idle and busy group has more than 2 tasks.
>>>>> Local group should try to pull at least 1 task in this case.
>>>>>
>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 18 ++++++++++++++++--
>>>>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index a87988327f88..566686c5f2bd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>>>>  	imbalance /= ncores_local + ncores_busiest;
>>>>>  
>>>>>  	/* Take advantage of resource in an empty sched group */
>>>>> -	if (imbalance == 0 && local->sum_nr_running == 0 &&
>>>>> +	if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>>>>  	    busiest->sum_nr_running > 1)
>>>>>  		imbalance = 2;
>>>>>  
>>>>> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>  		break;
>>>>>  
>>>>>  	case group_smt_balance:
>>>>> +		/* no idle cpus on both groups handled by group_fully_busy below */
>>>>> +		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>>>> +			if (sgs->idle_cpus > busiest->idle_cpus)
>>>>> +				return false;
>>>>> +			if (sgs->idle_cpus < busiest->idle_cpus)
>>>>> +				return true;
>>>>> +			if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>>>> +				return false;
>>>>> +			else
>>>>> +				return true;
>>>>> +		}
>>>>> +		goto fully_busy;
>>>>> +		break;
>>>>> +
>>>>>  	case group_fully_busy:
>>>>>  		/*
>>>>>  		 * Select the fully busy group with highest avg_load. In
>>>>> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>  		 * select the 1st one, except if @sg is composed of SMT
>>>>>  		 * siblings.
>>>>>  		 */
>>>>> -
>>>>> +fully_busy:
>>>>>  		if (sgs->avg_load < busiest->avg_load)
>>>>>  			return false;
>>>>>  
>>>>
>>>> Hi Tim, Peter. 
>>>>
>>>> The group_smt_balance (cluster scheduling) patches are in tip/sched/core, but I don't
>>>> see the above patch there yet. As is, this can cause a functional difference
>>>> on SMT4 systems (such as Power10).
>>>>
>>>> Can we please have the above patch as well in tip/sched/core?
>>>>
>>>> Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>>>
>>> Hi Peter,
>>>
>>> Just back from my long vacation.  Wonder if you have any comments on the above patch
>>> for fixing the SMT4 case?
>>>
>>> Tim
>>
>> Hi Tim, Peter. 
>>
>> Are there any concerns with the above patch for fixing the SMT4 case?
>> Currently group_smt_balance is set even for a group with just 2 tasks on
>> SMT4, when ideally it should behave the same as group_has_spare.
>>
>> The above patch copies that behavior to group_smt_balance.
>>>
> 
> You mean simplify the patch as below?  I think that should be fine.  Can you
> make sure it works for SMT4?  And I can update the patch once you confirm it
> works properly.
> 
> Tim
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e7ee2efc1ba..48e9ab7f8a87 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  
>         case group_smt_balance:
>                 /* no idle cpus on both groups handled by group_fully_busy below */
> -               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> -                       if (sgs->idle_cpus > busiest->idle_cpus)
> -                               return false;
> -                       if (sgs->idle_cpus < busiest->idle_cpus)
> -                               return true;
> -                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
> -                               return false;
> -                       else
> -                               return true;
> -               }
> +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
> +                       goto has_spare;
> +
>                 goto fully_busy;
>  
>         case group_fully_busy:
> @@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>                  * as we do not want to pull task off SMT core with one task
>                  * and make the core idle.
>                  */
> +has_spare:
>                 if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>                         if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
>                                 return false;
> 
> 
> 

Hi Tim,

In case you were waiting for my reply, as I gathered from the other email:
the above change looks fine as well. It avoids duplicating the
group_smt_balance handling code.
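
To spell out the handling now being shared, here is a minimal standalone
sketch (simplified names and types, not the kernel structs; pick_as_busiest()
and struct stats are invented for illustration) of the tie-break that the
removed group_smt_balance block duplicated and that the has_spare path keeps:
prefer the group with fewer idle CPUs, and on a tie the one running more
tasks.

#include <stdbool.h>
#include <stdio.h>

struct stats {
        unsigned int idle_cpus;
        unsigned int sum_nr_running;
};

/* mirrors the removed lines: fewer idle CPUs wins, then more running tasks */
static bool pick_as_busiest(const struct stats *sg, const struct stats *busiest)
{
        if (sg->idle_cpus > busiest->idle_cpus)
                return false;
        if (sg->idle_cpus < busiest->idle_cpus)
                return true;
        return sg->sum_nr_running > busiest->sum_nr_running;
}

int main(void)
{
        struct stats a = { .idle_cpus = 1, .sum_nr_running = 3 };
        struct stats b = { .idle_cpus = 2, .sum_nr_running = 2 };

        printf("%d\n", pick_as_busiest(&a, &b));        /* 1: a has fewer idle CPUs */
        return 0;
}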

Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>


* Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
  2023-09-07  8:58                             ` Shrikanth Hegde
@ 2023-09-07 17:42                               ` Tim Chen
  2023-09-12 10:29                                 ` [tip: sched/urgent] sched/fair: Fix " tip-bot2 for Tim Chen
  2023-09-13 13:11                                 ` tip-bot2 for Tim Chen
  0 siblings, 2 replies; 55+ messages in thread
From: Tim Chen @ 2023-09-07 17:42 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: bristot, bsegall, dietmar.eggemann, hdanton, ionela.voinescu,
	juri.lelli, len.brown, linux-kernel, mgorman, naveen.n.rao,
	rafael.j.wysocki, ravi.v.shankar, ricardo.neri, rostedt, srikar,
	srinivas.pandruvada, v-songbaohua, vincent.guittot, vschneid,
	x86, yangyicong, yu.c.chen

On Thu, 2023-09-07 at 14:28 +0530, Shrikanth Hegde wrote:
> > 
> > You mean simplify the patch as below?  I think that should be fine.  Can you
> > make sure it works for SMT4?  And I can update the patch once you confirm it
> > works properly.
> > 
> > Tim
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e7ee2efc1ba..48e9ab7f8a87 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  
> >         case group_smt_balance:
> >                 /* no idle cpus on both groups handled by group_fully_busy below */
> > -               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> > -                       if (sgs->idle_cpus > busiest->idle_cpus)
> > -                               return false;
> > -                       if (sgs->idle_cpus < busiest->idle_cpus)
> > -                               return true;
> > -                       if (sgs->sum_nr_running <= busiest->sum_nr_running)
> > -                               return false;
> > -                       else
> > -                               return true;
> > -               }
> > +               if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
> > +                       goto has_spare;
> > +
> >                 goto fully_busy;
> >  
> >         case group_fully_busy:
> > @@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >                  * as we do not want to pull task off SMT core with one task
> >                  * and make the core idle.
> >                  */
> > +has_spare:
> >                 if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> >                         if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> >                                 return false;
> > 
> > 
> > 
> 
> Hi Tim,
> 
> In case you were waiting for my reply, as I gathered from the other email:
> the above change looks fine as well. It avoids duplicating the
> group_smt_balance handling code.
> 
> Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Peter,

Here's the updated patch.  Please consider it for inclusion.

Thanks.

Tim

From 979e261fed6e3765316a4de794f595f93c02cef0 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Subject: [PATCH] sched/fair: Fix SMT4 group_smt_balance handling
To: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Ricardo Neri <ricardo.neri@intel.com>, Ravi V. Shankar <ravi.v.shankar@intel.com>, Ben Segall
<bsegall@google.com>, Daniel Bristot de Oliveira <bristot@redhat.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>, Rafael J. Wysocki
<rafael.j.wysocki@intel.com>, Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>, Steven Rostedt <rostedt@goodmis.org>, Tim Chen <tim.c.chen@linux.intel.com>, Valentin Schneider
<vschneid@redhat.com>, Ionela Voinescu <ionela.voinescu@arm.com>, x86@kernel.org, linux-kernel@vger.kernel.org, Shrikanth Hegde <sshegde@linux.vnet.ibm.com>, Srikar Dronamraju
<srikar@linux.vnet.ibm.com>, naveen.n.rao@linux.vnet.ibm.com, Yicong Yang <yangyicong@hisilicon.com>, Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>, Hillf Danton <hdanton@sina.com>

For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.

Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..fd9e594b5623 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9575,7 +9575,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9763,6 +9763,15 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/*
+		 * Check if we have spare CPUs on either SMT group to
+		 * choose has spare or fully busy handling.
+		 */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
+			goto has_spare;
+
+		fallthrough;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9802,6 +9811,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			else
 				return true;
 		}
+has_spare:
 
 		/*
 		 * Select not overloaded group with lowest number of idle cpus
-- 
2.32.0
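
To illustrate the rounding case the patch above targets, here is a minimal
standalone sketch.  The normalization in imbalance_sketch() (core-weighted
task difference, with the divisor added in so the division does not truncate
the result to zero) is an assumption made for illustration only, not a
verbatim copy of sibling_imbalance().

#include <stdio.h>

static long imbalance_sketch(long ncores_local, long ncores_busiest,
                             long local_running, long busiest_running,
                             int relaxed_check)
{
        long imbalance = ncores_local * busiest_running -
                         ncores_busiest * local_running;

        if (imbalance < 0)
                imbalance = 0;

        /* normalize by total cores, adding the divisor as a rounding term */
        imbalance = 2 * imbalance + ncores_local + ncores_busiest;
        imbalance /= ncores_local + ncores_busiest;

        /* old check: '== 0'; the patch relaxes it to '<= 1' */
        if ((relaxed_check ? imbalance <= 1 : imbalance == 0) &&
            local_running == 0 && busiest_running > 1)
                imbalance = 2;

        return imbalance;
}

int main(void)
{
        /* fully idle 1-core local group vs. a 4-core group running 2 tasks */
        printf("old: %ld  new: %ld\n",
               imbalance_sketch(1, 4, 0, 2, 0),
               imbalance_sketch(1, 4, 0, 2, 1));        /* old: 1  new: 2 */
        return 0;
}

In the sketch, a fully idle one-core group facing a four-core group with two
tasks normalizes to 1, so the old '== 0' test never fires; the relaxed '<= 1'
test bumps the imbalance to 2 and the local group pulls at least one task.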




* [tip: sched/urgent] sched/fair: Fix SMT4 group_smt_balance handling
  2023-09-07 17:42                               ` Tim Chen
@ 2023-09-12 10:29                                 ` tip-bot2 for Tim Chen
  2023-09-13 13:11                                 ` tip-bot2 for Tim Chen
  1 sibling, 0 replies; 55+ messages in thread
From: tip-bot2 for Tim Chen @ 2023-09-12 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Shrikanth Hegde, Tim Chen, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     ad468232c3eb1dab163672f98a1ab2363be7981e
Gitweb:        https://git.kernel.org/tip/ad468232c3eb1dab163672f98a1ab2363be7981e
Author:        Tim Chen <tim.c.chen@linux.intel.com>
AuthorDate:    Thu, 07 Sep 2023 10:42:21 -07:00
Committer:     root <root@noisy.programming.kicks-ass.net>
CommitterDate: Sat, 09 Sep 2023 15:10:10 +02:00

sched/fair: Fix SMT4 group_smt_balance handling

For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.

Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33a2b6b..cb22592 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9580,7 +9580,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9768,6 +9768,15 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/*
+		 * Check if we have spare CPUs on either SMT group to
+		 * choose has spare or fully busy handling.
+		 */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
+			goto has_spare;
+
+		fallthrough;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9807,6 +9816,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			else
 				return true;
 		}
+has_spare:
 
 		/*
 		 * Select not overloaded group with lowest number of idle cpus


* [tip: sched/urgent] sched/fair: Fix SMT4 group_smt_balance handling
  2023-09-07 17:42                               ` Tim Chen
  2023-09-12 10:29                                 ` [tip: sched/urgent] sched/fair: Fix " tip-bot2 for Tim Chen
@ 2023-09-13 13:11                                 ` tip-bot2 for Tim Chen
  1 sibling, 0 replies; 55+ messages in thread
From: tip-bot2 for Tim Chen @ 2023-09-13 13:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Shrikanth Hegde, Tim Chen, Peter Zijlstra (Intel),
	Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     450e749707bc1755f22b505d9cd942d4869dc535
Gitweb:        https://git.kernel.org/tip/450e749707bc1755f22b505d9cd942d4869dc535
Author:        Tim Chen <tim.c.chen@linux.intel.com>
AuthorDate:    Thu, 07 Sep 2023 10:42:21 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Sep 2023 15:03:06 +02:00

sched/fair: Fix SMT4 group_smt_balance handling

For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.

Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33a2b6b..cb22592 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9580,7 +9580,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;
 
@@ -9768,6 +9768,15 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_smt_balance:
+		/*
+		 * Check if we have spare CPUs on either SMT group to
+		 * choose has spare or fully busy handling.
+		 */
+		if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
+			goto has_spare;
+
+		fallthrough;
+
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9807,6 +9816,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			else
 				return true;
 		}
+has_spare:
 
 		/*
 		 * Select not overloaded group with lowest number of idle cpus


Thread overview: 55+ messages
2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
2023-07-14 13:06   ` Shrikanth Hegde
2023-07-14 23:05     ` Tim Chen
2023-07-15 18:25       ` Tim Chen
2023-07-16 19:36       ` Shrikanth Hegde
2023-07-17 11:10         ` Peter Zijlstra
2023-07-17 12:18           ` Shrikanth Hegde
2023-07-17 13:37             ` Peter Zijlstra
2023-07-17 14:58               ` [PATCH] sched/fair: Add SMT4 group_smt_balance handling Shrikanth Hegde
2023-07-27  3:11                 ` Tim Chen
2023-07-27 13:32                   ` Tim Chen
2023-08-07  9:36                     ` Shrikanth Hegde
2023-08-21 19:19                       ` Tim Chen
2023-09-05  8:03                         ` Shrikanth Hegde
2023-09-05  9:49                           ` Peter Zijlstra
2023-09-05 18:37                           ` Tim Chen
2023-09-06  9:29                             ` Shrikanth Hegde
2023-09-06 15:42                               ` Tim Chen
2023-09-07  8:58                             ` Shrikanth Hegde
2023-09-07 17:42                               ` Tim Chen
2023-09-12 10:29                                 ` [tip: sched/urgent] sched/fair: Fix " tip-bot2 for Tim Chen
2023-09-13 13:11                                 ` tip-bot2 for Tim Chen
2023-09-05 10:38                         ` [PATCH] sched/fair: Add " Peter Zijlstra
2023-09-05 10:41                     ` Peter Zijlstra
2023-09-05 17:54                       ` Tim Chen
2023-09-06  8:23                         ` Peter Zijlstra
2023-09-06 15:45                           ` Tim Chen
2023-07-18  6:07       ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tobias Huschle
2023-07-18 14:52         ` Shrikanth Hegde
2023-07-19  8:14           ` Tobias Huschle
2023-07-14 14:53   ` Tobias Huschle
2023-07-14 23:29     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
2023-07-10 20:33   ` Valentin Schneider
2023-07-10 22:13     ` Tim Chen
2023-07-12  9:27       ` Valentin Schneider
2023-07-10 22:40   ` Tim Chen
2023-07-11 11:31     ` Peter Zijlstra
2023-07-11 16:32       ` Tim Chen
2023-07-07 22:57 ` [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups Tim Chen
2023-07-14 13:14   ` Shrikanth Hegde
2023-07-14 14:22     ` Tobias Huschle
2023-07-14 23:35       ` Tim Chen
2023-07-14 20:44     ` Tim Chen
2023-07-14 23:23       ` Tim Chen
2023-07-15  0:11     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance Tim Chen
2023-07-14 13:02   ` Shrikanth Hegde
2023-07-14 22:16     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU Tim Chen
2023-07-08 12:31   ` Peter Zijlstra
2023-07-10 16:13     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 6/6] sched/debug: Dump domains' sched group flags Tim Chen
2023-07-10 20:33   ` Valentin Schneider
