From: Tobias Huschle <huschle@linux.ibm.com>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Ricardo Neri <ricardo.neri@intel.com>,
"Ravi V . Shankar" <ravi.v.shankar@intel.com>,
Ben Segall <bsegall@google.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
Steven Rostedt <rostedt@goodmis.org>,
Valentin Schneider <vschneid@redhat.com>,
Ionela Voinescu <ionela.voinescu@arm.com>,
x86@kernel.org, linux-kernel@vger.kernel.org,
Shrikanth Hegde <sshegde@linux.vnet.ibm.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
naveen.n.rao@linux.vnet.ibm.com,
Yicong Yang <yangyicong@hisilicon.com>,
Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>,
Hillf Danton <hdanton@sina.com>
Subject: Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
Date: Fri, 14 Jul 2023 16:53:49 +0200
Message-ID: <d3a1621eae04f7671b2b430352b42a27@linux.ibm.com>
In-Reply-To: <e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com>
On 2023-07-08 00:57, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
>
> On hybrid CPUs with scheduling clusters enabled, we will need to
> consider balancing between SMT CPU clusters and Atom core clusters.
>
> Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
> Each scheduling cluster spans an L2 cache.
>
> --L2-- --L2-- --L2-- --L2-- ----L2----- -----L2------
> [0, 1] [2, 3] [4, 5] [6, 7] [8 9 10 11] [12 13 14 15]
> Big    Big    Big    Big    Atom        Atom
> core   core   core   core   Module      Module
>
> If the busiest group is a big core with both SMT CPUs busy, we should
> do active load balance if the destination group has idle CPU cores.
> This condition is considered by asym_active_balance() in load
> balancing, but not when looking for the busiest group and computing
> the load imbalance. Add this consideration in find_busiest_group()
> and calculate_imbalance().
>
> In addition, update the logic determining the busier group when one
> group is SMT and the other is non-SMT, but both groups are partially
> busy with idle CPUs. The busier group should be the group with idle
> cores rather than the group with one busy SMT CPU. We do not want to
> make the SMT group the busiest one, pull the only task off the SMT
> CPU, and cause the whole core to go empty.
>
> Otherwise, suppose that in the search for the busiest group we first
> encounter an SMT group with 1 task and set it as the busiest. The
> destination group is an atom cluster with 1 task, and we next
> encounter an atom cluster group with 3 tasks. We will not pick this
> atom cluster over the SMT group, even though we should. As a result,
> we do not load balance the busier atom cluster (with 3 tasks) towards
> the local atom cluster (with 1 task). And it does not make sense to
> pick the 1-task SMT group as the busier group, as we also should not
> pull the task off the SMT CPU towards the 1-task atom cluster and
> make the SMT core completely empty.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 77 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87317634fab2..f636d6c09dc6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8279,6 +8279,11 @@ enum group_type {
>  	 * more powerful CPU.
>  	 */
>  	group_misfit_task,
> +	/*
> +	 * Balance SMT group that's fully busy. Can benefit from migrating
> +	 * a task on SMT with busy sibling to another CPU on idle core.
> +	 */
> +	group_smt_balance,
Would it make sense to move smt_balance, such that we get:

group_has_spare < group_smt_balance < group_fully_busy

Conceptually, this would be more intuitive to me, as the smt_balance
groups are more busy than has_spare ones, but less busy than
fully_busy ones.

From a functional perspective I could also see some impact when
update_sd_pick_busiest compares the group types. In that case we
would remove tasks from fully busy groups before moving them from
smt_balance groups. Not sure which way would be preferable to
increase overall throughput.

Since smt_balance is only selected if the group has SMT, this
should still not pull the last task off of a non-SMT CPU.
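For illustration, the reordered enum would look roughly like this (an
untested sketch against this patch's version of group_type; the
comments are mine):

enum group_type {
	/* The group has spare capacity to run more tasks. */
	group_has_spare = 0,
	/*
	 * Proposed position: busier than has_spare, but less busy than
	 * fully_busy. Since update_sd_pick_busiest() treats a higher
	 * group_type as busier, fully_busy groups would then be picked
	 * as busiest (and drained) before smt_balance ones.
	 */
	group_smt_balance,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
};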
>  	/*
>  	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
>  	 * and the task should be migrated to it instead of running on the
> @@ -8987,6 +8992,7 @@ struct sg_lb_stats {
>  	unsigned int group_weight;
>  	enum group_type group_type;
>  	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> +	unsigned int group_smt_balance; /* Task on busy SMT should be moved */
>  	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int nr_numa_running;
> @@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
>  	if (sgs->group_asym_packing)
>  		return group_asym_packing;
>
> +	if (sgs->group_smt_balance)
> +		return group_smt_balance;
> +
>  	if (sgs->group_misfit_task_load)
>  		return group_misfit_task;
>
> @@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
>  	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
>  }
>
> +/* One group has more than one SMT CPU while the other group does not */
> +static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> +					struct sched_group *sg2)
> +{
> +	if (!sg1 || !sg2)
> +		return false;
> +
> +	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
> +		(sg2->flags & SD_SHARE_CPUCAPACITY);
> +}
> +
> +static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	if (env->idle == CPU_NOT_IDLE)
> +		return false;
> +
> +	/*
> +	 * For SMT source group, it is better to move a task
> +	 * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
> +	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> +	 * will not be on.
> +	 */
> +	if (group->flags & SD_SHARE_CPUCAPACITY &&
> +	    sgs->sum_h_nr_running > 1)
> +		return true;
> +
> +	return false;
> +}
> +
>  static inline bool
>  sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>  {
> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->group_asym_packing = 1;
>  	}
>
> +	/* Check for loaded SMT group to be balanced to dst CPU */
> +	if (!local_group && smt_balance(env, sgs, group))
> +		sgs->group_smt_balance = 1;
> +
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
>  	/* Computing avg_load makes sense only when group is overloaded */
> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			return false;
>  		break;
>
> +	case group_smt_balance:
>  	case group_fully_busy:
>  		/*
>  		 * Select the fully busy group with highest avg_load. In
> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
>
>
>  	case group_has_spare:
> +		/*
> +		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
> +		 * as we do not want to pull task off SMT core with one task
> +		 * and make the core idle.
> +		 */
> +		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> +			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> +				return false;
> +			else
> +				return true;
> +		}
> +
>  		/*
>  		 * Select not overloaded group with lowest number of idle cpus
>  		 * and highest number of running tasks. We could also compare
> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those types are not used in the slow wakeup path */
>  		return false;
>
> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
>  	case group_imbalanced:
>  	case group_asym_packing:
> +	case group_smt_balance:
>  		/* Those type are not used in the slow wakeup path */
>  		return NULL;
>
> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>  		return;
>  	}
>
> +	if (busiest->group_type == group_smt_balance) {
> +		/* Reduce number of tasks sharing CPU capacity */
> +		env->migration_type = migrate_task;
> +		env->imbalance = 1;
> +		return;
> +	}
> +
>  	if (busiest->group_type == group_imbalanced) {
>  		/*
>  		 * In the group_imb case we cannot rely on group-wide averages
> @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto force_balance;
>
>  	if (busiest->group_type != group_overloaded) {
> -		if (env->idle == CPU_NOT_IDLE)
> +		if (env->idle == CPU_NOT_IDLE) {
>  			/*
>  			 * If the busiest group is not overloaded (and as a
>  			 * result the local one too) but this CPU is already
>  			 * busy, let another idle CPU try to pull task.
>  			 */
>  			goto out_balanced;
> +		}
> +
> +		if (busiest->group_type == group_smt_balance &&
> +		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> +			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
> +			goto force_balance;
> +		}
>
>  		if (busiest->group_weight > 1 &&
> -		    local->idle_cpus <= (busiest->idle_cpus + 1))
> +		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
>  			/*
>  			 * If the busiest group is not overloaded
>  			 * and there is no imbalance between this and busiest
> @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  			 * there is more than 1 CPU per group.
>  			 */
>  			goto out_balanced;
> +		}
>
> -		if (busiest->sum_h_nr_running == 1)
> +		if (busiest->sum_h_nr_running == 1) {
>  			/*
>  			 * busiest doesn't have any tasks waiting to run
>  			 */
>  			goto out_balanced;
> +		}
>  	}
>
>  force_balance: