From: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
To: Tobias Huschle <huschle@linux.ibm.com>,
Tim Chen <tim.c.chen@linux.intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Ricardo Neri <ricardo.neri@intel.com>,
"Ravi V . Shankar" <ravi.v.shankar@intel.com>,
Ben Segall <bsegall@google.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
Steven Rostedt <rostedt@goodmis.org>,
Valentin Schneider <vschneid@redhat.com>,
Ionela Voinescu <ionela.voinescu@arm.com>,
x86@kernel.org, linux-kernel@vger.kernel.org,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
naveen.n.rao@linux.vnet.ibm.com,
Yicong Yang <yangyicong@hisilicon.com>,
Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>,
Hillf Danton <hdanton@sina.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
Date: Tue, 18 Jul 2023 20:22:24 +0530
Message-ID: <c15b1322-074d-b400-bc7b-2a0031862fdd@linux.vnet.ibm.com>
In-Reply-To: <1109f49642faf60438717eb7716b324f@linux.ibm.com>
On 7/18/23 11:37 AM, Tobias Huschle wrote:
> On 2023-07-15 01:05, Tim Chen wrote:
>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>
>>>
>>>
>>> If we consider symmetric platforms which have SMT4 such as power10.
>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>
>>>
>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>> [--------------MC------------------------]
>>>
>>> In case of SMT4, if there is any group which has 2 or more tasks, that
>>> group will be marked as group_smt_balance. previously, if that group
>>> had 2
>>> or 3 tasks, it would have been marked as group_has_spare. Since all
>>> the groups have
>>> SMT that means behavior would be same fully busy right? That can
>>> cause some
>>> corner cases. No?
>>
>> You raised a good point. I was looking from SMT2
>> perspective so group_smt_balance implies group_fully_busy.
>> That is no longer true for SMT4.
>>
>> I am thinking of the following fix on the current patch
>> to take care of SMT4. Do you think this addresses
>> concerns from you and Tobias?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 294a662c9410..3fc8d3a3bd22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct
>> lb_env *env,
>> break;
>>
>> case group_smt_balance:
>> + /* no idle cpus on both groups handled by
>> group_fully_busy below */
>> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>> + if (sgs->idle_cpus > busiest->idle_cpus)
>> + return false;
>> + if (sgs->idle_cpus < busiest->idle_cpus)
>> + return true;
>> + if (sgs->sum_nr_running <=
>> busiest_sum_nr_running)
>> + return false;
>> + else
>> + return true;
>> + }
>>
>>
>> I will be on vacation next three weeks so my response will be slow.
>>
>> Tim
>>
>
> What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.
>
> [0 1][2 3 4 5]
> [SMT][--SMT--]
>
> If now CPUs 0,2,3 have a running task, both groups would be classified as
> smt_balance. But if it comes to the selection of the busiest group, the
> smaller
> group would be selected, as it has less idle CPUs, right? Which could lead
> to the smaller group being left with no tasks.
> Using the absolute numbers of task is what made the prefer_sibling path
> problematic,
Yes, but I am not sure how realistic that configuration is. On Power10, we
typically have all cores in SMT1, SMT2, or SMT4, not in mixed configurations.
One can offline CPUs on an SMT4 system to get into that case.
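To make the concern concrete, here is a rough userspace sketch of the
idle_cpus tie-break from Tim's hunk above, applied to the [0 1][2 3 4 5]
example. struct sg_stats and pick_busiest_smt() are made-up names for
illustration, not kernel code, and the sketch assumes that both groups
reach this comparison as group_smt_balance, as the example states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the kernel's struct sg_lb_stats. */
struct sg_stats {
	unsigned int idle_cpus;
	unsigned int sum_nr_running;
};

/*
 * Models the tie-break in the proposed group_smt_balance hunk.
 * Returns true if the candidate (sgs) should replace the current busiest.
 * Assumption: the hunk's last comparison is meant to read
 * busiest->sum_nr_running.
 */
static bool pick_busiest_smt(const struct sg_stats *sgs,
			     const struct sg_stats *busiest)
{
	assert(sgs && busiest);

	if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
		/* More idle CPUs in absolute terms: not the busiest. */
		if (sgs->idle_cpus > busiest->idle_cpus)
			return false;
		/* Fewer idle CPUs in absolute terms: pick as busiest. */
		if (sgs->idle_cpus < busiest->idle_cpus)
			return true;
		/* Same idle count: the group with more tasks wins. */
		return sgs->sum_nr_running > busiest->sum_nr_running;
	}

	/*
	 * No idle CPUs on either side: the real code falls through to
	 * the group_fully_busy handling (avg_load), not modeled here.
	 */
	return false;
}
```

With the SMT2 group at idle_cpus = 1, sum_nr_running = 1 and the SMT4
group at idle_cpus = 2, sum_nr_running = 2, pick_busiest_smt(&smt2, &smt4)
returns true in this model: the smaller group is picked because the
comparison uses absolute idle CPU counts and ignores group size.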
> I would assume that the same holds true here. Therefore, I would prefer
> avg_load,
> or, similar to prefer_siblings, a ratio over the number of cores.
>
> I can't really test that on s390 as we always have SMT2. But, we can
> have these
> asymmetries on higher levels, e.g
IIUC, at higher levels the group will not have SD_SHARE_CPUCAPACITY set, so it
shouldn't run into group_smt_balance.
>
> [0 1][2 3][4 5][6 7][8 9]
> [SMT][SMT][SMT][SMT][SMT]
> [-----core----][--core--]
>
> For large configurations this can be true for even higher levels.
> Therefore, the idea was to move the smt_balance state around and adapt its
> conditions to something like this (which would require to reorder the
> commits):
>
> @@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
> enum group_type {
> /* The group has spare capacity that can be used to run more
> tasks. */
> group_has_spare = 0,
> + /*
> + * Balance SMT group that's fully busy. Can benefit from migration
> + * a task on SMT with busy sibling to another CPU on idle core.
> + */
> + group_smt_balance,
> /*
> * The group is fully used and the tasks don't compete for more CPU
> * cycles. Nevertheless, some tasks might wait before running.
> @@ -8340,11 +8345,6 @@ enum group_type {
> * more powerful CPU.
> */
> group_misfit_task,
> - /*
> - * Balance SMT group that's fully busy. Can benefit from migration
> - * a task on SMT with busy sibling to another CPU on idle core.
> - */
> - group_smt_balance,
> /*
> * SD_ASYM_PACKING only: One local CPU with higher capacity is
> available,
IIUC, for the cluster topology in this patch, an SMT group running 2 threads
should be picked as busiest over an Atom cluster running 4 threads. The Atom
cluster will be group_fully_busy, whereas the SMT group will be
group_smt_balance. For that to happen, group_smt_balance needs a higher
group_type value than group_fully_busy.
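A rough userspace sketch of the ordering argument. The two enums are
reduced copies of the order in this patch and of the order after the
suggested move. The P_/M_ prefixes and type_outranks() are made up for
illustration. The only kernel behavior assumed is that
update_sd_pick_busiest() first prefers the group with the higher
group_type.

```c
#include <assert.h>
#include <stdbool.h>

/* Order as in this patch, reduced to the types discussed here. */
enum group_type_patch {
	P_GROUP_HAS_SPARE = 0,
	P_GROUP_FULLY_BUSY,
	P_GROUP_MISFIT_TASK,
	P_GROUP_SMT_BALANCE,
};

/* Order after moving group_smt_balance below group_fully_busy. */
enum group_type_moved {
	M_GROUP_HAS_SPARE = 0,
	M_GROUP_SMT_BALANCE,
	M_GROUP_FULLY_BUSY,
	M_GROUP_MISFIT_TASK,
};

/*
 * Models the first comparison in update_sd_pick_busiest():
 * a candidate with a higher group_type replaces the current busiest.
 */
static bool type_outranks(int candidate_type, int busiest_type)
{
	return candidate_type > busiest_type;
}
```

In this model, type_outranks(P_GROUP_SMT_BALANCE, P_GROUP_FULLY_BUSY) is
true, so the SMT group with 2 threads can be picked over the fully busy
Atom cluster. With the moved order,
type_outranks(M_GROUP_SMT_BALANCE, M_GROUP_FULLY_BUSY) is false and the
Atom cluster would stay busiest.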
> * and the task should be migrated to it instead of running on the
> @@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int
> imbalance_pct,
> if (sgs->group_asym_packing)
> return group_asym_packing;
>
> - if (sgs->group_smt_balance)
> - return group_smt_balance;
> -
> if (sgs->group_misfit_task_load)
> return group_misfit_task;
>
> if (!group_has_capacity(imbalance_pct, sgs))
> return group_fully_busy;
>
> + if (sgs->group_smt_balance)
> + return group_smt_balance;
> +
> return group_has_spare;
> }
>
> @@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env *env,
> struct sg_lb_stats *sgs,
> * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> * will not be on.
> */
> - if (group->flags & SD_SHARE_CPUCAPACITY &&
> - sgs->sum_h_nr_running > 1)
> + if (sgs->sum_h_nr_running > group->cores)
On Power10, where we have SMT4, group->cores will be 1 for an SMT group, so I
don't see a difference here.
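A small userspace sketch of why the two conditions look equivalent to me
for a single-core SMT group. It models only the condition touched by the
hunk above, not the rest of smt_balance(). struct group_model, the helper
names and the SD_SHARE_CPUCAPACITY bit value are placeholders for
illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit value for this sketch, not the kernel's definition. */
#define SD_SHARE_CPUCAPACITY 0x1u

/* Hypothetical merge of the sched_group and sg_lb_stats fields used. */
struct group_model {
	unsigned int flags;
	unsigned int cores;
	unsigned int sum_h_nr_running;
};

/* Condition as in the current patch. */
static bool smt_balance_flag_check(const struct group_model *g)
{
	assert(g);
	return (g->flags & SD_SHARE_CPUCAPACITY) && g->sum_h_nr_running > 1;
}

/* Condition as in the suggested change. */
static bool smt_balance_cores_check(const struct group_model *g)
{
	assert(g);
	return g->sum_h_nr_running > g->cores;
}

/* Build a group that is one SMT core, e.g. one SMT4 core on Power10. */
static struct group_model smt_core_group(unsigned int nr_running)
{
	struct group_model g = {
		.flags = SD_SHARE_CPUCAPACITY,
		.cores = 1,
		.sum_h_nr_running = nr_running,
	};
	return g;
}
```

For an SMT4 group (flag set, cores = 1) both checks agree for 0 to 4
running tasks. In this model they diverge only for a group without the
flag that spans several cores, e.g. cores = 2 with 3 running tasks, where
only the cores-based check is true.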
> return true;
>
> return false;
>
> The s390 problem is currently solved by changing the prefer_sibling
> path. When
> disabling that flag, we might have an issue, will have to verify that
> though.
>
>>>
>>> One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is
>>> trying to do
>>> load balance. Previously imbalance would have been 2, instead now
>>> imbalance would be 1.
>>> But in subsequent lb it would be balanced.
>>>
>>>
>>>
>>> > + return false;
>>> > +}
>>> > +
>>> > static inline bool
>>> > sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>> > {
>>> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct
>>> lb_env *env,
>>> > sgs->group_asym_packing = 1;
>>> > }
>>> >
>>> > + /* Check for loaded SMT group to be balanced to dst CPU */
>>> > + if (!local_group && smt_balance(env, sgs, group))
>>> > + sgs->group_smt_balance = 1;
>>> > +
>>> > sgs->group_type = group_classify(env->sd->imbalance_pct,
>>> group, sgs);
>>> >
>>> > /* Computing avg_load makes sense only when group is
>>> overloaded */
>>> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> > return false;
>>> > break;
>>> >
>>> > + case group_smt_balance:
>>> > case group_fully_busy:
>>> > /*
>>> > * Select the fully busy group with highest avg_load. In
>>> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> > break;
>>> >
>>> > case group_has_spare:
>>> > + /*
>>> > + * Do not pick sg with SMT CPUs over sg with pure CPUs,
>>> > + * as we do not want to pull task off SMT core with one task
>>> > + * and make the core idle.
>>> > + */
>>> > + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>>> > + if (sg->flags & SD_SHARE_CPUCAPACITY &&
>>> sgs->sum_h_nr_running <= 1)
>>> > + return false;
>>> > + else
>>> > + return true;
>>> > + }
>>> > +
>>> > /*
>>> > * Select not overloaded group with lowest number of idle
>>> cpus
>>> > * and highest number of running tasks. We could also compare
>>> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct
>>> sched_group *idlest,
>>> >
>>> > case group_imbalanced:
>>> > case group_asym_packing:
>>> > + case group_smt_balance:
>>> > /* Those types are not used in the slow wakeup path */
>>> > return false;
>>> >
>>> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd,
>>> struct task_struct *p, int this_cpu)
>>> >
>>> > case group_imbalanced:
>>> > case group_asym_packing:
>>> > + case group_smt_balance:
>>> > /* Those type are not used in the slow wakeup path */
>>> > return NULL;
>>> >
>>> > @@ -10118,6 +10176,13 @@ static inline void
>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>> > return;
>>> > }
>>> >
>>> > + if (busiest->group_type == group_smt_balance) {
>>> > + /* Reduce number of tasks sharing CPU capacity */
>>> > + env->migration_type = migrate_task;
>>> > + env->imbalance = 1;
>>> > + return;
>>> > + }
>>> > +
>>> > if (busiest->group_type == group_imbalanced) {
>>> > /*
>>> > * In the group_imb case we cannot rely on group-wide
>>> averages
>>> > @@ -10363,16 +10428,23 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> > goto force_balance;
>>> >
>>> > if (busiest->group_type != group_overloaded) {
>>> > - if (env->idle == CPU_NOT_IDLE)
>>> > + if (env->idle == CPU_NOT_IDLE) {
>>> > /*
>>> > * If the busiest group is not overloaded (and as a
>>> > * result the local one too) but this CPU is already
>>> > * busy, let another idle CPU try to pull task.
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> > +
>>> > + if (busiest->group_type == group_smt_balance &&
>>> > + smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>>> > + /* Let non SMT CPU pull from SMT CPU sharing with
>>> sibling */
>>> > + goto force_balance;
>>> > + }
>>> >
>>> > if (busiest->group_weight > 1 &&
>>> > - local->idle_cpus <= (busiest->idle_cpus + 1))
>>> > + local->idle_cpus <= (busiest->idle_cpus + 1)) {
>>> > /*
>>> > * If the busiest group is not overloaded
>>> > * and there is no imbalance between this and busiest
>>> > @@ -10383,12 +10455,14 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> > * there is more than 1 CPU per group.
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> >
>>> > - if (busiest->sum_h_nr_running == 1)
>>> > + if (busiest->sum_h_nr_running == 1) {
>>> > /*
>>> > * busiest doesn't have any tasks waiting to run
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> > }
>>> >
>>> > force_balance: