From: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Ricardo Neri <ricardo.neri@intel.com>,
"Ravi V . Shankar" <ravi.v.shankar@intel.com>,
Ben Segall <bsegall@google.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
Steven Rostedt <rostedt@goodmis.org>,
Valentin Schneider <vschneid@redhat.com>,
Ionela Voinescu <ionela.voinescu@arm.com>,
x86@kernel.org, linux-kernel@vger.kernel.org,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
naveen.n.rao@linux.vnet.ibm.com,
Yicong Yang <yangyicong@hisilicon.com>,
Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>,
Hillf Danton <hdanton@sina.com>,
Peter Zijlstra <peterz@infradead.org>,
shrikanth hegde <sshegde@linux.vnet.ibm.com>
Subject: Re: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
Date: Fri, 14 Jul 2023 18:32:51 +0530 [thread overview]
Message-ID: <29f8bc23-9b17-0a21-baa5-3eee1322f8f0@linux.vnet.ibm.com> (raw)
In-Reply-To: <807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com>
On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
>
> should_we_balance() traverses the group_balance_mask (AND'ed with
> lb_env::cpus) starting from lower-numbered CPUs, looking for the first
> idle CPU.
>
> In hybrid x86 systems, the siblings of SMT cores get CPU numbers before
> non-SMT cores:
>
> [0, 1]  [2, 3]  [4, 5]   6   7   8   9
>  b  i    b  i    b  i    b   i   i   i
>
> In the figure above, CPUs in brackets are siblings of an SMT core. The
> rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
> idle CPU.
>
> We should let a CPU on a fully idle core get the first chance to do
> idle load balancing, as it has more CPU capacity than an idle SMT CPU
> whose sibling is busy. So for the figure above, if we are running
> should_we_balance() for CPU 1, we should return false to let CPU 7,
> which sits on a fully idle core, get the first chance to idle load
> balance.
>
> A partially busy (i.e., of type group_has_spare) local group with SMT
> cores will often have only one SMT sibling busy. If the destination
> CPU is a non-SMT core, then partially busy, lower-numbered SMT cores
> should not be considered when finding the first idle CPU.
>
> However, in should_we_balance(), when we encounter an idle SMT CPU
> first in a partially busy core, we prematurely break the search for
> the first idle CPU.
>
> Higher-numbered, non-SMT cores are not given the chance to have idle
> balance done on their behalf. Those CPUs will only be considered for
> idle balancing by chance via CPU_NEWLY_IDLE.
>
> Instead, consider the idle state of the whole SMT core.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f491b94908bf..294a662c9410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
> static int should_we_balance(struct lb_env *env)
> {
> struct sched_group *sg = env->sd->groups;
> - int cpu;
> + int cpu, idle_smt = -1;
>
> /*
> * Ensure the balancing environment is consistent; can happen
> @@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
> if (!idle_cpu(cpu))
> continue;
>
> + /*
> + * Don't balance to idle SMT in busy core right away when
> + * balancing cores, but remember the first idle SMT CPU for
> + * later consideration. Find CPU on an idle core first.
> + */
> + if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> + if (idle_smt == -1)
> + idle_smt = cpu;
> + continue;
> + }
> +
> /* Are we the first idle CPU? */
> return cpu == env->dst_cpu;
> }
>
> + if (idle_smt == env->dst_cpu)
> + return true;
This is nice. It reduces migrations and slightly improves the
performance of CPU-oriented benchmarks, likely because of the fewer
migrations.

Tested a bit on power10 with SMT=4. Offlined CPUs to make a few cores
SMT=1. No regression was observed, and there was a slight improvement
in throughput-oriented workloads such as stress-ng. Migrations are
reduced by quite a bit, likely due to this patch. The test results are
included below.
[4/6] sched/fair: Consider the idle state of the whole core for load balance
Test results:
# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-17,24,25,32-49,56-89
Off-line CPU(s) list:  18-23,26-31,50-55,90-95
Model name:            POWER10 (architected), altivec supported
--------------------------------------------------------------------------------------------------
baseline:
Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):
260,813.13 msec task-clock # 33.390 CPUs utilized ( +- 0.10% )
42,535 context-switches # 163.543 /sec ( +- 0.13% )
9,060 cpu-migrations # 34.835 /sec ( +- 1.07% )
12,947 page-faults # 49.780 /sec ( +- 1.76% )
948,061,954,432 cycles # 3.645 GHz ( +- 0.09% )
926,045,701,578 instructions # 0.98 insn per cycle ( +- 0.00% )
146,418,075,496 branches # 562.964 M/sec ( +- 0.00% )
1,197,661,965 branch-misses # 0.82% of all branches ( +- 0.17% )
7.8111 +- 0.0162 seconds time elapsed ( +- 0.21% )
Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):
253,351.70 msec task-clock # 28.207 CPUs utilized ( +- 0.21% )
41,046 context-switches # 162.828 /sec ( +- 0.16% )
6,674 cpu-migrations # 26.475 /sec ( +- 3.42% )
10,879 page-faults # 43.157 /sec ( +- 1.68% )
931,014,218,983 cycles # 3.693 GHz ( +- 0.22% )
919,717,564,454 instructions # 0.99 insn per cycle ( +- 0.00% )
145,480,596,331 branches # 577.116 M/sec ( +- 0.00% )
1,175,362,979 branch-misses # 0.81% of all branches ( +- 0.12% )
8.9818 +- 0.0288 seconds time elapsed ( +- 0.32% )
---------------------------------------------------------------------------------------------------
with patch:
Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):
254,652.01 msec task-clock # 33.449 CPUs utilized ( +- 0.11% )
40,970 context-switches # 160.974 /sec ( +- 0.10% )
5,397 cpu-migrations # 21.205 /sec ( +- 2.01% )
11,705 page-faults # 45.990 /sec ( +- 1.21% )
911,115,537,080 cycles # 3.580 GHz ( +- 0.11% )
925,635,958,489 instructions # 1.02 insn per cycle ( +- 0.00% )
146,450,995,164 branches # 575.416 M/sec ( +- 0.00% )
1,188,906,011 branch-misses # 0.81% of all branches ( +- 0.28% )
7.6132 +- 0.0381 seconds time elapsed ( +- 0.50% )
Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):
236,962.38 msec task-clock # 27.948 CPUs utilized ( +- 0.05% )
40,030 context-switches # 168.869 /sec ( +- 0.04% )
3,156 cpu-migrations # 13.314 /sec ( +- 1.37% )
9,448 page-faults # 39.857 /sec ( +- 1.72% )
856,444,937,794 cycles # 3.613 GHz ( +- 0.06% )
919,459,795,805 instructions # 1.07 insn per cycle ( +- 0.00% )
145,654,799,033 branches # 614.452 M/sec ( +- 0.00% )
1,177,464,719 branch-misses # 0.81% of all branches ( +- 0.23% )
8.4788 +- 0.0323 seconds time elapsed ( +- 0.38% )
--------------------------------------------------------------------------------------------------------
Tried on a symmetric system with all cores at SMT=4 as well. There was
a reduction in migrations there too. Didn't observe any major
regressions when microbenchmarks such as hackbench and stress-ng were
run alone.

So here is my tested-by.
Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> +
> /* Are we the first CPU of this group ? */
> return group_balance_cpu(sg) == env->dst_cpu;
> }
One doubt though: here a fully idle core would be chosen instead of the
first idle CPU in the group (if there is one). Since an idle SMT CPU
comes out of idle faster than a fully idle core, would latency
increase? Or does that concern mainly the wakeup path?
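
For reference, my reading of the whole-core idle check used above:
is_core_idle() walks the SMT mask of the CPU and requires every sibling
to pass idle_cpu(). Roughly (a sketch from my reading of
kernel/sched/fair.c, not copied verbatim):

static inline bool is_core_idle(int cpu)
{
#ifdef CONFIG_SCHED_SMT
	int sibling;

	/* Every SMT sibling of @cpu must be idle for the core to be idle */
	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
		if (cpu == sibling)
			continue;

		if (!idle_cpu(sibling))
			return false;
	}
#endif

	return true;
}

So the balance CPU chosen this way may be sitting in a deeper,
core-level idle state, which is what prompts the latency question.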