linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ricardo Neri <ricardo.neri@intel.com>,
	"Ravi V . Shankar" <ravi.v.shankar@intel.com>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Ionela Voinescu <ionela.voinescu@arm.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	naveen.n.rao@linux.vnet.ibm.com,
	Yicong Yang <yangyicong@hisilicon.com>,
	Barry Song <v-songbaohua@oppo.com>, Chen Yu <yu.c.chen@intel.com>,
	Hillf Danton <hdanton@sina.com>,
	Peter Zijlstra <peterz@infradead.org>,
	shrikanth hegde <sshegde@linux.vnet.ibm.com>
Subject: Re: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
Date: Fri, 14 Jul 2023 18:32:51 +0530	[thread overview]
Message-ID: <29f8bc23-9b17-0a21-baa5-3eee1322f8f0@linux.vnet.ibm.com> (raw)
In-Reply-To: <807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com>



On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> 
> should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
> cpus) starting from lower numbered CPUs looking for the first idle CPU.
> 
> In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
> non-SMT cores:
> 
> 	[0, 1] [2, 3] [4, 5] 6 7 8 9
>          b  i   b  i   b  i  b i i i
> 
> In the figure above, CPUs in brackets are siblings of an SMT core. The
> rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
> idle CPU.
> 
> We should let a CPU on a fully idle core get the first chance to idle
> load balance as it has more CPU capacity than a CPU on an idle SMT
> CPU with busy sibling.  So for the figure above, if we are running
> should_we_balance() to CPU 1, we should return false to let CPU 7 on
> idle core to have a chance first to idle load balance.
> 
> A partially busy (i.e., of type group_has_spare) local group with SMT 
> cores will often have only one SMT sibling busy. If the destination CPU
> is a non-SMT core, partially busy, lower-numbered, SMT cores should not
> be considered when finding the first idle CPU. 
> 
> However, in should_we_balance(), when we encounter idle SMT first in partially
> busy core, we prematurely break the search for the first idle CPU.
> 
> Higher-numbered, non-SMT cores is not given the chance to have
> idle balance done on their behalf. Those CPUs will only be considered
> for idle balancing by chance via CPU_NEWLY_IDLE.
> 
> Instead, consider the idle state of the whole SMT core.
> 
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f491b94908bf..294a662c9410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
>  static int should_we_balance(struct lb_env *env)
>  {
>  	struct sched_group *sg = env->sd->groups;
> -	int cpu;
> +	int cpu, idle_smt = -1;
> 
>  	/*
>  	 * Ensure the balancing environment is consistent; can happen
> @@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
>  		if (!idle_cpu(cpu))
>  			continue;
> 
> +		/*
> +		 * Don't balance to idle SMT in busy core right away when
> +		 * balancing cores, but remember the first idle SMT CPU for
> +		 * later consideration.  Find CPU on an idle core first.
> +		 */
> +		if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> +			if (idle_smt == -1)
> +				idle_smt = cpu;
> +			continue;
> +		}
> +
>  		/* Are we the first idle CPU? */
>  		return cpu == env->dst_cpu;
>  	}
> 
> +	if (idle_smt == env->dst_cpu)
> +		return true;


This is nice. It helps in reducing the migrations and improve the performance 
of CPU oriented benchmarks slightly. This could be due to less migrations. 

Tested a bit on power10 with SMT=4. Offlined a CPU to make a few cores SMT=1. There is 
no regression observed. Slight improvement in throughput oriented workload such as stress-ng. 
migrations are reduced by quite a bit, likely due to patch. I have attached the test results there.
[4/6] sched/fair: Consider the idle state of the whole core for load balance

Test results: 

# lscpu
Architecture:            ppc64le
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-17,24,25,32-49,56-89
  Off-line CPU(s) list:  18-23,26-31,50-55,90-95
Model name:              POWER10 (architected), altivec supported

--------------------------------------------------------------------------------------------------
baseline:

 Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        260,813.13 msec task-clock                #   33.390 CPUs utilized            ( +-  0.10% )
            42,535      context-switches          #  163.543 /sec                     ( +-  0.13% )
             9,060      cpu-migrations            #   34.835 /sec                     ( +-  1.07% )
            12,947      page-faults               #   49.780 /sec                     ( +-  1.76% )
   948,061,954,432      cycles                    #    3.645 GHz                      ( +-  0.09% )
   926,045,701,578      instructions              #    0.98  insn per cycle           ( +-  0.00% )
   146,418,075,496      branches                  #  562.964 M/sec                    ( +-  0.00% )
     1,197,661,965      branch-misses             #    0.82% of all branches          ( +-  0.17% )

            7.8111 +- 0.0162 seconds time elapsed  ( +-  0.21% )

 Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        253,351.70 msec task-clock                #   28.207 CPUs utilized            ( +-  0.21% )
            41,046      context-switches          #  162.828 /sec                     ( +-  0.16% )
             6,674      cpu-migrations            #   26.475 /sec                     ( +-  3.42% )
            10,879      page-faults               #   43.157 /sec                     ( +-  1.68% )
   931,014,218,983      cycles                    #    3.693 GHz                      ( +-  0.22% )
   919,717,564,454      instructions              #    0.99  insn per cycle           ( +-  0.00% )
   145,480,596,331      branches                  #  577.116 M/sec                    ( +-  0.00% )
     1,175,362,979      branch-misses             #    0.81% of all branches          ( +-  0.12% )

            8.9818 +- 0.0288 seconds time elapsed  ( +-  0.32% )


---------------------------------------------------------------------------------------------------
with patch:

 Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        254,652.01 msec task-clock                #   33.449 CPUs utilized            ( +-  0.11% )
            40,970      context-switches          #  160.974 /sec                     ( +-  0.10% )
             5,397      cpu-migrations            #   21.205 /sec                     ( +-  2.01% )
            11,705      page-faults               #   45.990 /sec                     ( +-  1.21% )
   911,115,537,080      cycles                    #    3.580 GHz                      ( +-  0.11% )
   925,635,958,489      instructions              #    1.02  insn per cycle           ( +-  0.00% )
   146,450,995,164      branches                  #  575.416 M/sec                    ( +-  0.00% )
     1,188,906,011      branch-misses             #    0.81% of all branches          ( +-  0.28% )

            7.6132 +- 0.0381 seconds time elapsed  ( +-  0.50% )

 Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

        236,962.38 msec task-clock                #   27.948 CPUs utilized            ( +-  0.05% )
            40,030      context-switches          #  168.869 /sec                     ( +-  0.04% )
             3,156      cpu-migrations            #   13.314 /sec                     ( +-  1.37% )
             9,448      page-faults               #   39.857 /sec                     ( +-  1.72% )
   856,444,937,794      cycles                    #    3.613 GHz                      ( +-  0.06% )
   919,459,795,805      instructions              #    1.07  insn per cycle           ( +-  0.00% )
   145,654,799,033      branches                  #  614.452 M/sec                    ( +-  0.00% )
     1,177,464,719      branch-misses             #    0.81% of all branches          ( +-  0.23% )

            8.4788 +- 0.0323 seconds time elapsed  ( +-  0.38% )
--------------------------------------------------------------------------------------------------------

Tried on a symmetric system with all cores having SMT=4 as well. There was reduction in migrations here as well.
Didnt observe any major regressions when microbenchmarks run alone. Such as hackbench, stress-ng. 

So. Here is tested-by. 
Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>



> +
>  	/* Are we the first CPU of this group ? */
>  	return group_balance_cpu(sg) == env->dst_cpu;
>  }

One doubt though, Here a fully idle core would be chosen instead of first idle cpu in the 
group (if there is one). Since coming out of idle of SMT is faster compared to a fully idle core,
would latency increase? Or that concerns mainly wakeup path?

  reply	other threads:[~2023-07-14 13:04 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 22:56 [Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs Tim Chen
2023-07-07 22:57 ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tim Chen
2023-07-14 13:06   ` Shrikanth Hegde
2023-07-14 23:05     ` Tim Chen
2023-07-15 18:25       ` Tim Chen
2023-07-16 19:36       ` Shrikanth Hegde
2023-07-17 11:10         ` Peter Zijlstra
2023-07-17 12:18           ` Shrikanth Hegde
2023-07-17 13:37             ` Peter Zijlstra
2023-07-17 14:58               ` [PATCH] sched/fair: Add SMT4 group_smt_balance handling Shrikanth Hegde
2023-07-27  3:11                 ` Tim Chen
2023-07-27 13:32                   ` Tim Chen
2023-08-07  9:36                     ` Shrikanth Hegde
2023-08-21 19:19                       ` Tim Chen
2023-09-05  8:03                         ` Shrikanth Hegde
2023-09-05  9:49                           ` Peter Zijlstra
2023-09-05 18:37                           ` Tim Chen
2023-09-06  9:29                             ` Shrikanth Hegde
2023-09-06 15:42                               ` Tim Chen
2023-09-07  8:58                             ` Shrikanth Hegde
2023-09-07 17:42                               ` Tim Chen
2023-09-12 10:29                                 ` [tip: sched/urgent] sched/fair: Fix " tip-bot2 for Tim Chen
2023-09-13 13:11                                 ` tip-bot2 for Tim Chen
2023-09-05 10:38                         ` [PATCH] sched/fair: Add " Peter Zijlstra
2023-09-05 10:41                     ` Peter Zijlstra
2023-09-05 17:54                       ` Tim Chen
2023-09-06  8:23                         ` Peter Zijlstra
2023-09-06 15:45                           ` Tim Chen
2023-07-18  6:07       ` [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups Tobias Huschle
2023-07-18 14:52         ` Shrikanth Hegde
2023-07-19  8:14           ` Tobias Huschle
2023-07-14 14:53   ` Tobias Huschle
2023-07-14 23:29     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 2/6] sched/topology: Record number of cores in sched group Tim Chen
2023-07-10 20:33   ` Valentin Schneider
2023-07-10 22:13     ` Tim Chen
2023-07-12  9:27       ` Valentin Schneider
2023-07-10 22:40   ` Tim Chen
2023-07-11 11:31     ` Peter Zijlstra
2023-07-11 16:32       ` Tim Chen
2023-07-07 22:57 ` [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups Tim Chen
2023-07-14 13:14   ` Shrikanth Hegde
2023-07-14 14:22     ` Tobias Huschle
2023-07-14 23:35       ` Tim Chen
2023-07-14 20:44     ` Tim Chen
2023-07-14 23:23       ` Tim Chen
2023-07-15  0:11     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance Tim Chen
2023-07-14 13:02   ` Shrikanth Hegde [this message]
2023-07-14 22:16     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU Tim Chen
2023-07-08 12:31   ` Peter Zijlstra
2023-07-10 16:13     ` Tim Chen
2023-07-07 22:57 ` [Patch v3 6/6] sched/debug: Dump domains' sched group flags Tim Chen
2023-07-10 20:33   ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=29f8bc23-9b17-0a21-baa5-3eee1322f8f0@linux.vnet.ibm.com \
    --to=sshegde@linux.vnet.ibm.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hdanton@sina.com \
    --cc=ionela.voinescu@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=naveen.n.rao@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rafael.j.wysocki@intel.com \
    --cc=ravi.v.shankar@intel.com \
    --cc=ricardo.neri-calderon@linux.intel.com \
    --cc=ricardo.neri@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=v-songbaohua@oppo.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=x86@kernel.org \
    --cc=yangyicong@hisilicon.com \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).