All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
To: Valentin Schneider <vschneid@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ricardo Neri <ricardo.neri@intel.com>,
	"Ravi V. Shankar" <ravi.v.shankar@intel.com>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Len Brown <len.brown@intel.com>, Mel Gorman <mgorman@suse.de>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	"Tim C . Chen" <tim.c.chen@intel.com>
Subject: Re: [PATCH v2 1/7] sched/fair: Generalize asym_packing logic for SMT local sched group
Date: Sun, 15 Jan 2023 20:05:29 -0800	[thread overview]
Message-ID: <20230116040529.GA16654@ranerica-svr.sc.intel.com> (raw)
In-Reply-To: <20230113190226.GA1379@ranerica-svr.sc.intel.com>

On Fri, Jan 13, 2023 at 11:02:26AM -0800, Ricardo Neri wrote:
> On Wed, Jan 11, 2023 at 04:04:23PM +0000, Valentin Schneider wrote:
> > On 28/12/22 20:00, Ricardo Neri wrote:
> > > On Thu, Dec 22, 2022 at 04:55:58PM +0000, Valentin Schneider wrote:
> > >> Some of this is new to me - I had missed the original series introducing
> > >> this. However ISTM that this is conflating two concepts: SMT occupancy
> > >> balancing, and asym packing.
> > >> 
> > >> Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
> > >> not involve asym packing priorities at all. This can end up in an
> > >> ASYM_PACKING load balance,
> > >
> > > Yes, this the desired result: an idle low-priority CPU should help a high-
> > > priority core with more than one busy SMT sibling. But yes, it does not
> > > relate to priorities and can be implemented differently.
> > >
> > >> which per calculate_imbalance() tries to move
> > >> *all* tasks to the higher priority CPU - in the case of SMT balancing,
> > >> we don't want to totally empty the core, just its siblings.
> > >
> > > But it will not empty the core, only one of its SMT siblings. A single
> > > sibling will be selected in find_busiest_queue(). The other siblings will
> > > be unaffected.
> > >
> > 
> > Right
> > 
> > >> 
> > >> Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
> > >> the above?
> > >
> > > Please see below.
> > >
> > >> 
> > >> Say, what's not sufficient with the below? AFAICT the only thing that isn't
> > >> covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
> > >> asym packing - if the current calculate_imbalance() doesn't do it, it
> > >> should be fixed to do it.
> > >
> > > Agreed.
> > >
> > >>Looking at the
> > >> 
> > >>   local->group_type == group_has_spare
> > >> 
> > >> case, it looks like it should DTRT.
> > >
> > > I had tried (and failed) to have find_busiest_group() handle the
> > > !local_is_smt :: sg_busy_cpus >= 2 case. Now I think I made it work.
> > >
> > > When the busiest group is not overloaded, find_busiest_group() and
> > > local->group = group_has_spare during an idle load balance events the
> > > number of *idle* CPUs. However, this does not work if the local and busiest
> > > groups have different weights. In SMT2, for instance, if busiest has 2 busy
> > > CPUs (i.e., 0 idle CPUs) and the destination CPU is idle, the difference in
> > > the number of idle CPUs is 1. find_busiest_group() will incorrectly goto
> > > out_balanced.
> > >
> > > This issue very visible in Intel hybrid processors because the big cores
> > > have SMT but the small cores do not. It can, however, be reproduced in non-
> > > hybrid processors by offlining the SMT siblings of some cores.
> > >
> > 
> > I think I follow. If we're comparing two groups each spanning an SMT2 core,
> > then
> > 
> >   busiest->group_weight > 1 && local->idle_cpus <= (busiest->idle_cpus + 1)
> > 
> > is false if local is fully idle and busiest fully busy, but if local
> > becomes a non-SMT core, then that's true and we goto out_balanced.
> 
> Exactly right.
> 
> > 
> > 
> > With that said, shouldn't SD_PREFER_SIBLING help here? cf.
> > 
> > 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > 	    busiest->sum_nr_running > local->sum_nr_running + 1)
> 
> It does not help because sds.prefer_sibling is false: an non-SMT core is
> looking into offloading a fully_busy SMT core at the "MC" domain.
> sds.prefer_sibling is set in update_sd_lb_stats() if the sched domain's child
> has the SD_PREFER_SIBLING flag. Since the destination CPU is the non-SMT
> core, there is no child.
> 
> > 
> > It should be set on any topology level below the NUMA ones, we do remove it
> > on SD_ASYM_CPUCAPACITY levels because this used to interfere with misfit
> > balancing (it would override the group_type), things are a bit different
> > since Vincent's rewrite of load_balance() but I think we still want it off
> > there.

Your comment got me thinking. Whose child sched domain wants prefer_sibling?
It sounds to me that is busiest's. I could not think of any reason of *having*
to use the flags of the local group.

We can use the flags of the sched group (as per 16d364ba6ef2 ("sched/topology:
Introduce sched_group::flags"), these are the flags of the child domain).

The patch below works for me and I don't have to even the number of busy CPUs.
It should not interfere with misfit balancing either:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2c70e1087d0..737bb3c8bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9752,8 +9752,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	/*
+	 * Tag domain that child domain prefers tasks go to siblings first.
+	 * A sched group has the flags of the child domain, if any.
+	 */
+	if (sds->busiest)
+		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
 
 
 	if (env->sd->flags & SD_NUMA)

Thanks and BR,
Ricardo

  reply	other threads:[~2023-01-16  3:56 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-22 20:35 [PATCH v2 0/7] x86/sched: Avoid unnecessary migrations within SMT domains Ricardo Neri
2022-11-22 20:35 ` [PATCH v2 1/7] sched/fair: Generalize asym_packing logic for SMT local sched group Ricardo Neri
2022-12-06 17:22   ` Dietmar Eggemann
2022-12-12 17:53     ` Ricardo Neri
2022-12-21 13:03       ` Dietmar Eggemann
2022-12-22  4:32         ` Ricardo Neri
2022-12-22 11:12           ` Dietmar Eggemann
2022-12-23 13:11             ` Ricardo Neri
2022-12-22 16:55   ` Valentin Schneider
2022-12-29  4:00     ` Ricardo Neri
2023-01-11 16:04       ` Valentin Schneider
2023-01-13 19:02         ` Ricardo Neri
2023-01-16  4:05           ` Ricardo Neri [this message]
2023-01-16 19:07             ` Valentin Schneider
2023-01-17 12:49               ` Ricardo Neri
2022-11-22 20:35 ` [PATCH v2 2/7] sched: Prepare sched_asym_prefer() to handle idle state of SMT siblings Ricardo Neri
2022-11-22 20:35 ` [PATCH v2 3/7] sched: Teach arch_asym_cpu_priority() the " Ricardo Neri
2022-12-06 17:54   ` Dietmar Eggemann
2022-12-12 17:54     ` Ricardo Neri
2022-12-21 17:12       ` Dietmar Eggemann
2022-12-22  4:55         ` Ricardo Neri
2022-12-22 16:56           ` Valentin Schneider
2022-11-22 20:35 ` [PATCH v2 4/7] sched/fair: Introduce sched_smt_siblings_idle() Ricardo Neri
2022-12-06 18:03   ` Dietmar Eggemann
2022-12-12 17:54     ` Ricardo Neri
2022-12-22 11:12       ` Dietmar Eggemann
2022-12-22 16:56   ` Valentin Schneider
2022-12-24  5:28     ` Ricardo Neri
2022-12-28 15:29       ` Chen Yu
2022-12-30  0:17         ` Ricardo Neri
2023-01-10 19:21       ` Valentin Schneider
2022-11-22 20:35 ` [PATCH v2 5/7] x86/sched: Remove SD_ASYM_PACKING from the "SMT" domain Ricardo Neri
2022-12-08 16:03   ` Ionela Voinescu
2022-12-14 16:59     ` Ricardo Neri
2022-12-15 16:48       ` Valentin Schneider
2022-12-20  0:42         ` Ricardo Neri
2022-12-22 16:56           ` Valentin Schneider
2022-12-29 19:02             ` Ricardo Neri
2023-01-10 19:17               ` Valentin Schneider
2023-01-13  1:31                 ` Ricardo Neri
2022-11-22 20:35 ` [PATCH v2 6/7] x86/sched/itmt: Give all SMT siblings of a core the same priority Ricardo Neri
2022-11-22 20:35 ` [PATCH v2 7/7] x86/sched/itmt: Consider the idle state of SMT siblings Ricardo Neri

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230116040529.GA16654@ranerica-svr.sc.intel.com \
    --to=ricardo.neri-calderon@linux.intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=peterz@infradead.org \
    --cc=rafael.j.wysocki@intel.com \
    --cc=ravi.v.shankar@intel.com \
    --cc=ricardo.neri@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tim.c.chen@intel.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.