From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>, Phil Auld <pauld@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
	Rik van Riel <riel@surriel.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
Date: Wed, 8 Jan 2020 17:46:57 +0100	[thread overview]
Message-ID: <20200108164657.GA16425@linaro.org> (raw)
In-Reply-To: <20200108140349.GL3466@techsingularity.net>

On Wednesday 08 Jan 2020 at 14:03:49 (+0000), Mel Gorman wrote:
> On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> > > Now I get you, but unfortunately it also would not work out. The number
> > > of groups is not related to the LLC except in some specific cases.
> > > It's possible to use the first CPU to find the size of an LLC but now I
> > > worry that it would lead to unpredictable behaviour. AMD has different
> > > numbers of LLCs per node depending on the CPU family and while Intel
> > > generally has one LLC per node, I imagine there are counter examples.
> > 
> > Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
> > pretends this isn't so and truncates the LLC topology information to be
> > the node again -- see arch/x86/kernel/smpboot.c:match_llc().
> > 
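(An aside on the "use the first CPU to find the size of an LLC" idea a few
lines up: the scheduler already tracks that value per CPU, so probing it is
little more than the sketch below. per_cpu(sd_llc_size, cpu) and
sched_group_span() already exist in kernel/sched; "group" stands for whichever
sched_group is being looked at. Whether this is a sane input for the allowed
imbalance is exactly what is in doubt, so take it as illustration only.)

	/*
	 * Illustrative only: the LLC span seen by the first CPU of a
	 * group, as kept up to date by update_top_cache_domain().
	 */
	int cpu = cpumask_first(sched_group_span(group));
	int llc_size = per_cpu(sd_llc_size, cpu);
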
> 
> That's the Sub-NUMA Clustering, right? I thought that manifested as a
> separate NUMA node in Linux. Not exactly right from a topology point of
> view, but close enough. Still, you're right in that identifying the LLC
> specifically would not necessarily be the way to go. FWIW, I have one
> machine with SNC enabled and didn't notice anything problematic in the
> versions of the patch released to date.
> 
> > And of course, in the Core2 era we had the Core2Quad chips, which were a
> > dual-die solution and therefore also had multiple LLCs, and I think the
> > Xeon variant of that would allow the multiple LLC per node situation
> > too, although this is of course ancient hardware nobody really cares
> > about anymore.
> > 
> 
> Ok, even though they're older, it's enough counter-examples to be concerned
> about. I'm definitely dropping the LLC considerations for the moment :)
> 
> > > This means that load balancing on different machines with similar core
> > > counts will behave differently due to the LLC size.
> > 
> > That sounds like perfectly fine/expected behaviour to me.
> > 
> 
> Even so, I think it should not be part of the initial patch. I would only
> be happy if I had enough different machine types to prove that specific
> special casing is required and we cannot simply rely on standard load
> balancing of the domains below SD_NUMA.
> 
> > > It might be possible
> > > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > > that's guaranteed and it would still be unpredictable. It may be the type
> > > of complexity that should only be introduced with a separate patch with
> > > clear rationale as to why it's necessary and we are not at that threshold
> > > so I withdraw the suggestion.
> > 
> > So IIRC the initial patch(es) had the idea to allow for 1 extra task
> > imbalance to get 1-1 pairs on the same node, instead of across nodes. I
> > don't immediately see that in these later patches.
> > 
> 
> Not quite -- I had a minimum allowed imbalance of 2 tasks -- mostly for
> very small NUMA domains. By v3, it was not necessary because the value
> was hard-coded regardless of the number of CPUs. I think I'll leave it
> out because I don't think it's worth worrying about the imbalance between
> NUMA domains with fewer than 4 CPUs (do they even exist any more?).
> 
> > Would that be something to go back to? Would that not side-step much of
> > the issues under discussion here?
> 
> Allowing just 1 extra task would work for netperf in some cases except when
> softirq is involved. It would partially work for IO on ext4, as it's
> only communicating with one journal thread, but it's a bit more
> borderline for XFS
> due to workqueue usage. XFS is not a massive concern in this context as
> the workqueue is close to the IO issuer and short-lived so I don't think
> it would crop up much for load balancing unlike ext4 where jbd2 could be
> very active.
> 
> If v4 of the patch fails to meet approval then I'll try a patch that

My main concern with v4 was the mismatch between the computed value and the goal of not overloading the LLCs.

> allows a hard-coded imbalance of 2 tasks (one communicating task and

If there is no good way to compute the allowed imbalance, a hard-coded value of 2 is probably a simple value to start with.

> one kthread) regardless of NUMA domain span up to 50% of utilisation

Are you sure that's necessary? This degree of imbalance already applies only when the group has spare capacity.

Something like:

+               /* Consider allowing a small imbalance between NUMA groups */
+               if (env->sd->flags & SD_NUMA) {
+
+                       /*
+                        * Until we find a good way to compute an acceptable
+                        * degree of imbalance linked to the system topology,
+                        * one that will not impact memory bandwidth and
+                        * latency, start with a fixed small value.
+                        */
+                       imbalance_adj = 2;
+
+                       /*
+                        * Ignore small imbalances; this branch is only taken
+                        * when the busiest group has spare capacity.
+                        */
+                       env->imbalance -= min(env->imbalance, imbalance_adj);
+               }
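
For comparison, a rough sketch of the "2 tasks up to 50% of utilisation"
variant described above, presumably at the same spot in calculate_imbalance().
The group_util/group_capacity test and the threshold are my guess at the
proposal, not code from the patch:

		/*
		 * Sketch: only tolerate the fixed imbalance while the
		 * busiest group is below ~50% utilisation.
		 */
		if (env->sd->flags & SD_NUMA &&
		    busiest->group_util * 2 < busiest->group_capacity) {
			long imbalance_adj = 2;

			env->imbalance -= min(env->imbalance, imbalance_adj);
		}

But as said above, that extra gate does not look necessary since this path is
already limited to the case where the busiest group has spare capacity.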

> and see what that gets. One way or the other, I would like to get the
> basic load balancing issue out of the way before even looking at NUMA
> balancing and whether it needs to be adjusted based on the load
> balancing rewrite. NUMA balancing already has some logic that fights
> load balancer decisions and I don't want to make that worse; I would be
> delighted if we could even delete that migration retry check that
> overrides the load balancer, because it has always been a bit nasty.
> 
> -- 
> Mel Gorman
> SUSE Labs

Thread overview: 28+ messages
2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
2019-12-20 12:40 ` Valentin Schneider
2019-12-20 14:22   ` Mel Gorman
2019-12-20 15:32     ` Valentin Schneider
2019-12-21 11:25   ` Mel Gorman
2019-12-22 12:00 ` Srikar Dronamraju
2019-12-23 13:31 ` Vincent Guittot
2019-12-23 13:41   ` Vincent Guittot
2020-01-03 14:31   ` Mel Gorman
2020-01-06 13:55     ` Vincent Guittot
2020-01-06 14:52       ` Mel Gorman
2020-01-07  8:38         ` Vincent Guittot
2020-01-07  9:56           ` Mel Gorman
2020-01-07 11:17             ` Vincent Guittot
2020-01-07 11:56               ` Mel Gorman
2020-01-07 16:00                 ` Vincent Guittot
2020-01-07 20:24                   ` Mel Gorman
2020-01-08  8:25                     ` Vincent Guittot
2020-01-08  8:49                       ` Mel Gorman
2020-01-08 13:18                     ` Peter Zijlstra
2020-01-08 14:03                       ` Mel Gorman
2020-01-08 16:46                         ` Vincent Guittot [this message]
2020-01-08 18:03                           ` Mel Gorman
2020-01-07 11:22             ` Peter Zijlstra
2020-01-07 11:42               ` Mel Gorman
2020-01-07 12:29                 ` Peter Zijlstra
2020-01-07 12:28               ` Peter Zijlstra
2020-01-07 19:26             ` Phil Auld
