From: Julia Lawall <julia.lawall@inria.fr>
To: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>,
	kernel-janitors@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	linux-kernel@vger.kernel.org,
	Valentin Schneider <valentin.schneider@arm.com>,
	Gilles Muller <Gilles.Muller@inria.fr>
Subject: Re: [PATCH] sched/fair: check for idle core
Date: Wed, 21 Oct 2020 11:56:55 +0000	[thread overview]
Message-ID: <alpine.DEB.2.22.394.2010211336410.8475@hadrien> (raw)
In-Reply-To: <20201021112038.GC32041@suse.de>



On Wed, 21 Oct 2020, Mel Gorman wrote:

> On Tue, Oct 20, 2020 at 06:37:59PM +0200, Julia Lawall wrote:
> > On a thread wakeup, the change [1] from runnable load average to load
> > average for comparing candidate cores means that recent short-running
> > daemons on the core where a thread ran previously can be considered to
> > have a higher load than the core performing the wakeup, even when the
> > core where the thread ran previously is currently idle.  This can
> > cause a thread to migrate, taking the place of some other thread that
> > is about to wake up, and so on.  To avoid unnecessary migrations,
> > extend wake_affine_idle to check whether the core where the thread
> > previously ran is currently idle, and if so return that core as the
> > target.
> >
> > [1] commit 11f10e5420f6ce ("sched/fair: Use load instead of runnable
> > load in wakeup path")
> >
> > This particularly has an impact when using passive (intel_cpufreq)
> > power management, where kworkers run every 0.004 seconds on all cores,
> > increasing the likelihood that an idle core will be considered to have
> > a load.
> >
> > The following numbers were obtained with the benchmarking tool
> > hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
> > benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
> > tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
> > 2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
> > management were used.  Times are in seconds.  All experiments use all
> > 160 hardware threads.
> >
> > 	v5.9/active		v5.9+patch/active
> > bt.C.c	24.725724+-0.962340	23.349608+-1.607214
> > lu.C.x	29.105952+-4.804203	25.249052+-5.561617
> > sp.C.x	31.220696+-1.831335	30.227760+-2.429792
> > ua.C.x	26.606118+-1.767384	25.778367+-1.263850
> >
> > 	v5.9/passive		v5.9+patch/passive
> > bt.C.c	25.330360+-1.028316	23.544036+-1.020189
> > lu.C.x	35.872659+-4.872090	23.719295+-3.883848
> > sp.C.x	32.141310+-2.289541	29.125363+-0.872300
> > ua.C.x	29.024597+-1.667049	25.728888+-1.539772
> >
> > On the smaller data sets (A and B) and on the other NAS benchmarks
> > there is no impact on performance.
> >
> > Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
>
> I suspect that the benefit of this patch is due to avoiding the overhead
> of the wake_affine_weight() check, because the following check exists in
> select_idle_sibling():

I'm running 160 threads on the 160 hardware threads (80 physical cores), so
all of the threads are best off just staying where they are.  If one thread
moves to the socket of prev, it displaces other threads that also expect to
return to where they were previously located.

You can see this in the traces shown here:

https://pages.lip6.fr/Julia.Lawall/uas.pdf

Prior to 5.8, my machine was using intel_pstate and had few background
tasks, so the problem wasn't visible in practice.  Starting with 5.8,
the kernel decided that intel_cpufreq would be more appropriate, which
introduced kworkers running every 0.004 seconds on all cores.  In the
graphs for earlier versions, the whole benchmark sometimes runs with the
threads just staying on their cores, or with only a few migrations.
Starting with 5.8, after 5 seconds in which there are a number of
synchronizations, all of the threads move around between all of the
cores.  Typically, one bad placement leads to 10-15 threads moving
around, until one of them ends up on the idle core where the original
thread was intended to be.

>
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */
>         if (prev != target && cpus_share_cache(prev, target) &&
>             (available_idle_cpu(prev) || sched_idle_cpu(prev)))
>                 return prev;

This isn't triggered in the problematic case, where the prev core and the
waker core are on different sockets and thus don't share cache.

To my understanding, when the runnable load was used and prev was idle,
prev had no runnable load, so wake_affine_weight would not choose the
waker's core, and wake_affine would then fall back to returning prev.
With the load average, when there is a thread on the waker core and there
has recently been a daemon on the prev core, the comparison between the
two cores is a bit random.  The patch thus tries to restore the previous
behavior.
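
Concretely, the idea is an extra check toward the end of
wake_affine_idle().  As a sketch against the v5.9 code (not necessarily
the exact hunk of the patch):

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
        /* Existing logic: only use an idle this_cpu if it shares cache with prev. */
        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

        /*
         * Sketched addition: if the CPU the task last ran on is idle, keep
         * the task there rather than falling through to the load comparison
         * in wake_affine_weight().
         */
        if (available_idle_cpu(prev_cpu))
                return prev_cpu;

        return nr_cpumask_bits;
}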

julia

> Still, the concept makes some sense to avoid wake_affine_weight but look
> at the earlier part of wake_affine_idle()
>
>         if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>                 return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>
> This thing is almost completely useless because this_cpu is only going to
> be idle if it's a wakeup from interrupt context when the CPU was otherwise
> idle *but* it takes care to only use the CPU if this and prev share LLC.
>
> The patch as it stands may leave a task on a remote node when it should
> have been pulled local to the waker because prev happened to be idle. This
> is not guaranteed because a node could have multiple LLCs and prev is
> still appropriate but that's a different problem entirely and requires
> much deeper surgery. Still, not pulling a task from a remote node is
> a change in expected behaviour. While it's possible that NUMA domains
> will not even reach this path, it depends on the NUMA distance as can
> be seen in sd_init() for the setting of SD_WAKE_AFFINE so I think the
> cpus_share_cache check is necessary.
>
> I think it would be more appropriate to rework that block that checks
> this_cpu to instead check if the CPUs share cache first and then return one
> of them (preference to prev based on the comment above it about avoiding
> a migration) if either one is idle.
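
If I follow the suggestion, that block would become roughly (untested
sketch, keeping the preference for prev to avoid a migration):

        if (cpus_share_cache(this_cpu, prev_cpu)) {
                /* Prefer the previous CPU so the task does not migrate. */
                if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
                        return prev_cpu;
                if (available_idle_cpu(this_cpu))
                        return this_cpu;
        }
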
>
> I see Vincent already agreed with the patch so I could be wrong.  Vincent,
> did I miss something stupid?
>
> --
> Mel Gorman
> SUSE Labs
>
