kernel-janitors.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Julia Lawall <julia.lawall@inria.fr>
Cc: Ingo Molnar <mingo@redhat.com>,
	kernel-janitors@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	linux-kernel@vger.kernel.org,
	Valentin Schneider <valentin.schneider@arm.com>,
	Gilles Muller <Gilles.Muller@inria.fr>
Subject: Re: [PATCH] sched/fair: check for idle core
Date: Wed, 21 Oct 2020 12:28:11 +0000	[thread overview]
Message-ID: <20201021122811.GD32041@suse.de> (raw)
In-Reply-To: <alpine.DEB.2.22.394.2010211336410.8475@hadrien>

On Wed, Oct 21, 2020 at 01:56:55PM +0200, Julia Lawall wrote:
> > I suspect that the benefit of this patch is due to avoiding the overhead
> > of wake_affine_weight() check because the following check exists in
> > select_idle_sibling
> 
> I'm running 160 threads on 160 cores (80 physical cores).  All of the
> threads are thus best off to just stay where they are.  If one thread
> moves to the socket of prev, then they will be displacing other threads
> that also expect to return to where they were previously located.
> 
> You can see this in the traces shown here:
> 
> https://pages.lip6.fr/Julia.Lawall/uas.pdf
> 

This is an important point -- almost all CPUs are commonly active and the
search for an idle CPU is going to be difficult given that it is inherently
race-prone. In this case, you're right, it may indeed be better off to just
leave the task where it is if the prev CPU is idle. Not only does it avoid
the wake_affine_weight cost, it avoids a useless search of idle siblings.

However, there is alsoo the important case where a machine is only
partially or lightly utilised. In that case, a cross-node migration
may be beneficial as the wakee may need to access data in the wakers
cache or memory local to the waker (no guarantee, the waker/wakee could
be completely independent but the scheduler cannot tell). In the low
utilisation case, select_idle_sibling is also likely to be less expensive.

Leaving a task on a remote node because the prev CPU was idle is an
important semantic change in the behaviour of the scheduler when there is
spare idle capacity in either domain. It's non-obvious from the changelog
and the patch itself that this change of behaviour happens.

> Prior to 5.8, my machine was using intel_pstate and had few background
> tasks.  Thus the problem wasn't visible in practice.  Starting with 5.8
> the kernel decided that intel_cpufreq would be more appropriate, which
> introduced kworkers every 0.004 seconds on all cores.  In the graphs for
> early versions, sometimes the whole benchmark runs with the threads just
> staying on their cores, or a few migrations.  Starting with 5.8, after 5
> seconds where there are a number of synchronizations, all of the threads
> move around between all of the cores.  Typically, one bad placement leads
> to 10-15 threads moving around, until one ends up on the idle core where
> the original thread was intended to be.
> 

And this is an issue but I don't think the fix is avoiding cross-node
migrations entirely if the prev CPU happened to be idle because cache
lines will have to be bounced and/or remote data be accessed.  At minimum,
the low utilisation case should be considered and the changelog justify
why avoiding cross-node wakeups is still appropriate when the waker CPU
has plenty of idle CPUs.

> >
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
> >         if (prev != target && cpus_share_cache(prev, target) &&
> >             (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> >                 return prev;
> 
> This isn't triggered in the problematic case, because the problematic case
> is where the prev core and the waker core are on different sockets.
> 

I simply wanted to highlight the fact it checks whether caches are shared
or not as that is also taken into account in select_idle_sibling for
example.

> To my understanding, when the runnable load was used and prev was idle,
> wake_affine_weight would fail, and then wake_affine would return prev.
> With the load average, in the case where there is a thread on the waker
> core and there has recently been a daemon on the prev core, the comparison
> between the cores is a bit random.  The patch thus tries to restore the
> previous behavior.
> 

I think wake_affine_weight() is a bit random anyway simply because it picks
a CPU that select_idle_sibling ignores in a lot of cases and in others
simply stacks the wakee on top of the waker. Every so often I think that
it should simply be ripped out because the scheduler avoids stacking on
the wakeup paths unless there is no other choice in which case the task
might as well remain on prev anyway. Maybe it made more sense when there
was an effort to detect when task stacking is appropriate but right now,
wake_affine_weight() is basically a random migration generator -- leave
it to the load balancer to figure out placement decisions based on load.

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2020-10-21 12:28 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-20 16:37 [PATCH] sched/fair: check for idle core Julia Lawall
2020-10-21  7:29 ` Vincent Guittot
2020-10-21 11:13   ` Peter Zijlstra
2020-10-21 12:27   ` Vincent Guittot
2020-10-21 11:20 ` Mel Gorman
2020-10-21 11:56   ` Julia Lawall
2020-10-21 12:19     ` Peter Zijlstra
2020-10-21 12:42       ` Julia Lawall
2020-10-21 12:52         ` Peter Zijlstra
2020-10-21 18:18           ` Rafael J. Wysocki
2020-10-21 18:15         ` Rafael J. Wysocki
2020-10-21 19:47           ` Julia Lawall
2020-10-21 20:25             ` Rafael J. Wysocki
2020-10-21 13:10       ` Peter Zijlstra
2020-10-21 18:11         ` Rafael J. Wysocki
2020-10-22  4:53           ` Viresh Kumar
2020-10-22  7:11           ` Peter Zijlstra
2020-10-22 10:59             ` Viresh Kumar
2020-10-22 11:45               ` Rafael J. Wysocki
2020-10-22 12:02                 ` default cpufreq gov, was: " Peter Zijlstra
2020-10-22 12:19                   ` Rafael J. Wysocki
2020-10-22 12:29                     ` Peter Zijlstra
2020-10-22 14:52                       ` Mel Gorman
2020-10-22 14:58                         ` Colin Ian King
2020-10-22 15:12                           ` Phil Auld
2020-10-22 16:35                             ` Mel Gorman
2020-10-22 17:59                               ` Rafael J. Wysocki
2020-10-22 20:32                                 ` Mel Gorman
2020-10-22 20:39                                   ` Phil Auld
2020-10-22 15:25                         ` Peter Zijlstra
2020-10-22 15:55                           ` Rafael J. Wysocki
2020-10-22 16:29                           ` Mel Gorman
2020-10-22 20:10                           ` Giovanni Gherdovich
2020-10-22 20:16                             ` Giovanni Gherdovich
2020-10-23  7:03                             ` Peter Zijlstra
2020-10-23 17:46                               ` Tom Lendacky
2020-10-26 19:52                                 ` Fontenot, Nathan
2020-10-22 15:45                       ` A L
2020-10-22 15:55                         ` Vincent Guittot
2020-10-23  5:23                           ` Viresh Kumar
2020-10-22 16:23                   ` [PATCH] cpufreq: Avoid configuring old governors as default with intel_pstate Rafael J. Wysocki
2020-10-23  6:29                     ` Viresh Kumar
2020-10-23 11:59                       ` Rafael J. Wysocki
2020-10-23 15:15                     ` [PATCH v2] " Rafael J. Wysocki
2020-10-27  3:13                       ` Viresh Kumar
2020-10-27 11:11                   ` default cpufreq gov, was: [PATCH] sched/fair: check for idle core Qais Yousef
2020-10-27 11:26                     ` Valentin Schneider
2020-10-27 11:42                       ` Qais Yousef
2020-10-27 11:48                         ` Viresh Kumar
2020-10-23  6:24                 ` Viresh Kumar
2020-10-23 15:06                   ` Rafael J. Wysocki
2020-10-27  3:13                     ` Viresh Kumar
2020-10-22 11:21             ` AW: " Walter Harms
2020-10-21 12:28     ` Mel Gorman [this message]
2020-10-21 12:25   ` Vincent Guittot
2020-10-21 12:47     ` Mel Gorman
2020-10-21 12:56       ` Julia Lawall
2020-10-21 13:18         ` Mel Gorman
2020-10-21 13:24           ` Julia Lawall
2020-10-21 15:08             ` Mel Gorman
2020-10-21 15:18               ` Julia Lawall
2020-10-21 15:23                 ` Vincent Guittot
2020-10-21 15:33                   ` Julia Lawall
2020-10-21 15:19               ` Vincent Guittot
2020-10-21 17:00                 ` Mel Gorman
2020-10-21 17:39                   ` Julia Lawall
2020-10-21 13:48           ` Julia Lawall
2020-10-21 15:26             ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201021122811.GD32041@suse.de \
    --to=mgorman@suse.de \
    --cc=Gilles.Muller@inria.fr \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=julia.lawall@inria.fr \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-janitors@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).