From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>, Phil Auld <pauld@redhat.com>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Hillf Danton <hdanton@sina.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Douglas Shakshober <dshaks@redhat.com>,
	Waiman Long <longman@redhat.com>, Joe Mario <jmario@redhat.com>,
	Bill Gray <bgray@redhat.com>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Date: Fri, 15 May 2020 09:47:40 +0100	[thread overview]
Message-ID: <20200515084740.GJ3758@techsingularity.net> (raw)
In-Reply-To: <20200514153122.GE2978@hirez.programming.kicks-ass.net>

On Thu, May 14, 2020 at 05:31:22PM +0200, Peter Zijlstra wrote:
> On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> > Complete shot in the dark but restore adjust_numa_imbalance() and try
> > this
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..0b31f4468d5b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> >  	struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +	if (sched_feat(TTWU_QUEUE)) {
> 
> just saying that this has the risk of regressing other workloads, see:
> 
>   518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
> 

Sure, I didn't really think that this was appropriate but it was the
best I had at the time. This can be a lot more targeted, but it took me
a while to put together a picture as it was difficult for me to
analyse. I was going to post this as a separate RFC but, given this
thread, what do you think of this?

---8<---
sched: Wake cache-local tasks via wake_list if wakee CPU is polling

There are two separate methods for waking a task from idle, one for
tasks running on CPUs that share a cache and one for when the caches
are separate. The methods can loosely be called local and remote even
though this is not directly related to NUMA; it is instead due to the
expected cost of accessing data that is cache-hot on another CPU. The
"local" wakeup is expected to be relatively cheap as the CPUs share the
LLC, in comparison to the IPI that is potentially required by the
"remote" wakeup. The problem is that the local wakeup is not always
cheaper and it appears to have degraded even further around the 4.19
mark.
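
For reference, this is roughly the decision point in question, a
condensed sketch of ttwu_queue() before this patch (paraphrased, not an
exact copy of the 5.7 source):

  static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  {
  	struct rq *rq = cpu_rq(cpu);
  	struct rq_flags rf;

  #if defined(CONFIG_SMP)
  	/* "remote": queue p on the target CPU's wake_list, possibly via IPI */
  	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
  		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
  		ttwu_queue_remote(p, cpu, wake_flags);
  		return;
  	}
  #endif

  	/* "local": take the target runqueue lock and activate p directly */
  	rq_lock(rq, &rf);
  	update_rq_clock(rq);
  	ttwu_do_activate(rq, p, wake_flags, &rf);
  	rq_unlock(rq, &rf);
  }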

There appear to be a couple of reasons why it can be slower.

The first is specific to pairs of tasks where one or both rapidly enters
idle. For example, on netperf UDP_STREAM, the client sends a bunch of
packets, wakes the server, the server processes some packets and goes
back to sleep. There is a relationship between the tasks but it's not
strictly synchronous. The timing is different if the client/server are
on separate NUMA nodes, and netserver is more likely to enter idle when
the tasks are local (measured as the server entering idle roughly 10%
more often than when remote, although the exact figure is
machine-specific). However, the wakeups are so rapid that the wakeup
happens while the server is descheduling. That forces the waker to spin
on smp_cond_load_acquire for longer. In this case, it can be cheaper to
add the task to the rq->wake_list even if that potentially requires an IPI.
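
The spin in question is the on_cpu wait in try_to_wake_up(); a
paraphrased fragment (not the full function) of the relevant step:

  	/*
  	 * Wait until the wakee has finished descheduling on its old CPU,
  	 * i.e. until p->on_cpu drops back to 0. With wakeups as rapid as
  	 * netperf UDP_STREAM generates, the waker can spend this time
  	 * spinning while the server is still on its way to sleep.
  	 */
  	smp_cond_load_acquire(&p->on_cpu, !VAL);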

The second is that the local wakeup path is simply not always that
fast. Using ftrace, the cost of the locks, update_rq_clock and
ttwu_do_activate was measured as roughly 4.5 microseconds. While it is
only a single measured instance, the cost of try_to_wake_up for the
"remote" wakeup was roughly 2.5 microseconds versus 6.2 microseconds
for the "fast" local wakeup. When there are tens of thousands of
wakeups per second, these costs accumulate and manifest as a
throughput regression in netperf UDP_STREAM.

The essential difference in costs comes down to whether the CPU is
fully idle, descheduling a task or polling in poll_idle(). This patch
special-cases ttwu_queue() to use the "remote" method if the wakee
CPU's task is polling, as it is generally cheaper to use the wake_list
in that case than the local method because an IPI should not be
required. As the check is race-prone, a reschedule IPI may still be
sent but, in that case, the local wakeup would also have had to send a
reschedule IPI so it should be harmless.
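
The "IPI should not be required" part relies on the remote queueing
path itself; a paraphrased sketch of ttwu_queue_remote() (details may
differ slightly from the 5.7 source):

  static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
  {
  	struct rq *rq = cpu_rq(cpu);

  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
  		/*
  		 * If the remote CPU is polling in idle, set_nr_if_polling()
  		 * sets TIF_NEED_RESCHED and the poll loop notices it without
  		 * an interrupt; otherwise fall back to a reschedule IPI.
  		 */
  		if (!set_nr_if_polling(rq->idle))
  			smp_send_reschedule(cpu);
  	}
  }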

The benefit is not universal as it requires the specific pattern of
wakeups that happens when the wakee is likely to be descheduling.
netperf UDP_STREAM is one of the more obvious examples, and it became
more obvious once the load balancer was reconciled with NUMA balancing
because the client and server are now more likely to run on the same
NUMA node, which is why the problem was missed for so long.

netperf-udp
                                  5.7.0-rc5              5.7.0-rc5
                                    vanilla          wakelist-v1r1
Hmean     send-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     send-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     send-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     send-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     send-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     send-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     send-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     send-8192     17943.12 (   0.00%)    19886.07 *  10.83%*
Hmean     send-16384    29732.94 (   0.00%)    32297.44 *   8.63%*
Hmean     recv-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     recv-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     recv-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     recv-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     recv-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     recv-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     recv-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     recv-8192     17943.12 (   0.00%)    19886.06 *  10.83%*
Hmean     recv-16384    29731.75 (   0.00%)    32296.08 *   8.62%*

It is less obvious on something like TCP_STREAM, where there is a
stricter relationship between the client and server, but with the
patch the results are much less variable.

netperf-tcp
                             5.7.0-rc5              5.7.0-rc5
                               vanilla          wakelist-v1r1
Hmean     64         769.28 (   0.00%)      779.96 *   1.39%*
Hmean     128       1497.54 (   0.00%)     1509.33 *   0.79%*
Hmean     256       2806.00 (   0.00%)     2787.59 (  -0.66%)
Hmean     1024      9700.80 (   0.00%)     9720.02 (   0.20%)
Hmean     2048     17071.28 (   0.00%)    16968.92 (  -0.60%)
Hmean     3312     22976.20 (   0.00%)    23012.80 (   0.16%)
Hmean     4096     25885.74 (   0.00%)    26152.49 *   1.03%*
Hmean     8192     34014.83 (   0.00%)    33878.67 (  -0.40%)
Hmean     16384    39482.14 (   0.00%)    40307.22 (   2.09%)
Stddev    64           2.60 (   0.00%)        0.17 (  93.31%)
Stddev    128          3.69 (   0.00%)        1.52 (  58.97%)
Stddev    256         31.06 (   0.00%)       14.76 (  52.47%)
Stddev    1024        96.50 (   0.00%)       46.43 (  51.89%)
Stddev    2048       115.98 (   0.00%)       62.59 (  46.03%)
Stddev    3312       293.28 (   0.00%)       41.92 (  85.71%)
Stddev    4096       173.45 (   0.00%)      123.19 (  28.98%)
Stddev    8192       783.59 (   0.00%)      129.62 (  83.46%)
Stddev    16384     1012.37 (   0.00%)      252.08 (  75.10%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..59077c7c6660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+	if (sched_feat(TTWU_QUEUE)) {
+		/*
+		 * A remote wakeup is often expensive as it can require
+		 * an IPI and the wakeup path is slow. However, in
+		 * the specific case where the target CPU is idle
+		 * and polling, the CPU is active and rapidly checking
+		 * if a reschedule is needed. In this case, the idle
+		 * task just needs to be marked for resched and p
+		 * will rapidly be requeued which is less expensive
+		 * than the direct wakeup path.
+		 */
+		if (cpus_share_cache(smp_processor_id(), cpu)) {
+			struct thread_info *ti = task_thread_info(p);
+			typeof(ti->flags) val = READ_ONCE(ti->flags);
+
+			if (val & _TIF_POLLING_NRFLAG)
+				goto activate;
+		}
+
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 		ttwu_queue_remote(p, cpu, wake_flags);
 		return;
 	}
+activate:
 #endif
 
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	ttwu_do_activate(rq, p, wake_flags, &rf);
