Re: wake_wide mechanism clarification

From: Matt Fleming <matt@codeblueprint.co.uk>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Juri Lelli <Juri.Lelli@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Patrick Bellasi <patrick.bellasi@arm.com>,
	Brendan Jackman <brendan.jackman@arm.com>,
	Chris Redpath <Chris.Redpath@arm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: Re: wake_wide mechanism clarification
Date: Fri, 30 Jun 2017 14:11:24 +0100
Message-ID: <20170630131124.GB12077@codeblueprint.co.uk>
In-Reply-To: <20170630004912.GA2457@destiny>

On Thu, 29 Jun, at 08:49:13PM, Josef Bacik wrote:
> 
> It may be worth to try with schedbench and trace it to see how this turns out in
> practice, as that's the workload that generated all this discussion before.  I
> imagine generally speaking this works out properly.  The small regression I
> reported before was at low RPS, so we wouldn't be waking up as many tasks as
> often, so we would be returning 0 from wake_wide() and we'd get screwed.  This
> is where I think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure.  Thanks,

Just 2 weeks ago I was poking at wake_wide() because it's impacting
hackbench times now that we're better at balancing on fork() (see commit
6b94780e45c1 ("sched/core: Use load_avg for selecting idlest group")).

What's happening is that occasionally the hackbench times will be
pretty large because the hackbench tasks are being pulled back and
forth across NUMA domains due to the wake_wide() logic.

Reproducing this issue does require a NUMA box with more CPUs than
hackbench tasks. I was using an 80-CPU, 2-node NUMA box with 1
hackbench group (20 readers, 20 writers).

I did the following very quick hack:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f5efa51dc7..c1bc1b0434bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,7 @@ static int wake_wide(struct task_struct *p)

        if (master < slave)
                swap(master, slave);
-       if (slave < factor || master < slave * factor)
+       if (master < slave * factor)
                return 0;
        return 1;
 }

Which produces the following results for the 1 group (40 tasks) on one
of SUSE's enterprise kernels:

hackbench-process-pipes
                            4.4.71                  4.4.71
                          patched+  patched+-wake-wide-fix
Min      1        0.7000 (  0.00%)        0.8480 (-21.14%)
Amean    1        1.0343 (  0.00%)        0.9073 ( 12.28%)
Stddev   1        0.2373 (  0.00%)        0.0447 ( 81.15%)
CoeffVar 1       22.9447 (  0.00%)        4.9300 ( 78.51%)
Max      1        1.2270 (  0.00%)        0.9560 ( 22.09%)

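You'll see that the minimum value is worse with my change (0.7000 ->
0.8480, about 21% slower), but the maximum is much better (1.2270 ->
0.9560, about 22% faster), and the standard deviation drops by roughly
81%.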
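
For anyone checking the bracketed percentages, here is a minimal sketch
of how they appear to be derived. It assumes a lower-is-better metric
where a positive number means the second kernel improved on the first,
which matches the signs in the table. The helper names pct_gain() and
coeff_var() are made up for illustration and are not taken from the
benchmark tooling.

```c
/*
 * Hypothetical helpers (not from any benchmark tool) that reproduce the
 * bracketed percentages in the table, assuming lower is better.
 */

/* Relative gain of candidate over baseline, in percent.
 * Positive means the candidate value is smaller (better). */
double pct_gain(double baseline, double candidate)
{
	return (baseline - candidate) / baseline * 100.0;
}

/* Coefficient of variation in percent: stddev relative to the mean. */
double coeff_var(double stddev, double mean)
{
	return stddev / mean * 100.0;
}
```

Feeding in the Min row, pct_gain(0.7000, 0.8480) comes out negative
(a regression), while the Max row, pct_gain(1.2270, 0.9560), comes out
positive. Small differences from the table in the last digit are
expected because the inputs here are already rounded.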

So the current wake_wide() code helps in some cases but hurts in
others.
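
To make the difference between the two conditions in the hack concrete,
here is a small userspace model of both variants. It is only a sketch:
master, slave and factor are passed in as plain unsigned parameters,
whereas in the kernel they come from the wakee flip counters of the
waker and wakee and from the LLC domain size. The function names
wake_wide_orig() and wake_wide_patched() are made up for this example.

```c
/*
 * Userspace model of the two wake_wide() conditions from the diff.
 * All inputs are plain parameters here; that is an assumption of this
 * sketch, not how the kernel obtains them.
 */

/* Make sure master holds the larger flip count, as swap() does in the
 * kernel function. */
static void order_flips(unsigned int *master, unsigned int *slave)
{
	if (*master < *slave) {
		unsigned int tmp = *master;
		*master = *slave;
		*slave = tmp;
	}
}

/* Condition before the hack: both tests must pass to return 1. */
int wake_wide_orig(unsigned int master, unsigned int slave,
		   unsigned int factor)
{
	order_flips(&master, &slave);
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

/* Condition after the hack: the "slave < factor" test is dropped. */
int wake_wide_patched(unsigned int master, unsigned int slave,
		      unsigned int factor)
{
	order_flips(&master, &slave);
	if (master < slave * factor)
		return 0;
	return 1;
}
```

The two variants only disagree when slave < factor and
master >= slave * factor. For example, with factor = 20, master = 100
and slave = 2, the original condition returns 0 and the patched one
returns 1. Note that in this model the patched condition also returns 1
whenever slave is 0, since master < 0 can never be true for unsigned
values.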

I'm happy to gather performance data for any code suggestions.