linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups
@ 2019-01-09  3:49 Andrea Arcangeli
  2019-01-09  3:49 ` [PATCH 1/1] " Andrea Arcangeli
  2019-01-09  4:19 ` [PATCH 0/1] RFC: " Mike Galbraith
  0 siblings, 2 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2019-01-09  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman; +Cc: linux-kernel

Hello,

we noticed some unexpected performance regressions in the scheduler
after switching the guest CPU topology from "-smp 2,sockets=2,cores=1"
to "-smp 2,sockets=1,cores=2".

With sockets=2,cores=1, localhost message passing (pipes, AF_UNIX,
etc.) runs serially at 100% CPU load on a single vcpu with optimal
performance. With sockets=1,cores=2 the load is spread across both
vcpus and performance is reduced.

With SCHED_MC=n on older kernels the problem goes away (but that's far
from ideal for heavily multithreaded workloads which then regress)
because that basically disables the last part of select_idle_sibling().

On bare metal with SCHED_MC=y on any recent multicore CPU the
scheduler (as expected) behaves like sockets=1,cores=2, so it won't
run the tasks serially.

The reason is that select_idle_sibling() can disregard the "sync" hint
and all decisions made up to that point, and at the last minute it can
decide to move the woken task to an arbitrary idle core.
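
For context, this is roughly where the hint gets dropped (paraphrased
from this era's kernel/sched/fair.c with details elided; this is not
the patch itself): the sync flag is computed from WF_SYNC in
select_task_rq_fair() and consulted by wake_affine(), but the final
select_idle_sibling() call never sees it:

	static int
	select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag,
			    int wake_flags)
	{
		int cpu = smp_processor_id();
		int new_cpu = prev_cpu;
		int sync = (wake_flags & WF_SYNC) &&
			   !(current->flags & PF_EXITING);
		...
		/* the "sync" hint is honored here, on the affine path... */
		new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
		...
		/* ...but not here, where the final CPU is actually picked */
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
		...
	}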

To test the above theory I implemented this patch, which seems to
confirm that the reason the tasks no longer run serially with
sockets=1,cores=2 is select_idle_sibling() overriding the "sync" hint.

You worked on wake_affine() before, so you may want to review this
issue, if you agree that these sync workloads should run serially even
in the presence of idle cores in the system. I don't know if the
current behavior is intentional, but if it is, it'd be interesting to
know why. This is just an RFC.

To test I used this trivial program.

/*
 *  pipe-loop.c
 *
 *  Copyright (C) 2019 Red Hat, Inc.
 *
 *  This work is licensed under the terms of the GNU GPL, version 2.
 */

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	char buf[1];
	int n = 1000000;

	int pipe1[2], pipe2[2];

	pipe(pipe1);
	pipe(pipe2);

	if (fork()) {
		/* parent: wait for the 1-byte token, then bounce it back */
		while (n--) {
			read(pipe1[0], buf, 1);
			write(pipe2[1], buf, 1);
		}
		wait(NULL);
	} else {
		/* child: send the token, then wait for it to come back */
		while (n--) {
			write(pipe1[1], buf, 1);
			read(pipe2[0], buf, 1);
		}
	}

	return 0;
}
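
(For completeness: I build it with "gcc -O2 pipe-loop.c -o pipe-loop"
and just watch top while it runs. With sockets=2,cores=1 a single vcpu
sits at 100%; with sockets=1,cores=2 both vcpus hover around ~50%.)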

Andrea Arcangeli (1):
  sched/fair: skip select_idle_sibling() in presence of sync wakeups

 kernel/sched/fair.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)


* [PATCH 1/1] sched/fair: skip select_idle_sibling() in presence of sync wakeups
  2019-01-09  3:49 [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups Andrea Arcangeli
@ 2019-01-09  3:49 ` Andrea Arcangeli
  2019-01-09  4:19 ` [PATCH 0/1] RFC: " Mike Galbraith
  1 sibling, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2019-01-09  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman; +Cc: linux-kernel

__wake_up_sync() gives a very explicit hint to the scheduler that the
current task will immediately go to sleep and won't resume running
until after the woken task has started running again.

This is common behavior for message passing through pipes or local
sockets (AF_UNIX or through the loopback interface).

The scheduler does everything right up to the point it calls
select_idle_sibling(). Up to that point the CPU selected for the next
task that got a sync-wakeup could very well be the local CPU. That way
the sync-woken task will start running immediately after the current
task goes to sleep, without requiring remote CPU wakeups.

However, when select_idle_sibling() is called (especially with
SCHED_MC=y), if there's at least one idle core in the same package the
sync-woken task will be forcibly moved to run on a different idle
core, in turn destroying the "sync" information and all the work done
up to that point.

Without this patch, under such a workload there will be two different
CPUs running at ~50% utilization each, and the __wake_up_sync() hint
won't provide much benefit compared to a regular non-sync wakeup. With
this patch a single CPU is used at 100% utilization, which increases
performance for those common workloads.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..b2ac152a6935 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -691,7 +691,8 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #include "sched-pelt.h"
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu,
+			       int cpu, int target, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -1678,7 +1679,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		 */
 		local_irq_disable();
 		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
-						   env->dst_cpu);
+						   -1, env->dst_cpu, 0);
 		local_irq_enable();
 	}
 
@@ -6161,12 +6162,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int this_cpu,
+			       int target, int sync)
 {
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_idle_cpu(target) ||
+	    (sync && target == this_cpu && cpu_rq(this_cpu)->nr_running == 1))
 		return target;
 
 	/*
@@ -6649,7 +6652,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
 
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, cpu, new_cpu, sync);
 
 		if (want_affine)
 			current->recent_used_cpu = cpu;


* Re: [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups
  2019-01-09  3:49 [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups Andrea Arcangeli
  2019-01-09  3:49 ` [PATCH 1/1] " Andrea Arcangeli
@ 2019-01-09  4:19 ` Mike Galbraith
  2019-01-09 10:07   ` Mel Gorman
  2019-01-09 18:02   ` Andrea Arcangeli
  1 sibling, 2 replies; 6+ messages in thread
From: Mike Galbraith @ 2019-01-09  4:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Peter Zijlstra, Mel Gorman; +Cc: linux-kernel

On Tue, 2019-01-08 at 22:49 -0500, Andrea Arcangeli wrote:
> Hello,
> 
> we noticed some unexpected performance regressions in the scheduler
> after switching the guest CPU topology from "-smp 2,sockets=2,cores=1"
> to "-smp 2,sockets=1,cores=2".
*snip*
> To test I used this trivial program.

Which highlights the problem.  That proggy really is synchronous, but
the sync hint is applied to many MANY real world cases where this is
not the case at all.  Sure, you can make things like pipe_test and
nearly nonexistent payload TCP_RR numbers look gorgeous, but that
demolishes concurrency for real applications.

Even when any given wakeup really is a truly synchronous wakeup, does,
say, an application sending a byte down a pipe necessarily mean it's
not going to then execute in parallel with the recipient of that byte? 
I think not.  IMO, the sync hint should die.  The only time it is a win
is for applications that do essentially nothing but context switch.  I
don't think the real world plays a lot of high speed ping pong.


* Re: [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups
  2019-01-09  4:19 ` [PATCH 0/1] RFC: " Mike Galbraith
@ 2019-01-09 10:07   ` Mel Gorman
  2019-01-09 18:24     ` Andrea Arcangeli
  2019-01-09 18:02   ` Andrea Arcangeli
  1 sibling, 1 reply; 6+ messages in thread
From: Mel Gorman @ 2019-01-09 10:07 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andrea Arcangeli, Peter Zijlstra, linux-kernel

On Wed, Jan 09, 2019 at 05:19:48AM +0100, Mike Galbraith wrote:
> > To test I used this trivial program.
> 
> Which highlights the problem.  That proggy really is synchronous, but
> the sync hint is applied to many MANY real world cases where this is
> not the case at all.  Sure, you can make things like pipe_test and
> nearly nonexistent payload TCP_RR numbers look gorgeous, but that
> demolishes concurrency for real applications.
> 

I agree with Mike here. Many previous attempts to strictly obey the sync
hint have led to regressions elsewhere -- specifically, a task waking 2+
wakees that temporarily stack on one CPU while nearby CPUs sharing the
LLC remain idle. It's why the select_idle_sibling() logic tries to take
a recently used CPU into account, so such tasks can be woken there if
the recent CPU is still idle.
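
To make the stacking pattern concrete, a hypothetical reproducer (my
sketch, not taken from any of the actual regression reports) would be
a single waker feeding two wakees back to back:

	/* one-to-many.c: one waker sync-waking NWAKEES wakees in a row */
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	#define NWAKEES 2

	int main(void)
	{
		char buf[1] = { 0 };
		int to[NWAKEES][2], from[NWAKEES][2];
		int i, n;

		for (i = 0; i < NWAKEES; i++) {
			pipe(to[i]);
			pipe(from[i]);
			if (fork() == 0) {
				/* wakee: echo the token back */
				for (n = 0; n < 1000000; n++) {
					read(to[i][0], buf, 1);
					write(from[i][1], buf, 1);
				}
				return 0;
			}
		}

		for (n = 0; n < 1000000; n++) {
			/*
			 * Both pipe writes are sync wakeups; obeying the
			 * hint unconditionally would stack both wakees on
			 * the waker's CPU despite idle LLC siblings.
			 */
			for (i = 0; i < NWAKEES; i++)
				write(to[i][1], buf, 1);
			for (i = 0; i < NWAKEES; i++)
				read(from[i][0], buf, 1);
		}
		for (i = 0; i < NWAKEES; i++)
			wait(NULL);
		return 0;
	}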

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups
  2019-01-09  4:19 ` [PATCH 0/1] RFC: " Mike Galbraith
  2019-01-09 10:07   ` Mel Gorman
@ 2019-01-09 18:02   ` Andrea Arcangeli
  1 sibling, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2019-01-09 18:02 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Mel Gorman, linux-kernel

Hello Mike,

On Wed, Jan 09, 2019 at 05:19:48AM +0100, Mike Galbraith wrote:
> On Tue, 2019-01-08 at 22:49 -0500, Andrea Arcangeli wrote:
> > Hello,
> > 
> > we noticed some unexpected performance regressions in the scheduler by
> > switching the guest CPU topology from "-smp 2,sockets=2,cores=1" to
> > "-smp 2,sockets=1,cores=2".
*snip*
> > To test I used this trivial program.
> 
> Which highlights the problem.  That proggy really is synchronous, but

Note that I wrote the program only after the guest scheduler
regression was reported, purely in order to test the patch and to
reproduce the customer issue more easily (so I could see the effect by
just running top). AFAIK the regression was reported against a
real-life customer workload, and it was caused by the idle balancing
dropping the sync information.

If it were just a lat_ctx type of workload like the program I
attached I wouldn't care either, but this was a localhost UDP and TCP
(both bandwidth and latency) test that showed improvement from not
dropping the sync information through idle core balancing during
wakeups.
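
(A hypothetical way to reproduce something similar without the
customer workload, if anyone wants to experiment, would be loopback
netperf runs such as "netperf -H 127.0.0.1 -t TCP_RR" / "-t UDP_RR" /
"-t TCP_STREAM", compared across the two -smp topologies. That's my
suggestion, not the exact customer test.)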

There is no tunable that lets people test the sync information with
real workloads; the only way is to rebuild the kernel with SCHED_MC=n
(which nobody should be doing because it has other drawbacks) or to
alter the vCPU topology. So for now we're working to restore the
standard sockets-only topology to shut off the idle balancing without
having to patch the guest scheduler, but this looked like a more
general problem with room for improvement.

Ideally we should detect when the sync information is worth keeping
instead of always dropping it. Alternatively a sched_feat could be
added to achieve it manually.
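
Something along these lines is what I have in mind for the sched_feat
(a rough sketch; SYNC_WAKE_IDLE is a name I just made up, and none of
this is part of the posted patch):

	/* kernel/sched/features.h */
	SCHED_FEAT(SYNC_WAKE_IDLE, false)

	/* kernel/sched/fair.c, in select_idle_sibling() */
	if (available_idle_cpu(target) ||
	    (sched_feat(SYNC_WAKE_IDLE) && sync &&
	     target == this_cpu && cpu_rq(this_cpu)->nr_running == 1))
		return target;

That way it could be flipped at runtime through
/sys/kernel/debug/sched_features (with SCHED_DEBUG=y) instead of
requiring a rebuild.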

Thanks,
Andrea


* Re: [PATCH 0/1] RFC: sched/fair: skip select_idle_sibling() in presence of sync wakeups
  2019-01-09 10:07   ` Mel Gorman
@ 2019-01-09 18:24     ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2019-01-09 18:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Mike Galbraith, Peter Zijlstra, linux-kernel

On Wed, Jan 09, 2019 at 10:07:51AM +0000, Mel Gorman wrote:
> I agree with Mike here. Many previous attempts to strictly obey the sync
> hint have led to regressions elsewhere -- specifically, a task waking 2+
> wakees that temporarily stack on one CPU while nearby CPUs sharing the
> LLC remain idle.

sync-waking 2 wakees in a row before going to sleep should cause the
runqueue to have nr_running != 1 after the first wakee is woken up on
the local CPU. Your example explains the following nr_running == 1
check in the patch:

	    (sync && target == this_cpu && cpu_rq(this_cpu)->nr_running == 1))
					   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The implementation is just an RFC because it may have other drawbacks,
but I think the second wakee in this specific example should still do
the idle core/HT balancing normally, just like before the change.
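
Spelled out, assuming the waker starts out alone on its runqueue (a
hypothetical trace, not something I measured):

	/*
	 * wakeup of wakee A: sync && target == this_cpu &&
	 *                    nr_running == 1 -> true,
	 *                    A runs on the local CPU (nr_running becomes 2)
	 * wakeup of wakee B: nr_running == 2 -> the new condition is false,
	 *                    B falls through to the normal idle core/HT
	 *                    search exactly as before the patch
	 */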

Thanks,
Andrea

