From: Chris Mason <clm@fb.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Mike Galbraith <mgalbraith@suse.de>,
	<linux-kernel@vger.kernel.org>
Subject: sched: tweak select_idle_sibling to look for idle threads
Date: Sat, 9 Apr 2016 15:05:54 -0400
Message-ID: <20160409190554.honue3gtian2p6vr@floor.thefacebook.com>
In-Reply-To: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>

select_task_rq_fair() can leave cpu utilization a little lumpy,
especially as the workload ramps up to the maximum capacity of the
machine.  The end result can be high p99 response times as apps
wait to get scheduled, even when boxes are mostly idle.

I wrote schbench to try and measure this:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

The basic idea is to record the latency between when a thread is kicked
and when it actually gets the CPU.  For this patch I used a simple model
where a thread thinks for a while and then waits for data from another
thread.  The command line below will start two messenger threads with 18
workers per messenger:

./schbench -m 2 -t 18 -s 30000 -c 30000 -r 30
Latency percentiles (usec)
    50.0000th: 52
    75.0000th: 63
    90.0000th: 74
    95.0000th: 80
    *99.0000th: 118
    99.5000th: 707
    99.9000th: 5016
    Over=0, min=0, max=12941
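
To make the measurement model concrete, here is a minimal user-space
sketch of the kick-to-run timing described above.  This is not schbench
itself, just an illustration under simplified assumptions; the measured
delta also includes a little mutex overhead on top of the scheduler
wakeup latency.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;
static struct timespec kicked_at;

static long long ts_diff_usec(const struct timespec *a,
			      const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000LL +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

/* worker: sleeps until kicked, then records how long the wakeup took */
static void *worker(void *arg)
{
	struct timespec woke_at;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	clock_gettime(CLOCK_MONOTONIC, &woke_at);
	printf("wakeup latency: %lld usec\n",
	       ts_diff_usec(&kicked_at, &woke_at));
	return NULL;
}

/* messenger: "thinks" for a while, then kicks the worker */
int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	usleep(30000);		/* think time, like the 30000 usec above */

	pthread_mutex_lock(&lock);
	clock_gettime(CLOCK_MONOTONIC, &kicked_at);	/* "kick" timestamp */
	ready = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);

	pthread_join(tid, NULL);
	return 0;
}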

p99 numbers here are acceptable, but you can see the tail starting to
climb.  If I use 19 workers per messenger, p99 goes through the roof.
This machine has two sockets with 10 cores each, so with HT on there are
40 logical CPUs; two messengers plus 2 x 19 = 38 workers means this
command line puts one pthread on each CPU:

./schbench -m 2 -t 19 -s 30000 -c 30000 -r 30
Latency percentiles (usec)
    50.0000th: 51
    75.0000th: 63
    90.0000th: 76
    95.0000th: 89
    *99.0000th: 2132
    99.5000th: 6920
    99.9000th: 12752
    Over=0, min=0, max=17079

This commit tries to solve that by doing an extra scan in
select_idle_sibling().  If it can find an idle SMT sibling on any core in
the package, it will return that:

./schbench -m2 -t 19 -c 30000 -s 30000 -r 30
Latency percentiles (usec)
    50.0000th: 65
    75.0000th: 104
    90.0000th: 115
    95.0000th: 118
    *99.0000th: 124
    99.5000th: 127
    99.9000th: 262
    Over=0, min=0, max=12987

This basically means the whole fleet can have one more pthread per socket
and still maintain acceptable latencies.  I can actually go up to -t 20,
but it's not as consistent:

./schbench -m2 -t 20 -c 30000 -s 30000 -r 30
Latency percentiles (usec)
    50.0000th: 50
    75.0000th: 63
    90.0000th: 75
    95.0000th: 81
    *99.0000th: 111
    99.5000th: 975
    99.9000th: 12464
    Over=0, min=0, max=18317

This does preserve the existing logic to prefer idle cores over idle SMT
threads, and it includes some checks to avoid the extra idle scan when
we're actually better off sharing a non-idle CPU with someone else.
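
Spelled out, the CPU selection order with this patch applied looks
roughly like the following.  This is a standalone sketch with stubbed
predicates, not the kernel code; the helper names here are stand-ins for
the logic in the diff below.

#include <stdio.h>

/* Stand-ins for the kernel predicates; the real logic lives in fair.c. */
static int target_is_idle(int target)           { (void)target; return 0; }
static int find_idle_core_in_llc(int target)    { (void)target; return -1; }
static int should_stay_on_target(int target)    { (void)target; return 0; }
static int find_any_idle_cpu_in_package(int t)  { (void)t; return 3; }

static int pick_cpu(int target)
{
	int cpu;

	/* 1) the caller's suggestion, if it is already idle */
	if (target_is_idle(target))
		return target;

	/* 2) a fully idle core in the LLC (existing behaviour, preferred) */
	cpu = find_idle_core_in_llc(target);
	if (cpu >= 0)
		return cpu;

	/*
	 * 3) new in this patch: any idle SMT thread in the package, unless
	 *    the bounce heuristic says sharing the target is the better bet
	 */
	if (!should_stay_on_target(target)) {
		cpu = find_any_idle_cpu_in_package(target);
		if (cpu >= 0)
			return cpu;
	}

	/* 4) fall back to the caller's suggestion */
	return target;
}

int main(void)
{
	printf("picked cpu %d\n", pick_cpu(0));
	return 0;
}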

Benchmarks in production show overall capacity going up by 2-5%,
depending on the metric.

Credit to Arun Sharma <asharma@fb.com> for the initial versions of this
patch.

Signed-off-by: Chris Mason <clm@fb.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..2c47240 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4969,11 +4969,34 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 }
 
 /*
+ * helper for select_idle_sibling to decide if it should look for idle
+ * threads
+ */
+static int bounce_to_target(struct task_struct *p, int cpu)
+{
+	s64 delta;
+
+	/*
+	 * As the run queue gets bigger, it's more and more likely that
+	 * load balancing will have distributed things for us, and less
+	 * likely that scanning all our CPUs for an idle one will find
+	 * one.  So, if nr_running > 1, just call this CPU good enough.
+	 */
+	if (cpu_rq(cpu)->cfs.nr_running > 1)
+		return 1;
+
+	/* taken from task_hot() */
+	delta = rq_clock_task(task_rq(p)) - p->se.exec_start;
+	return delta < (s64)sysctl_sched_migration_cost;
+}
+
+/*
  * Try and locate an idle CPU in the sched_domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
+	struct sched_domain *package_sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
@@ -4989,7 +5012,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
+	package_sd = rcu_dereference(per_cpu(sd_llc, target));
+	sd = package_sd;
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
@@ -4998,7 +5022,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
+				/*
+				 * we tested target for idle up above,
+				 * but don't skip it here because it might
+				 * have raced to idle while we were scanning
+				 */
+				if (!idle_cpu(i))
 					goto next;
 			}
 
@@ -5009,6 +5038,24 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	/*
+	 * we're here because we didn't find an idle core, or an idle sibling
+	 * in the target core.  For message bouncing workloads, we want to
+	 * just stick with the target suggestion from the caller, but
+	 * otherwise we'd rather have an idle CPU from anywhere else in
+	 * the package.
+	 */
+	if (package_sd && !bounce_to_target(p, target)) {
+		for_each_cpu_and(i, sched_domain_span(package_sd),
+				 tsk_cpus_allowed(p)) {
+			if (idle_cpu(i)) {
+				target = i;
+				break;
+			}
+
+		}
+	}
 done:
 	return target;
 }

