From: Chris Mason <clm@fb.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Mike Galbraith <mgalbraith@suse.de>,
	<linux-kernel@vger.kernel.org>
Subject: sched: tweak select_idle_sibling to look for idle threads
Date: Sat, 9 Apr 2016 15:05:54 -0400
Message-ID: <20160409190554.honue3gtian2p6vr@floor.thefacebook.com>
In-Reply-To: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>

select_task_rq_fair() can leave cpu utilization a little lumpy,
especially as the workload ramps up to the maximum capacity of the
machine.  The end result can be high p99 response times as apps
wait to get scheduled, even when boxes are mostly idle.

I wrote schbench to try and measure this:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

The basic idea is to record the latency between when a thread is kicked
and when it actually gets the CPU.  For this patch I used a simple model
where a thread thinks for a while and then waits for data from another
thread.  The command line below will start two messenger threads with 18
workers per messenger:

./schbench -m 2 -t 18 -s 30000 -c 30000 -r 30
Latency percentiles (usec)
    50.0000th: 52
    75.0000th: 63
    90.0000th: 74
    95.0000th: 80
    *99.0000th: 118
    99.5000th: 707
    99.9000th: 5016
    Over=0, min=0, max=12941
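
To make the measurement model concrete, here is a minimal user-space
sketch of the kick-to-run timing described above.  This is not schbench
itself, just an illustration under simplified assumptions; the measured
delta also includes a little mutex overhead on top of the scheduler
wakeup latency.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready;
static struct timespec kicked_at;

static long long ts_diff_usec(const struct timespec *a,
			      const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000LL +
	       (b->tv_nsec - a->tv_nsec) / 1000;
}

/* worker: sleeps until kicked, then records how long the wakeup took */
static void *worker(void *arg)
{
	struct timespec woke_at;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	clock_gettime(CLOCK_MONOTONIC, &woke_at);
	printf("wakeup latency: %lld usec\n",
	       ts_diff_usec(&kicked_at, &woke_at));
	return NULL;
}

/* messenger: "thinks" for a while, then kicks the worker */
int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	usleep(30000);		/* think time, like the 30000 usec above */

	pthread_mutex_lock(&lock);
	clock_gettime(CLOCK_MONOTONIC, &kicked_at);	/* "kick" timestamp */
	ready = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);

	pthread_join(tid, NULL);
	return 0;
}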

p99 numbers here are acceptable, but you can see the tail starting to
climb.  If I use 19 workers per messenger, p99 goes through the roof.
This machine has two sockets with 10 cores each, so with HT on there are
40 logical CPUs; two messengers plus 2 x 19 = 38 workers means this
command line puts one pthread on each CPU:

./schbench -m 2 -t 19 -s 30000 -c 30000 -r 30
Latency percentiles (usec)
    50.0000th: 51
    75.0000th: 63
    90.0000th: 76
    95.0000th: 89
    *99.0000th: 2132
    99.5000th: 6920
    99.9000th: 12752
    Over=0, min=0, max=17079

This commit tries to solve that by doing an extra scan in
select_idle_sibling().  If it can find an idle SMT sibling on any core in
the package, it will return that:

./schbench -m2 -t 19 -c 30000 -s 30000 -r 30
Latency percentiles (usec)
    50.0000th: 65
    75.0000th: 104
    90.0000th: 115
    95.0000th: 118
    *99.0000th: 124
    99.5000th: 127
    99.9000th: 262
    Over=0, min=0, max=12987

This basically means the whole fleet can have one more pthread per socket
and still maintain acceptable latencies.  I can actually go up to -t 20,
but it's not as consistent:

./schbench -m2 -t 20 -c 30000 -s 30000 -r 30
Latency percentiles (usec)
    50.0000th: 50
    75.0000th: 63
    90.0000th: 75
    95.0000th: 81
    *99.0000th: 111
    99.5000th: 975
    99.9000th: 12464
    Over=0, min=0, max=18317

This does preserve the existing logic to prefer idle cores over idle SMT
threads, and it includes some checks to avoid the extra idle scan when
we're actually better off sharing a non-idle CPU with someone else.
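
Spelled out, the CPU selection order with this patch applied looks
roughly like the following.  This is a standalone sketch with stubbed
predicates, not the kernel code; the helper names here are stand-ins for
the logic in the diff below.

#include <stdio.h>

/* Stand-ins for the kernel predicates; the real logic lives in fair.c. */
static int target_is_idle(int target)           { (void)target; return 0; }
static int find_idle_core_in_llc(int target)    { (void)target; return -1; }
static int should_stay_on_target(int target)    { (void)target; return 0; }
static int find_any_idle_cpu_in_package(int t)  { (void)t; return 3; }

static int pick_cpu(int target)
{
	int cpu;

	/* 1) the caller's suggestion, if it is already idle */
	if (target_is_idle(target))
		return target;

	/* 2) a fully idle core in the LLC (existing behaviour, preferred) */
	cpu = find_idle_core_in_llc(target);
	if (cpu >= 0)
		return cpu;

	/*
	 * 3) new in this patch: any idle SMT thread in the package, unless
	 *    the bounce heuristic says sharing the target is the better bet
	 */
	if (!should_stay_on_target(target)) {
		cpu = find_any_idle_cpu_in_package(target);
		if (cpu >= 0)
			return cpu;
	}

	/* 4) fall back to the caller's suggestion */
	return target;
}

int main(void)
{
	printf("picked cpu %d\n", pick_cpu(0));
	return 0;
}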

Benchmarks in production show overall capacity going up by 2-5%,
depending on the metric.

Credit to Arun Sharma <asharma@fb.com> for the initial versions of this
patch.

Signed-off-by: Chris Mason <clm@fb.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b..2c47240 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4969,11 +4969,34 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 }
 
 /*
+ * helper for select_idle_sibling to decide if it should look for idle
+ * threads
+ */
+static int bounce_to_target(struct task_struct *p, int cpu)
+{
+	s64 delta;
+
+	/*
+	 * As the run queue gets bigger, it's more and more likely that
+	 * load balancing will have distributed things for us, and less
+	 * likely that scanning all our CPUs for an idle one will find
+	 * one.  So, if nr_running > 1, just call this CPU good enough.
+	 */
+	if (cpu_rq(cpu)->cfs.nr_running > 1)
+		return 1;
+
+	/* taken from task_hot() */
+	delta = rq_clock_task(task_rq(p)) - p->se.exec_start;
+	return delta < (s64)sysctl_sched_migration_cost;
+}
+
+/*
  * Try and locate an idle CPU in the sched_domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
+	struct sched_domain *package_sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
@@ -4989,7 +5012,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
+	package_sd = rcu_dereference(per_cpu(sd_llc, target));
+	sd = package_sd;
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
@@ -4998,7 +5022,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
+				/*
+				 * we tested target for idle up above,
+				 * but don't skip it here because it might
+				 * have raced to idle while we were scanning
+				 */
+				if (!idle_cpu(i))
 					goto next;
 			}
 
@@ -5009,6 +5038,24 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	/*
+	 * we're here because we didn't find an idle core, or an idle sibling
+	 * in the target core.  For message bouncing workloads, we want to
+	 * just stick with the target suggestion from the caller, but
+	 * otherwise we'd rather have an idle CPU from anywhere else in
+	 * the package.
+	 */
+	if (package_sd && !bounce_to_target(p, target)) {
+		for_each_cpu_and(i, sched_domain_span(package_sd),
+				 tsk_cpus_allowed(p)) {
+			if (idle_cpu(i)) {
+				target = i;
+				break;
+			}
+
+		}
+	}
 done:
 	return target;
 }

