Subject: [RFC][PATCH 8/7] sched/fair: Use utilization distance to filter affine sync wakeups
From: Mike Galbraith
To: Peter Zijlstra, mingo@kernel.org, linux-kernel@vger.kernel.org
Cc: clm@fb.com, matt@codeblueprint.co.uk, tglx@linutronix.de, fweisbec@gmail.com
Date: Wed, 18 May 2016 07:51:34 +0200
Message-ID: <1463550694.4012.7.camel@suse.de>
In-Reply-To: <20160509104807.284575300@infradead.org>
References: <20160509104807.284575300@infradead.org>

On Mon, 2016-05-09 at 12:48 +0200, Peter Zijlstra wrote:
> Hai,

(got some of the frozen variety handy?:)

> here be a semi coherent patch series for the recent select_idle_siblings()
> tinkering. Happy benchmarking..

And here's some tinkering on top of your rewrite series...

sched/fair: Use utilization distance to filter affine sync wakeups

Identifying truly synchronous tasks accurately is annoyingly fragile, which
led to the demise of the old avg_overlap heuristic, and means we now schedule
high frequency localhost communicating buddies to L3 vs L2, causing them to
take painful cache misses needlessly.

To combat this, track average utilization distance, and when both waker and
wakee are short duration tasks cycling at roughly the same frequency (ie they
can't have any appreciable reclaimable overlap), and the sync hint has been
passed, take that as a cue that pulling the wakee to hot L2 is very likely to
be a win.  Changes in behavior, such as taking a long nap, bursts of other
activity, or sharing the rq with tasks that are not cycling rapidly, will
quickly increase the utilization distance and encourage the pair to search
for a new home, where they can again find each other.

This only helps really fast movers, but that's ok (if we can get away with it
at all), as these are the ones that need the help.  It's dirt simple, cheap,
and seems to work pretty well: it does help fast movers, does not wreck the
lmbench AF_UNIX/TCP throughput gains that select_idle_sibling() provided, and
didn't change pgbench numbers one bit on my desktop box.  Ie the tight
discrimination criteria seem to work out ok in light testing, so it's _maybe_
not completely useless...
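For illustration only (not part of the patch): a minimal userspace sketch of
the filter being added, using the same 1/8-weight EWMA as update_avg() and the
same 10us/2us thresholds as select_idle_sync() further down.  The helper names
and sample values here are invented for the example.

#include <stdint.h>
#include <stdio.h>

/* Same 1/8-weight EWMA the patch reuses via update_avg(). */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff >> 3;
}

/* Mirror of select_idle_sync()'s eligibility test, values in usecs. */
static int sync_wakeup_eligible(uint64_t waker_dist, uint64_t wakee_dist)
{
	uint64_t max = waker_dist, min = wakee_dist;

	if (min > max) {
		uint64_t tmp = min;
		min = max;
		max = tmp;
	}
	return !(max + min > 10 || max - min > 2);
}

int main(void)
{
	uint64_t waker = 4, wakee = 3;	/* hypothetical averages, in usecs */

	printf("tight pair eligible: %d\n", sync_wakeup_eligible(waker, wakee));
	update_avg(&waker, 1000);	/* waker blocks for ~1ms once */
	printf("after a long wait:   %d\n", sync_wakeup_eligible(waker, wakee));
	return 0;
}

A single long block is enough to blow the waker's average out past the
thresholds, so the pair falls back to the normal idle search.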
4 x E7-8890 tbench

Throughput   598.158 MB/sec    1 clients    1 procs  max_latency=0.287 ms   1.000
Throughput  1166.26  MB/sec    2 clients    2 procs  max_latency=0.076 ms   1.000
Throughput  2214.55  MB/sec    4 clients    4 procs  max_latency=0.087 ms   1.000
Throughput  4264.44  MB/sec    8 clients    8 procs  max_latency=0.164 ms   1.000
Throughput  7780.58  MB/sec   16 clients   16 procs  max_latency=0.109 ms   1.000
Throughput 15199.3   MB/sec   32 clients   32 procs  max_latency=0.293 ms   1.000
Throughput 21714.8   MB/sec   64 clients   64 procs  max_latency=0.872 ms   1.000
Throughput 44916.1   MB/sec  128 clients  128 procs  max_latency=4.821 ms   1.000
Throughput 76294.5   MB/sec  256 clients  256 procs  max_latency=7.375 ms   1.000

+IDLE_SYNC
Throughput   737.781 MB/sec    1 clients    1 procs  max_latency=0.248 ms   1.233
Throughput  1478.49  MB/sec    2 clients    2 procs  max_latency=0.321 ms   1.267
Throughput  2506.98  MB/sec    4 clients    4 procs  max_latency=0.413 ms   1.132
Throughput  4359.15  MB/sec    8 clients    8 procs  max_latency=0.306 ms   1.022
Throughput  9025.05  MB/sec   16 clients   16 procs  max_latency=0.349 ms   1.159
Throughput 18703.1   MB/sec   32 clients   32 procs  max_latency=0.290 ms   1.230
Throughput 33600.8   MB/sec   64 clients   64 procs  max_latency=6.469 ms   1.547
Throughput 59084.3   MB/sec  128 clients  128 procs  max_latency=5.031 ms   1.315
Throughput 75705.8   MB/sec  256 clients  256 procs  max_latency=24.113 ms  0.992

1 x i4790 lmbench3

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
IDLE_CORE+IDLE_CPU+IDLE_SMT
homer     4.6.0-masterx 6027 14.K 9773 8905.2  15.2K  10.1K 6775.0 15.K 10.0K
homer     4.6.0-masterx 5962 14.K 9881 8900.7  15.0K  10.1K 6785.2 15.K 10.0K
homer     4.6.0-masterx 5935 14.K 9917 8946.2  15.0K  10.1K 6761.8 15.K 9826.
+IDLE_SYNC
homer     4.6.0-masterx 8865 14.K 9807 8880.6  14.7K  10.1K 6777.9 15.K 9966.
homer     4.6.0-masterx 8855 13.K 9856 8844.5  15.2K  10.1K 6752.1 15.K 10.0K
homer     4.6.0-masterx 8896 14.K 9836 8880.1  15.0K  10.2K 6771.6 15.K 9941.
                        ^++  ^+-  ^+-
select_idle_sibling() completely disabled
homer     4.6.0-masterx 8810 9807 7109 8982.8  15.4K  10.2K 6831.7 15.K 10.1K
homer     4.6.0-masterx 8877 9757 6864 8970.1  15.3K  10.2K 6826.6 15.K 10.1K
homer     4.6.0-masterx 8779 9736 10.K 8975.6  15.4K  10.1K 6830.2 15.K 10.1K
                        ^++  ^--  ^--

Signed-off-by: Mike Galbraith
---
 include/linux/sched.h   |    2 -
 kernel/sched/core.c     |    6 -----
 kernel/sched/fair.c     |   49 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    1 
 kernel/sched/sched.h    |    7 ++++++
 5 files changed, 51 insertions(+), 14 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1302,7 +1302,7 @@ struct load_weight {
  * issues.
  */
 struct sched_avg {
-	u64 last_update_time, load_sum;
+	u64 last_update_time, load_sum, util_dist_us;
 	u32 util_sum, period_contrib;
 	unsigned long load_avg, util_avg;
 };
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,12 +1706,6 @@ int select_task_rq(struct task_struct *p
 	return cpu;
 }
 
-static void update_avg(u64 *avg, u64 sample)
-{
-	s64 diff = sample - *avg;
-	*avg += diff >> 3;
-}
-
 #else
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -658,7 +658,7 @@ static u64 sched_vslice(struct cfs_rq *c
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -689,6 +689,7 @@ void init_entity_runnable_average(struct
 	 */
 	sa->util_avg = 0;
 	sa->util_sum = 0;
+	sa->util_dist_us = USEC_PER_SEC;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
@@ -1509,7 +1510,7 @@ static void task_numa_compare(struct tas
 		 * can be used from IRQ context.
 		 */
 		local_irq_disable();
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, 0);
 		local_irq_enable();
 	}
 
@@ -2734,6 +2735,13 @@ __update_load_avg(u64 now, int cpu, stru
 		return 0;
 	sa->last_update_time = now;
 
+	/*
+	 * Track avg utilization distance for select_idle_sibling() to try
+	 * to identify short running synchronous tasks that will perform
+	 * much better when migrated to tasty hot L2 data.
+	 */
+	update_avg(&sa->util_dist_us, min(delta, (u64)USEC_PER_SEC));
+
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
@@ -5408,17 +5416,43 @@ static int select_idle_cpu(struct task_s
 	return cpu;
 }
 
+static int select_idle_sync(struct task_struct *p, int target)
+{
+	u64 max = current->se.avg.util_dist_us;
+	u64 min = p->se.avg.util_dist_us;
+
+	if (min > max)
+		swap(min, max);
+
+	/*
+	 * If waker/wakee are both short duration and very similar,
+	 * prefer L2.  If either spends much time waiting or changes
+	 * behavior, utilization distances should quickly increase,
+	 * rendering them ineligible.
+	 */
+	if (max + min > 10 || max - min > 2)
+		return -1;
+	return target;
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, int sync)
 {
 	struct sched_domain *sd;
-	int start, i = task_cpu(p);
+	int start, i;
 
 	if (idle_cpu(target))
 		return target;
 
+	if (sched_feat(IDLE_SYNC) && sync) {
+		i = select_idle_sync(p, target);
+		if (i != -1)
+			return i;
+	}
+
+	i = task_cpu(p);
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
@@ -5573,9 +5607,10 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (!sd) {
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
-
+		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
+			sync &= want_affine;
+			new_cpu = select_idle_sibling(p, new_cpu, sync);
+		}
 	} else while (sd) {
 		struct sched_group *group;
 		int weight;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -78,4 +78,5 @@ SCHED_FEAT(AVG_CPU, true)
 SCHED_FEAT(PRINT_AVG, false)
 
 SCHED_FEAT(IDLE_SMT, true)
+SCHED_FEAT(IDLE_SYNC, true)
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1848,3 +1848,10 @@ static inline void update_idle_core(stru
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
+
+static inline void update_avg(u64 *avg, u64 sample)
+{
+	s64 diff = sample - *avg;
+	*avg += diff >> 3;
+}
+
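For A/B testing, and assuming CONFIG_SCHED_DEBUG=y with debugfs mounted at
/sys/kernel/debug, the new bit should be switchable at runtime like any other
SCHED_FEAT by writing IDLE_SYNC / NO_IDLE_SYNC to the sched_features file.  A
throwaway userspace sketch (not part of the patch):

#include <stdio.h>

/* Write a feature name (or NO_<name>) to the sched_features debugfs file. */
static int set_sched_feat(const char *feat)
{
	FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", feat);
	return fclose(f);
}

int main(void)
{
	return set_sched_feat("NO_IDLE_SYNC") ? 1 : 0;	/* disable for baseline runs */
}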