From: Peter Zijlstra <peterz@infradead.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Rik van Riel <riel@redhat.com>,
	Eric Farman <farman@linux.vnet.ibm.com>,
	王金浦 <jinpuwang@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	"KVM-ML (kvm@vger.kernel.org)" <kvm@vger.kernel.org>,
	vcaputo@pengaru.com, Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Subject: Re: sysbench throughput degradation in 4.13+
Date: Wed, 4 Oct 2017 18:18:50 +0200
Message-ID: <20171004161850.wgnu73dokpjfyfdk@hirez.programming.kicks-ass.net>
In-Reply-To: <20171003083932.3qa7jw2spmi5n5pg@hirez.programming.kicks-ass.net>

On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit on the overloaded case, I've not
> yet tried to fix it.

WA_IDLE is my 'old' patch and what you all tested; WA_WEIGHT is the
addition -- based on the old scheme -- that I've tried in order to lift
the overloaded case (including hackbench).

It's not an unconditional win, but I'm tempted to default-enable
WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

But let me first write a Changelog for the below and queue that. Then
we can maybe run more things.
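
(With CONFIG_SCHED_DEBUG the feature bits from the patch below can be
flipped at runtime through debugfs, so the other combinations are easy
to try without rebuilding -- roughly:

  echo NO_WA_IDLE > /sys/kernel/debug/sched_features	# disable WA_IDLE
  echo WA_WEIGHT  > /sys/kernel/debug/sched_features	# enable WA_WEIGHT
  cat /sys/kernel/debug/sched_features			# show current flags

assuming debugfs is mounted in the usual place.)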



On my IVB-EP (2 nodes, 10 cores/node, 2 threads/core):


WA_IDLE && NO_WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       7.391856936 seconds time elapsed                                          ( +-  0.66% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 54524.6
TCP_SENDFILE-10 : Avg: 48185.2
TCP_SENDFILE-20 : Avg: 29031.2
TCP_SENDFILE-40 : Avg: 9819.72
TCP_SENDFILE-80 : Avg: 5355.3
TCP_STREAM-1 : Avg: 41448.3
TCP_STREAM-10 : Avg: 24123.2
TCP_STREAM-20 : Avg: 15834.5
TCP_STREAM-40 : Avg: 5583.91
TCP_STREAM-80 : Avg: 2329.66
TCP_RR-1 : Avg: 80473.5
TCP_RR-10 : Avg: 72660.5
TCP_RR-20 : Avg: 52607.1
TCP_RR-40 : Avg: 57199.2
TCP_RR-80 : Avg: 25330.3
UDP_RR-1 : Avg: 108266
UDP_RR-10 : Avg: 95480
UDP_RR-20 : Avg: 68770.8
UDP_RR-40 : Avg: 76231
UDP_RR-80 : Avg: 34578.3
UDP_STREAM-1 : Avg: 64684.3
UDP_STREAM-10 : Avg: 52701.2
UDP_STREAM-20 : Avg: 30376.4
UDP_STREAM-40 : Avg: 15685.8
UDP_STREAM-80 : Avg: 8415.13
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok
  2: [30 secs]     transactions:                        64057  (2135.17 per sec.)
  5: [30 secs]     transactions:                        144295 (4809.68 per sec.)
 10: [30 secs]     transactions:                        274768 (9158.59 per sec.)
 20: [30 secs]     transactions:                        437140 (14570.70 per sec.)
 40: [30 secs]     transactions:                        663949 (22130.56 per sec.)
 80: [30 secs]     transactions:                        629927 (20995.56 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        50389  (1679.58 per sec.)
  5: [30 secs]     transactions:                        113934 (3797.69 per sec.)
 10: [30 secs]     transactions:                        217606 (7253.22 per sec.)
 20: [30 secs]     transactions:                        335021 (11166.75 per sec.)
 40: [30 secs]     transactions:                        518355 (17277.28 per sec.)
 80: [30 secs]     transactions:                        513424 (17112.44 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 2
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 4
        min=0, max=86
avg worker transfer: 190227.78 ops/sec 743.08KB/s
rps: 1004.94 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.58 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.40 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.27 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.96 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.78 p95 (usec) 9552 p99 (usec) 9552 p95/cputime 31.84% p99/cputime 31.84%
rps: 1220.04 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.82 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1243.88 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 227584 p99 (usec) 239360 p95/cputime 758.61% p99/cputime 797.87%
Latency percentiles (usec)
        50.0000th: 62
        75.0000th: 101
        90.0000th: 108
        95.0000th: 112
        *99.0000th: 119
        99.5000th: 124
        99.9000th: 4920
        min=0, max=12987
Throughput 664.328 MB/sec  2 clients  2 procs  max_latency=0.076 ms
Throughput 1573.72 MB/sec  5 clients  5 procs  max_latency=0.102 ms
Throughput 2948.7 MB/sec  10 clients  10 procs  max_latency=0.198 ms
Throughput 4602.38 MB/sec  20 clients  20 procs  max_latency=1.712 ms
Throughput 9253.17 MB/sec  40 clients  40 procs  max_latency=2.047 ms
Throughput 8056.01 MB/sec  80 clients  80 procs  max_latency=35.819 ms


-----------------------

WA_IDLE && WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       6.500797532 seconds time elapsed                                          ( +-  0.97% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 52224.3
TCP_SENDFILE-10 : Avg: 46504.3
TCP_SENDFILE-20 : Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9253.12
TCP_SENDFILE-80 : Avg: 4687.4
TCP_STREAM-1 : Avg: 42254
TCP_STREAM-10 : Avg: 25847.9
TCP_STREAM-20 : Avg: 18374.4
TCP_STREAM-40 : Avg: 5599.57
TCP_STREAM-80 : Avg: 2726.41
TCP_RR-1 : Avg: 82638.8
TCP_RR-10 : Avg: 73265.1
TCP_RR-20 : Avg: 52634.5
TCP_RR-40 : Avg: 56302.3
TCP_RR-80 : Avg: 26867.9
UDP_RR-1 : Avg: 107844
UDP_RR-10 : Avg: 95245.2
UDP_RR-20 : Avg: 68673.7
UDP_RR-40 : Avg: 75419.1
UDP_RR-80 : Avg: 35639.1
UDP_STREAM-1 : Avg: 66606
UDP_STREAM-10 : Avg: 52959.5
UDP_STREAM-20 : Avg: 29704
UDP_STREAM-40 : Avg: 15266.5
UDP_STREAM-80 : Avg: 7388.97
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok 
  2: [30 secs]     transactions:                        64277  (2142.51 per sec.)
  5: [30 secs]     transactions:                        144010 (4800.19 per sec.)
 10: [30 secs]     transactions:                        274722 (9157.05 per sec.)
 20: [30 secs]     transactions:                        436325 (14543.55 per sec.)
 40: [30 secs]     transactions:                        665582 (22184.82 per sec.)
 80: [30 secs]     transactions:                        657185 (21904.18 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        51153  (1705.06 per sec.)
  5: [30 secs]     transactions:                        116403 (3879.93 per sec.)
 10: [30 secs]     transactions:                        217750 (7258.06 per sec.)
 20: [30 secs]     transactions:                        336619 (11220.00 per sec.)
 40: [30 secs]     transactions:                        520823 (17359.78 per sec.)
 80: [30 secs]     transactions:                        516690 (17221.16 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 3
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 5
        min=0, max=86
avg worker transfer: 185378.92 ops/sec 724.14KB/s
rps: 1004.82 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.51 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.38 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.23 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.89 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.73 p95 (usec) 9520 p99 (usec) 9552 p95/cputime 31.73% p99/cputime 31.84%
rps: 1220.05 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.85 p95 (usec) 14960 p99 (usec) 14960 p95/cputime 49.87% p99/cputime 49.87%
rps: 1243.86 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 213760 p99 (usec) 225024 p95/cputime 712.53% p99/cputime 750.08%
Latency percentiles (usec)
        50.0000th: 66
        75.0000th: 101
        90.0000th: 107
        95.0000th: 112
        *99.0000th: 120
        99.5000th: 126
        99.9000th: 390
        min=0, max=12964
Throughput 678.413 MB/sec  2 clients  2 procs  max_latency=0.105 ms
Throughput 1589.98 MB/sec  5 clients  5 procs  max_latency=0.084 ms
Throughput 3012.51 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4555.93 MB/sec  20 clients  20 procs  max_latency=0.515 ms
Throughput 8496.23 MB/sec  40 clients  40 procs  max_latency=2.040 ms
Throughput 8601.62 MB/sec  80 clients  80 procs  max_latency=2.712 ms


---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 131 ++++++++++++-----------------------------
 kernel/sched/features.h        |   2 +
 3 files changed, 39 insertions(+), 102 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..a1a6b6f52660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5638,91 +5638,60 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
+/*
+ * The purpose of wake_affine() is to quickly determine on which CPU we can run
+ * soonest. For the purpose of speed we only consider the waking and previous
+ * CPU.
+ *
+ * wake_affine_idle() - only considers 'now', it checks if the waking CPU is (or
+ * 			will be) idle.
+ *
+ * wake_affine_weight() - considers the weight to reflect the average
+ * 			  scheduling latency of the CPUs. This seems to work
+ * 			  for the overloaded case.
+ */
 
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
+static bool
+wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
+		 int this_cpu, int prev_cpu, int sync)
 {
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
+	if (idle_cpu(this_cpu))
+		return true;
 
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
+	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		return true;
 
-	return true;
+	return false;
 }
 
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
 static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
+wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
+		   int this_cpu, int prev_cpu, int sync)
 {
-	struct llc_stats prev_stats, this_stats;
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
+	this_eff_load = target_load(this_cpu, sd->wake_idx);
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_eff_load)
 			return true;
 
-		this_stats.load -= current_load;
+		this_eff_load -= current_load;
 	}
 
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
 	task_load = task_h_load(p);
 
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
+	this_eff_load += task_load;
+	this_eff_load *= 100;
+	this_eff_load *= capacity_of(prev_cpu);
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	prev_eff_load -= task_load;
+	prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -5731,22 +5700,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
+	bool affine = false;
 
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	if (sched_feat(WA_IDLE) && !affine)
+		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
 
-	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
-	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && !affine)
+		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7895,7 +7855,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7967,22 +7926,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..d40d33ec935f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(WA_IDLE, true)
+SCHED_FEAT(WA_WEIGHT, false)
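
FWIW, leaving the sync adjustment aside (it first subtracts the waker's
load from this_cpu), the wake_affine_weight() condition above boils down
to:

	(target_load(this_cpu, sd->wake_idx) + task_h_load(p)) *
		100 * capacity_of(prev_cpu)
	<=
	(source_load(prev_cpu, sd->wake_idx) - task_h_load(p)) *
		(100 + (sd->imbalance_pct - 100) / 2) * capacity_of(this_cpu)

With equal capacities and, purely for illustration, imbalance_pct == 125
that means the task only gets pulled to the waking CPU when its projected
load there is at most 112% of what would remain on prev_cpu, i.e. a
modest bias towards leaving it where it was.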


Thread overview: 22+ messages
2017-09-12 14:14 sysbench throughput degradation in 4.13+ Eric Farman
2017-09-13  8:24 ` 王金浦
2017-09-22 15:03   ` Eric Farman
2017-09-22 15:53     ` Peter Zijlstra
2017-09-22 16:12       ` Eric Farman
2017-09-27  9:35         ` Peter Zijlstra
2017-09-27 16:27           ` Eric Farman
2017-09-28  9:08             ` Christian Borntraeger
2017-09-27 17:58           ` Rik van Riel
2017-09-28 11:04             ` Eric Farman
2017-09-28 12:36             ` Peter Zijlstra
2017-09-28 12:37             ` Peter Zijlstra
2017-10-02 22:53             ` Matt Fleming
2017-10-03  8:39               ` Peter Zijlstra
2017-10-03 16:02                 ` Rik van Riel
2017-10-04 16:18                 ` Peter Zijlstra [this message]
2017-10-04 18:02                   ` Rik van Riel
2017-10-06 10:36                   ` Matt Fleming
2017-10-10 14:51                     ` Matt Fleming
2017-10-10 15:16                       ` Peter Zijlstra
2017-10-10 17:26                         ` Ingo Molnar
2017-10-10 17:40                           ` Christian Borntraeger
