From: Peter Zijlstra <peterz@infradead.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Rik van Riel <riel@redhat.com>,
	Eric Farman <farman@linux.vnet.ibm.com>,
	王金浦 <jinpuwang@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	"KVM-ML (kvm@vger.kernel.org)" <kvm@vger.kernel.org>,
	vcaputo@pengaru.com, Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Subject: Re: sysbench throughput degradation in 4.13+
Date: Wed, 4 Oct 2017 18:18:50 +0200
Message-ID: <20171004161850.wgnu73dokpjfyfdk@hirez.programming.kicks-ass.net>
In-Reply-To: <20171003083932.3qa7jw2spmi5n5pg@hirez.programming.kicks-ass.net>

On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit on the overloaded case, I've not
> yet tried to fix it.

WA_IDLE is my 'old' patch and what you all tested; WA_WEIGHT is the
addition -- based on the old scheme -- that I've tried in order to lift
the overloaded case (including hackbench).

It's not an unconditional win, but I'm tempted to default-enable
WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

But let me first write a Changelog for the below and queue that. Then
we can maybe run more things.
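
(With CONFIG_SCHED_DEBUG the feature bits from the patch below can be
flipped at runtime through debugfs, so the other combinations are easy
to try without rebuilding -- roughly:

  echo NO_WA_IDLE > /sys/kernel/debug/sched_features	# disable WA_IDLE
  echo WA_WEIGHT  > /sys/kernel/debug/sched_features	# enable WA_WEIGHT
  cat /sys/kernel/debug/sched_features			# show current flags

assuming debugfs is mounted in the usual place.)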



On my IVB-EP (2 nodes, 10 cores/node, 2 threads/core):


WA_IDLE && NO_WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       7.391856936 seconds time elapsed                                          ( +-  0.66% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 54524.6
TCP_SENDFILE-10 : Avg: 48185.2
TCP_SENDFILE-20 : Avg: 29031.2
TCP_SENDFILE-40 : Avg: 9819.72
TCP_SENDFILE-80 : Avg: 5355.3
TCP_STREAM-1 : Avg: 41448.3
TCP_STREAM-10 : Avg: 24123.2
TCP_STREAM-20 : Avg: 15834.5
TCP_STREAM-40 : Avg: 5583.91
TCP_STREAM-80 : Avg: 2329.66
TCP_RR-1 : Avg: 80473.5
TCP_RR-10 : Avg: 72660.5
TCP_RR-20 : Avg: 52607.1
TCP_RR-40 : Avg: 57199.2
TCP_RR-80 : Avg: 25330.3
UDP_RR-1 : Avg: 108266
UDP_RR-10 : Avg: 95480
UDP_RR-20 : Avg: 68770.8
UDP_RR-40 : Avg: 76231
UDP_RR-80 : Avg: 34578.3
UDP_STREAM-1 : Avg: 64684.3
UDP_STREAM-10 : Avg: 52701.2
UDP_STREAM-20 : Avg: 30376.4
UDP_STREAM-40 : Avg: 15685.8
UDP_STREAM-80 : Avg: 8415.13
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok
  2: [30 secs]     transactions:                        64057  (2135.17 per sec.)
  5: [30 secs]     transactions:                        144295 (4809.68 per sec.)
 10: [30 secs]     transactions:                        274768 (9158.59 per sec.)
 20: [30 secs]     transactions:                        437140 (14570.70 per sec.)
 40: [30 secs]     transactions:                        663949 (22130.56 per sec.)
 80: [30 secs]     transactions:                        629927 (20995.56 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        50389  (1679.58 per sec.)
  5: [30 secs]     transactions:                        113934 (3797.69 per sec.)
 10: [30 secs]     transactions:                        217606 (7253.22 per sec.)
 20: [30 secs]     transactions:                        335021 (11166.75 per sec.)
 40: [30 secs]     transactions:                        518355 (17277.28 per sec.)
 80: [30 secs]     transactions:                        513424 (17112.44 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 2
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 4
        min=0, max=86
avg worker transfer: 190227.78 ops/sec 743.08KB/s
rps: 1004.94 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.58 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.40 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.27 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.96 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.78 p95 (usec) 9552 p99 (usec) 9552 p95/cputime 31.84% p99/cputime 31.84%
rps: 1220.04 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.82 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1243.88 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 227584 p99 (usec) 239360 p95/cputime 758.61% p99/cputime 797.87%
Latency percentiles (usec)
        50.0000th: 62
        75.0000th: 101
        90.0000th: 108
        95.0000th: 112
        *99.0000th: 119
        99.5000th: 124
        99.9000th: 4920
        min=0, max=12987
Throughput 664.328 MB/sec  2 clients  2 procs  max_latency=0.076 ms
Throughput 1573.72 MB/sec  5 clients  5 procs  max_latency=0.102 ms
Throughput 2948.7 MB/sec  10 clients  10 procs  max_latency=0.198 ms
Throughput 4602.38 MB/sec  20 clients  20 procs  max_latency=1.712 ms
Throughput 9253.17 MB/sec  40 clients  40 procs  max_latency=2.047 ms
Throughput 8056.01 MB/sec  80 clients  80 procs  max_latency=35.819 ms


-----------------------

WA_IDLE && WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       6.500797532 seconds time elapsed                                          ( +-  0.97% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 52224.3
TCP_SENDFILE-10 : Avg: 46504.3
TCP_SENDFILE-20 : Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9253.12
TCP_SENDFILE-80 : Avg: 4687.4
TCP_STREAM-1 : Avg: 42254
TCP_STREAM-10 : Avg: 25847.9
TCP_STREAM-20 : Avg: 18374.4
TCP_STREAM-40 : Avg: 5599.57
TCP_STREAM-80 : Avg: 2726.41
TCP_RR-1 : Avg: 82638.8
TCP_RR-10 : Avg: 73265.1
TCP_RR-20 : Avg: 52634.5
TCP_RR-40 : Avg: 56302.3
TCP_RR-80 : Avg: 26867.9
UDP_RR-1 : Avg: 107844
UDP_RR-10 : Avg: 95245.2
UDP_RR-20 : Avg: 68673.7
UDP_RR-40 : Avg: 75419.1
UDP_RR-80 : Avg: 35639.1
UDP_STREAM-1 : Avg: 66606
UDP_STREAM-10 : Avg: 52959.5
UDP_STREAM-20 : Avg: 29704
UDP_STREAM-40 : Avg: 15266.5
UDP_STREAM-80 : Avg: 7388.97
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok 
  2: [30 secs]     transactions:                        64277  (2142.51 per sec.)
  5: [30 secs]     transactions:                        144010 (4800.19 per sec.)
 10: [30 secs]     transactions:                        274722 (9157.05 per sec.)
 20: [30 secs]     transactions:                        436325 (14543.55 per sec.)
 40: [30 secs]     transactions:                        665582 (22184.82 per sec.)
 80: [30 secs]     transactions:                        657185 (21904.18 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        51153  (1705.06 per sec.)
  5: [30 secs]     transactions:                        116403 (3879.93 per sec.)
 10: [30 secs]     transactions:                        217750 (7258.06 per sec.)
 20: [30 secs]     transactions:                        336619 (11220.00 per sec.)
 40: [30 secs]     transactions:                        520823 (17359.78 per sec.)
 80: [30 secs]     transactions:                        516690 (17221.16 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 3
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 5
        min=0, max=86
avg worker transfer: 185378.92 ops/sec 724.14KB/s
rps: 1004.82 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.51 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.38 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.23 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.89 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.73 p95 (usec) 9520 p99 (usec) 9552 p95/cputime 31.73% p99/cputime 31.84%
rps: 1220.05 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.85 p95 (usec) 14960 p99 (usec) 14960 p95/cputime 49.87% p99/cputime 49.87%
rps: 1243.86 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 213760 p99 (usec) 225024 p95/cputime 712.53% p99/cputime 750.08%
Latency percentiles (usec)
        50.0000th: 66
        75.0000th: 101
        90.0000th: 107
        95.0000th: 112
        *99.0000th: 120
        99.5000th: 126
        99.9000th: 390
        min=0, max=12964
Throughput 678.413 MB/sec  2 clients  2 procs  max_latency=0.105 ms
Throughput 1589.98 MB/sec  5 clients  5 procs  max_latency=0.084 ms
Throughput 3012.51 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4555.93 MB/sec  20 clients  20 procs  max_latency=0.515 ms
Throughput 8496.23 MB/sec  40 clients  40 procs  max_latency=2.040 ms
Throughput 8601.62 MB/sec  80 clients  80 procs  max_latency=2.712 ms


---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 131 ++++++++++++-----------------------------
 kernel/sched/features.h        |   2 +
 3 files changed, 39 insertions(+), 102 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..a1a6b6f52660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5638,91 +5638,60 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
+/*
+ * The purpose of wake_affine() is to quickly determine on which CPU we can run
+ * soonest. For the purpose of speed we only consider the waking and previous
+ * CPU.
+ *
+ * wake_affine_idle() - only considers 'now', it checks if the waking CPU is (or
+ * 			will be) idle.
+ *
+ * wake_affine_weight() - considers the weight to reflect the average
+ * 			  scheduling latency of the CPUs. This seems to work
+ * 			  for the overloaded case.
+ */
 
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
+static bool
+wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
+		 int this_cpu, int prev_cpu, int sync)
 {
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
+	if (idle_cpu(this_cpu))
+		return true;
 
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
+	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		return true;
 
-	return true;
+	return false;
 }
 
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
 static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
+wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
+		   int this_cpu, int prev_cpu, int sync)
 {
-	struct llc_stats prev_stats, this_stats;
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
+	this_eff_load = target_load(this_cpu, sd->wake_idx);
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_eff_load)
 			return true;
 
-		this_stats.load -= current_load;
+		this_eff_load -= current_load;
 	}
 
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
 	task_load = task_h_load(p);
 
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
+	this_eff_load += task_load;
+	this_eff_load *= 100;
+	this_eff_load *= capacity_of(prev_cpu);
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	prev_eff_load -= task_load;
+	prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -5731,22 +5700,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
+	bool affine = false;
 
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	if (sched_feat(WA_IDLE) && !affine)
+		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
 
-	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
-	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && !affine)
+		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7895,7 +7855,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7967,22 +7926,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..d40d33ec935f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(WA_IDLE, true)
+SCHED_FEAT(WA_WEIGHT, false)
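
FWIW, leaving the sync adjustment aside (it first subtracts the waker's
load from this_cpu), the wake_affine_weight() condition above boils down
to:

	(target_load(this_cpu, sd->wake_idx) + task_h_load(p)) *
		100 * capacity_of(prev_cpu)
	<=
	(source_load(prev_cpu, sd->wake_idx) - task_h_load(p)) *
		(100 + (sd->imbalance_pct - 100) / 2) * capacity_of(this_cpu)

With equal capacities and, purely for illustration, imbalance_pct == 125
that means the task only gets pulled to the waking CPU when its projected
load there is at most 112% of what would remain on prev_cpu, i.e. a
modest bias towards leaving it where it was.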


Thread overview: 22+ messages
2017-09-12 14:14 sysbench throughput degradation in 4.13+ Eric Farman
2017-09-13  8:24 ` 王金浦
2017-09-22 15:03   ` Eric Farman
2017-09-22 15:53     ` Peter Zijlstra
2017-09-22 16:12       ` Eric Farman
2017-09-27  9:35         ` Peter Zijlstra
2017-09-27 16:27           ` Eric Farman
2017-09-28  9:08             ` Christian Borntraeger
2017-09-27 17:58           ` Rik van Riel
2017-09-28 11:04             ` Eric Farman
2017-09-28 12:36             ` Peter Zijlstra
2017-09-28 12:37             ` Peter Zijlstra
2017-10-02 22:53             ` Matt Fleming
2017-10-03  8:39               ` Peter Zijlstra
2017-10-03 16:02                 ` Rik van Riel
2017-10-04 16:18                 ` Peter Zijlstra [this message]
2017-10-04 18:02                   ` Rik van Riel
2017-10-06 10:36                   ` Matt Fleming
2017-10-10 14:51                     ` Matt Fleming
2017-10-10 15:16                       ` Peter Zijlstra
2017-10-10 17:26                         ` Ingo Molnar
2017-10-10 17:40                           ` Christian Borntraeger
