* sysbench throughput degradation in 4.13+
@ 2017-09-12 14:14 Eric Farman
  2017-09-13  8:24 ` 王金浦
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Farman @ 2017-09-12 14:14 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: linux-kernel, Ingo Molnar, Christian Borntraeger, kvm

Hi Peter, Rik,

Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB 
s390x host, we noticed a throughput degradation (anywhere between 13% 
and 40%, depending on test) when moving the host from kernel 4.12 to 
4.13.  The rest of the host and the entire guest remain unchanged; it is 
only the host kernel that changes.  Bisecting the host kernel blames 
commit 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()").

Reverting 3fed382b46ba and 815abf5af45f ("sched/fair: Remove 
effective_load()") from a clean 4.13.0 build erases the throughput 
degradation and returns us to what we see in 4.12.0.

A little poking around points us to a fix/improvement to this, commit 
90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING"), 
which went in during the 4.14 merge window, and an unmerged fix [1] that 
corrects a small error in that patch.  Hopeful, since we were running 
!NUMA_BALANCING, I applied these two patches to a clean 4.13.0 tree but 
continue to see the performance degradation.  Pulling current master or 
linux-next shows no improvement lurking in the shadows.

Running perf stat on the host during the guest sysbench run shows a 
significant increase in cpu-migrations over the 4.12.0 run, and far fewer 
host CPUs kept busy (the "CPUs" figure is task-clock divided by elapsed 
time).  Abbreviated examples follow:

# 4.12.0
# perf stat -p 11473 -- sleep 5
       62305.199305      task-clock (msec)         #   12.458 CPUs
            368,607      context-switches
              4,084      cpu-migrations
                416      page-faults

# 4.13.0
# perf stat -p 11444 -- sleep 5
       35892.653243      task-clock (msec)         #    7.176 CPUs
            249,251      context-switches
             56,850      cpu-migrations
                804      page-faults

# 4.13.0-revert-3fed382b46ba-and-815abf5af45f
# perf stat -p 11441 -- sleep 5
       62321.767146      task-clock (msec)         #   12.459 CPUs
            387,661      context-switches
              5,687      cpu-migrations
              1,652      page-faults

# 4.13.0-apply-90001d67be2f
# perf stat -p 11438 -- sleep 5
       48654.988291      task-clock (msec)         #    9.729 CPUs
            363,150      context-switches
             43,778      cpu-migrations
                641      page-faults

I'm not sure what doc to supply here and am unfamiliar with this code or 
its recent changes, but I'd be happy to pull/try whatever is needed to 
help debug things.  Looking forward to hearing what I can do.

Thanks,
Eric

[1] https://lkml.org/lkml/2017/9/6/196


* Re: sysbench throughput degradation in 4.13+
  2017-09-12 14:14 sysbench throughput degradation in 4.13+ Eric Farman
@ 2017-09-13  8:24 ` 王金浦
  2017-09-22 15:03   ` Eric Farman
  0 siblings, 1 reply; 22+ messages in thread
From: 王金浦 @ 2017-09-13  8:24 UTC (permalink / raw)
  To: Eric Farman
  Cc: Peter Zijlstra, Rik van Riel, LKML, Ingo Molnar,
	Christian Borntraeger, KVM-ML (kvm@vger.kernel.org),
	vcaputo

2017-09-12 16:14 GMT+02:00 Eric Farman <farman@linux.vnet.ibm.com>:
> Hi Peter, Rik,
>
> Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB
> s390x host, we noticed a throughput degradation (anywhere between 13% and
> 40%, depending on test) when moving the host from kernel 4.12 to 4.13.  The
> rest of the host and the entire guest remain unchanged; it is only the host
> kernel that changes.  Bisecting the host kernel blames commit 3fed382b46ba
> ("sched/numa: Implement NUMA node level wake_affine()").
>
> Reverting 3fed382b46ba and 815abf5af45f ("sched/fair: Remove
> effective_load()") from a clean 4.13.0 build erases the throughput
> degradation and returns us to what we see in 4.12.0.
>
> A little poking around points us to a fix/improvement to this, commit
> 90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING"), which
> went in the 4.14 merge window and an unmerged fix [1] that corrects a small
> error in that patch.  Hopeful, since we were running !NUMA_BALANCING, I
> applied these two patches to a clean 4.13.0 tree but continue to see the
> performance degradation.  Pulling current master or linux-next shows no
> improvement lurking in the shadows.
>
> Running perf stat on the host during the guest sysbench run shows a
> significant increase in cpu-migrations over the 4.12.0 run.  Abbreviated
> examples follow:
>
> # 4.12.0
> # perf stat -p 11473 -- sleep 5
>       62305.199305      task-clock (msec)         #   12.458 CPUs
>            368,607      context-switches
>              4,084      cpu-migrations
>                416      page-faults
>
> # 4.13.0
> # perf stat -p 11444 -- sleep 5
>       35892.653243      task-clock (msec)         #    7.176 CPUs
>            249,251      context-switches
>             56,850      cpu-migrations
>                804      page-faults
>
> # 4.13.0-revert-3fed382b46ba-and-815abf5af45f
> # perf stat -p 11441 -- sleep 5
>       62321.767146      task-clock (msec)         #   12.459 CPUs
>            387,661      context-switches
>              5,687      cpu-migrations
>              1,652      page-faults
>
> # 4.13.0-apply-90001d67be2f
> # perf stat -p 11438 -- sleep 5
>       48654.988291      task-clock (msec)         #    9.729 CPUs
>            363,150      context-switches
>             43,778      cpu-migrations
>                641      page-faults
>
> I'm not sure what doc to supply here and am unfamiliar with this code or its
> recent changes, but I'd be happy to pull/try whatever is needed to help
> debug things.  Looking forward to hearing what I can do.
>
> Thanks,
> Eric
>
> [1] https://lkml.org/lkml/2017/9/6/196
>
+cc: vcaputo@pengaru.com
He also reported a performance degradation on 4.13-rc7; it might have
the same cause.

Best,
Jack


* Re: sysbench throughput degradation in 4.13+
  2017-09-13  8:24 ` 王金浦
@ 2017-09-22 15:03   ` Eric Farman
  2017-09-22 15:53     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Farman @ 2017-09-22 15:03 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato



On 09/13/2017 04:24 AM, 王金浦 wrote:
> 2017-09-12 16:14 GMT+02:00 Eric Farman <farman@linux.vnet.ibm.com>:
>> Hi Peter, Rik,
>>
>> Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB
>> s390x host, we noticed a throughput degradation (anywhere between 13% and
>> 40%, depending on test) when moving the host from kernel 4.12 to 4.13.  The
>> rest of the host and the entire guest remain unchanged; it is only the host
>> kernel that changes.  Bisecting the host kernel blames commit 3fed382b46ba
>> ("sched/numa: Implement NUMA node level wake_affine()").
>>
>> Reverting 3fed382b46ba and 815abf5af45f ("sched/fair: Remove
>> effective_load()") from a clean 4.13.0 build erases the throughput
>> degradation and returns us to what we see in 4.12.0.
>>
>> A little poking around points us to a fix/improvement to this, commit
>> 90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING"), which
>> went in the 4.14 merge window and an unmerged fix [1] that corrects a small
>> error in that patch.  Hopeful, since we were running !NUMA_BALANCING, I
>> applied these two patches to a clean 4.13.0 tree but continue to see the
>> performance degradation.  Pulling current master or linux-next shows no
>> improvement lurking in the shadows.
>>
>> Running perf stat on the host during the guest sysbench run shows a
>> significant increase in cpu-migrations over the 4.12.0 run.  Abbreviated
>> examples follow:
>>
>> # 4.12.0
>> # perf stat -p 11473 -- sleep 5
>>        62305.199305      task-clock (msec)         #   12.458 CPUs
>>             368,607      context-switches
>>               4,084      cpu-migrations
>>                 416      page-faults
>>
>> # 4.13.0
>> # perf stat -p 11444 -- sleep 5
>>        35892.653243      task-clock (msec)         #    7.176 CPUs
>>             249,251      context-switches
>>              56,850      cpu-migrations
>>                 804      page-faults
>>
>> # 4.13.0-revert-3fed382b46ba-and-815abf5af45f
>> # perf stat -p 11441 -- sleep 5
>>        62321.767146      task-clock (msec)         #   12.459 CPUs
>>             387,661      context-switches
>>               5,687      cpu-migrations
>>               1,652      page-faults
>>
>> # 4.13.0-apply-90001d67be2f
>> # perf stat -p 11438 -- sleep 5
>>        48654.988291      task-clock (msec)         #    9.729 CPUs
>>             363,150      context-switches
>>              43,778      cpu-migrations
>>                 641      page-faults
>>
>> I'm not sure what doc to supply here and am unfamiliar with this code or its
>> recent changes, but I'd be happy to pull/try whatever is needed to help
>> debug things.  Looking forward to hearing what I can do.
>>
>> Thanks,
>> Eric
>>
>> [1] https://lkml.org/lkml/2017/9/6/196
>>
> +cc: vcaputo@pengaru.com
> He reported a performance degradation also on 4.13-rc7, it might be
> the same cause.
> 
> Best,
> Jack
> 

Hi Peter, Rik,

With OSS last week, I'm sure this got lost in the deluge, so here's a 
friendly ping.  I picked up 4.14.0-rc1 earlier this week, and still see 
the degradation described above.  Not really a surprise, since I don't 
see any other commits in this area beyond the ones I mentioned in my 
original note.

Anyway, I'm unsure what else to try or what doc to pull to help debug 
this, and would appreciate your expertise here.  We can repro this 
pretty easily as necessary to help get to the bottom of this.

Many thanks in advance,

  - Eric

(also, +cc Matt to help when I'm out of office myself.)


* Re: sysbench throughput degradation in 4.13+
  2017-09-22 15:03   ` Eric Farman
@ 2017-09-22 15:53     ` Peter Zijlstra
  2017-09-22 16:12       ` Eric Farman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2017-09-22 15:53 UTC (permalink / raw)
  To: Eric Farman
  Cc: Rik van Riel, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Fri, Sep 22, 2017 at 11:03:39AM -0400, Eric Farman wrote:
> Hi Peter, Rik,
> 
> With OSS last week, I'm sure this got lost in the deluge, so here's a
> friendly ping.

Very much so, inbox is a giant trainwreck ;-)

> I picked up 4.14.0-rc1 earlier this week, and still see the
> degradation described above.  Not really a surprise, since I don't see any
> other commits in this area beyond the ones I mentioned in my original note.
> 
> Anyway, I'm unsure what else to try or what doc to pull to help debug this,
> and would appreciate your expertise here.  We can repro this pretty easily
> as necessary to help get to the bottom of this.
> 
> Many thanks in advance,

Could you describe your sysbench setup? Are you running it on mysql or
postgresql, what other options?


* Re: sysbench throughput degradation in 4.13+
  2017-09-22 15:53     ` Peter Zijlstra
@ 2017-09-22 16:12       ` Eric Farman
  2017-09-27  9:35         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Farman @ 2017-09-22 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato



On 09/22/2017 11:53 AM, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 11:03:39AM -0400, Eric Farman wrote:
>> Hi Peter, Rik,
>>
>> With OSS last week, I'm sure this got lost in the deluge, so here's a
>> friendly ping.
> 
> Very much so, inbox is a giant trainwreck ;-)

My apologies.  :)

> 
>> I picked up 4.14.0-rc1 earlier this week, and still see the
>> degradation described above.  Not really a surprise, since I don't see any
>> other commits in this area beyond the ones I mentioned in my original note.
>>
>> Anyway, I'm unsure what else to try or what doc to pull to help debug this,
>> and would appreciate your expertise here.  We can repro this pretty easily
>> as necessary to help get to the bottom of this.
>>
>> Many thanks in advance,
> 
> Could you describe your sysbench setup? Are you running it on mysql or
> postgresql, what other options?
> 

MySQL.  We've tried a few different configs with both test=oltp and 
test=threads, but both show the same behavior.  What I have settled on 
for my repro is the following:

sudo sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root \
  --mysql-db=test1 --max-time=180 --max-requests=100000000 --num-threads=8 \
  --test=oltp prepare
sudo sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root \
  --mysql-db=test1 --max-time=180 --max-requests=100000000 --num-threads=8 \
  --test=oltp run

We also have some environments where multiple sysbench instances are run 
concurrently (I've tried up to 8, with databases test1-8 being specified), 
but that doesn't appear to matter much either.

  - Eric


* Re: sysbench throughput degradation in 4.13+
  2017-09-22 16:12       ` Eric Farman
@ 2017-09-27  9:35         ` Peter Zijlstra
  2017-09-27 16:27           ` Eric Farman
  2017-09-27 17:58           ` Rik van Riel
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-09-27  9:35 UTC (permalink / raw)
  To: Eric Farman
  Cc: Rik van Riel, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> 
> MySQL.  We've tried a few different configs with both test=oltp and
> test=threads, but both show the same behavior.  What I have settled on for
> my repro is the following:
> 

Right, didn't even need to run it in a guest to observe a regression.

So the below cures native sysbench and NAS bench for me, does it also
work for your virt thingy?


PRE (current tip/master):

ivb-ex sysbench:

  2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
  5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
 10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
 20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
 40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
 80: [30 secs]     transactions:                        355096 (11834.28 per sec.)

hsw-ex NAS:

OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83


POST (+patch):

ivb-ex sysbench:

  2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
  5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
 10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
 20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
 40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
 80: [30 secs]     transactions:                        631739 (21055.88 per sec.)

hsw-ex NAS:

lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52


This patch takes out all the shiny wake_affine stuff and goes back to
utter basics. Rik, was there another NUMA benchmark that wanted your 
fancy stuff? Because NAS isn't it.

(the previous, slightly fancier wake_affine was basically a !idle
extension of this, in that it would pick the 'shortest' of the 2 queues
and thereby run quickest, in approximation)

I'll try and run a number of other benchmarks I have around to see if
there's anything that shows a difference between the below trivial
wake_affine and the old 2-cpu-load one.
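
Boiled down, the wake_affine() decision after this patch is just the
following; a userspace toy model for illustration only, not the actual
kernel code:

/* Toy model of the simplified wake_affine() decision in the patch below:
 * move the waking task to this_cpu only if this_cpu is idle, or if this
 * is a sync wakeup and the waker is the only runnable task there. */
#include <stdbool.h>
#include <stdio.h>

static bool wake_affine_simple(unsigned int this_nr_running, bool sync)
{
	if (this_nr_running == 0)
		return true;		/* idle_cpu(this_cpu)            */
	if (sync && this_nr_running == 1)
		return true;		/* waker is about to sleep       */
	return false;			/* otherwise: meh, stay put      */
}

int main(void)
{
	printf("idle target, async wakeup -> %d\n", wake_affine_simple(0, false));
	printf("busy target, sync wakeup  -> %d\n", wake_affine_simple(3, true));
	return 0;
}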

---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 125 ++---------------------------------------
 2 files changed, 6 insertions(+), 127 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70ba32e08a23..66930ce338af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5356,115 +5356,19 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
-
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
-{
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
-
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
-
-	return true;
-}
-
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
-static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
-{
-	struct llc_stats prev_stats, this_stats;
-	s64 this_eff_load, prev_eff_load;
-	unsigned long task_load;
-
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
-
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
-	if (sync) {
-		unsigned long current_load = task_h_load(current);
-
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
-			return true;
-
-		this_stats.load -= current_load;
-	}
-
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
-	task_load = task_h_load(p);
-
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
-
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
-
-	return this_eff_load <= prev_eff_load;
-}
-
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
-
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	bool affine = false;
 
 	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
+	 * If we can run _now_ on the waking CPU, go there, otherwise meh.
 	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (idle_cpu(this_cpu))
+		affine = true;
+	else if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		affine = true;
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7600,7 +7504,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7672,22 +7575,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**


* Re: sysbench throughput degradation in 4.13+
  2017-09-27  9:35         ` Peter Zijlstra
@ 2017-09-27 16:27           ` Eric Farman
  2017-09-28  9:08             ` Christian Borntraeger
  2017-09-27 17:58           ` Rik van Riel
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Farman @ 2017-09-27 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato



On 09/27/2017 05:35 AM, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>
>> MySQL.  We've tried a few different configs with both test=oltp and
>> test=threads, but both show the same behavior.  What I have settled on for
>> my repro is the following:
>>
> 
> Right, didn't even need to run it in a guest to observe a regression.
> 
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?
> 

Ran a quick test this morning with 4.13.0 + 90001d67be2f + a731ebe6f17b 
and then with/without this patch.  An oltp sysbench run shows that guest 
cpu migrations decreased significantly, from ~27K to ~2K over 5 seconds.

So, we applied this patch to linux-next (next-20170926) and ran it 
against a couple sysbench tests:

--test=oltp
Baseline:	5655.26 transactions/second
Patched:	9618.13 transactions/second

--test=threads
Baseline:	25482.9 events/sec
Patched:	29577.9 events/sec

That's good!  With that...

Tested-by: Eric Farman <farman@linux.vnet.ibm.com>

Thanks!

  - Eric


* Re: sysbench throughput degradation in 4.13+
  2017-09-27  9:35         ` Peter Zijlstra
  2017-09-27 16:27           ` Eric Farman
@ 2017-09-27 17:58           ` Rik van Riel
  2017-09-28 11:04             ` Eric Farman
                               ` (3 more replies)
  1 sibling, 4 replies; 22+ messages in thread
From: Rik van Riel @ 2017-09-27 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Wed, 27 Sep 2017 11:35:30 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> > 
> > MySQL.  We've tried a few different configs with both test=oltp and
> > test=threads, but both show the same behavior.  What I have settled on for
> > my repro is the following:
> >   
> 
> Right, didn't even need to run it in a guest to observe a regression.
> 
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?
> 
> 
> PRE (current tip/master):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>   5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>  10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>  20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>  40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>  80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
> 
> hsw-ex NAS:
> 
> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
> 
> 
> POST (+patch):
> 
> ivb-ex sysbench:
> 
>   2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>   5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>  10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>  20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>  40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>  80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
> 
> hsw-ex NAS:
> 
> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
> 
> 
> This patch takes out all the shiny wake_affine stuff and goes back to
> utter basics. Rik was there another NUMA benchmark that wanted your
> fancy stuff? Because NAS isn't it.

I like the simplicity of your approach!  I hope it does not break
stuff like netperf...

I have been working on the patch below, which is much less optimistic
about when to do an affine wakeup than before.

It may be worth testing, in case it works better with some workload,
though relying on cached values still makes me somewhat uneasy.

I will try to get kernels tested here that implement both approaches,
to see what ends up working best.
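
For reference, the load comparison the patch below ends up doing boils
down to roughly this; a userspace toy with made-up numbers (including
imbalance_pct), purely to illustrate the bias against migration:

/* Toy model of the min/max effective-load check in the patch: allow the
 * affine wakeup only if the destination's (max) load plus the task is
 * still low enough compared to the source's (min) load minus the task,
 * with the imbalance_pct margin biasing towards staying put. */
#include <stdbool.h>
#include <stdio.h>

static bool allow_affine(long this_max_load, long prev_min_load,
			 long this_capacity, long prev_capacity,
			 long task_load, int imbalance_pct)
{
	long this_eff_load = 100 * prev_capacity * (this_max_load + task_load);
	long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
			     this_capacity * (prev_min_load - task_load);

	return this_eff_load <= prev_eff_load;
}

int main(void)
{
	/* equal capacities; destination LLC busier than the source: stay */
	printf("%d\n", allow_affine(2048, 1024, 1024, 1024, 256, 117));
	/* destination LLC nearly idle: move */
	printf("%d\n", allow_affine(128, 1024, 1024, 1024, 256, 117));
	return 0;
}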

---8<---
Subject: sched: make wake_affine_llc less eager

With the wake_affine_llc logic, tasks get moved around too eagerly,
and then moved back later, leading to poor performance for some
workloads.

Make wake_affine_llc less eager by comparing the minimum load of
the source LLC with the maximum load of the destination LLC, similar
to how source_load and target_load work for regular migration.

Also, get rid of an overly optimistic test that could potentially
pull across a lot of tasks if the target LLC happened to have fewer
runnable tasks at load balancing time.

Conversely, sync wakeups could happen without taking LLC loads
into account, if the waker would leave an idle CPU behind on
the target LLC.

Signed-off-by: Rik van Riel <riel@redhat.com>

---
 include/linux/sched/topology.h |  3 ++-
 kernel/sched/fair.c            | 56 +++++++++++++++++++++++++++++++++---------
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..0c295ff5049b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,7 +77,8 @@ struct sched_domain_shared {
 	 * used by wake_affine().
 	 */
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86195add977f..7740c6776e08 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,23 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
+static void min_max_load(int cpu, unsigned long *min_load,
+		 	 unsigned long *max_load)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long minl = ULONG_MAX;
+	unsigned long maxl = 0;
+	int i;
+
+	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
+		minl = min(minl, rq->cpu_load[i]);
+		maxl = max(maxl, rq->cpu_load[i]);
+	}
+
+	*min_load = minl;
+	*max_load = maxl;
+}
+
 static unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
@@ -5310,7 +5327,8 @@ static int wake_wide(struct task_struct *p)
 
 struct llc_stats {
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 	int		has_capacity;
 };
@@ -5323,7 +5341,8 @@ static bool get_llc_stats(struct llc_stats *stats, int cpu)
 		return false;
 
 	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
+	stats->min_load		= READ_ONCE(sds->min_load);
+	stats->max_load		= READ_ONCE(sds->max_load);
 	stats->capacity		= READ_ONCE(sds->capacity);
 	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
 
@@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 		unsigned long current_load = task_h_load(current);
 
 		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_stats.max_load)
+			return true;
+
+		/* allow if the CPU would go idle, regardless of LLC load */
+		if (current_load >= target_load(this_cpu, sd->wake_idx))
 			return true;
 
-		this_stats.load -= current_load;
+		this_stats.max_load -= current_load;
 	}
 
 	/*
@@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
 		return false;
 
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
 	/*
 	 * Check to see if we can move the load without causing too much
 	 * imbalance.
@@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
 	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= this_stats.capacity;
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	this_eff_load *= this_stats.max_load + task_load;
+	prev_eff_load *= prev_stats.min_load - task_load;
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -7033,6 +7052,8 @@ enum group_type {
 struct sg_lb_stats {
 	unsigned long avg_load; /*Avg load across the CPUs of the group */
 	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long min_load;
+	unsigned long max_load;
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
@@ -7059,6 +7080,8 @@ struct sd_lb_stats {
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
+	unsigned long min_load;		/* Sum of lowest loadavg on CPUs */
+	unsigned long max_load;		/* Sum of highest loadavg on CPUs */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
@@ -7077,6 +7100,8 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_running = 0UL,
 		.total_load = 0UL,
+		.min_load = 0UL,
+		.max_load = 0UL,
 		.total_capacity = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -7358,7 +7383,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			int local_group, struct sg_lb_stats *sgs,
 			bool *overload)
 {
-	unsigned long load;
+	unsigned long load, min_load, max_load;
 	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
@@ -7372,7 +7397,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		else
 			load = source_load(i, load_idx);
 
+		min_max_load(i, &min_load, &max_load);
+
 		sgs->group_load += load;
+		sgs->min_load += min_load;
+		sgs->max_load += max_load;
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
@@ -7569,6 +7598,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_running += sgs->sum_nr_running;
 		sds->total_load += sgs->group_load;
+		sds->min_load += sgs->min_load;
+		sds->max_load += sgs->max_load;
 		sds->total_capacity += sgs->group_capacity;
 
 		sg = sg->next;
@@ -7596,7 +7627,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	 * XXX fix that.
 	 */
 	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
+	WRITE_ONCE(shared->min_load,	sds->min_load);
+	WRITE_ONCE(shared->max_load,	sds->max_load);
 	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 


-- 
All Rights Reversed


* Re: sysbench throughput degradation in 4.13+
  2017-09-27 16:27           ` Eric Farman
@ 2017-09-28  9:08             ` Christian Borntraeger
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-09-28  9:08 UTC (permalink / raw)
  To: Eric Farman, Peter Zijlstra
  Cc: Rik van Riel, 王金浦,
	LKML, Ingo Molnar, KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On 09/27/2017 06:27 PM, Eric Farman wrote:
> 
> 
> On 09/27/2017 05:35 AM, Peter Zijlstra wrote:
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL.  We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior.  What I have settled on for
>>> my repro is the following:
>>>
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
[...]

> --test=oltp
> Baseline:    5655.26 transactions/second
> Patched:    9618.13 transactions/second
> 
> --test=threads
> Baseline:    25482.9 events/sec
> Patched:    29577.9 events/sec
> 
> That's good!  With that...
> 
> Tested-by: Eric Farman <farman@linux.vnet.ibm.com>
> 
> Thanks!
> 
>  - Eric

Assuming that we settle on this or Rik's alternative patch, are we going
to schedule it for 4.13 stable as well?


* Re: sysbench throughput degradation in 4.13+
  2017-09-27 17:58           ` Rik van Riel
@ 2017-09-28 11:04             ` Eric Farman
  2017-09-28 12:36             ` Peter Zijlstra
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Eric Farman @ 2017-09-28 11:04 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra
  Cc: 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato



On 09/27/2017 01:58 PM, Rik van Riel wrote:
> On Wed, 27 Sep 2017 11:35:30 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL.  We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior.  What I have settled on for
>>> my repro is the following:
>>>    
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
>>
>>
>> PRE (current tip/master):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
>>    5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
>>   10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
>>   20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
>>   40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
>>   80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
>>
>> hsw-ex NAS:
>>
>> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
>> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
>> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
>> lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
>> lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
>> lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
>>
>>
>> POST (+patch):
>>
>> ivb-ex sysbench:
>>
>>    2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
>>    5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
>>   10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
>>   20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
>>   40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
>>   80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
>>
>> hsw-ex NAS:
>>
>> lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
>> lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
>> lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
>>
>>
>> This patch takes out all the shiny wake_affine stuff and goes back to
>> utter basics. Rik was there another NUMA benchmark that wanted your
>> fancy stuff? Because NAS isn't it.
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.
> 
> It may be worth testing, in case it works better with some workload,
> though relying on cached values still makes me somewhat uneasy.
> 

Here are numbers for our environment, to compare the two patches:

sysbench --test=threads:
next-20170926:		25470.8
-with-Peters-patch:	29559.1
-with-Riks-patch:	29283

sysbench --test=oltp:
next-20170926:		5722.37
-with-Peters-patch:	9623.45
-with-Riks-patch:	9360.59

We didn't record host cpu migrations in all scenarios, but a spot check 
showed a similar reduction with both patches.

  - Eric


* Re: sysbench throughput degradation in 4.13+
  2017-09-27 17:58           ` Rik van Riel
  2017-09-28 11:04             ` Eric Farman
@ 2017-09-28 12:36             ` Peter Zijlstra
  2017-09-28 12:37             ` Peter Zijlstra
  2017-10-02 22:53             ` Matt Fleming
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-09-28 12:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...

So the old approach that looks at the weight of the two CPUs behaves
slightly better in the overloaded case. On the threads==nr_cpus load
points they match fairly evenly.

I seem to have misplaced my netperf scripts, but I'll have a play with
it.


* Re: sysbench throughput degradation in 4.13+
  2017-09-27 17:58           ` Rik van Riel
  2017-09-28 11:04             ` Eric Farman
  2017-09-28 12:36             ` Peter Zijlstra
@ 2017-09-28 12:37             ` Peter Zijlstra
  2017-10-02 22:53             ` Matt Fleming
  3 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-09-28 12:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  		unsigned long current_load = task_h_load(current);
>  
>  		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>  			return true;
>  
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>  	}
>  
>  	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>  		return false;
>  
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>  	/*
>  	 * Check to see if we can move the load without causing too much
>  	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>  	prev_eff_load *= this_stats.capacity;
>  
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
>  
>  	return this_eff_load <= prev_eff_load;
>  }

So I would really like a workload that needs this LLC/NUMA stuff,
because I much prefer the simpler 'on which of these two CPUs can I
run soonest' approach.
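
To make 'run soonest' concrete, that check amounts to the sketch below
(a condensed restatement of the WA_IDLE logic in the patch further down
this thread; the waking CPU's idle state and runqueue length are plain
parameters here instead of idle_cpu()/cpu_rq() reads):

/*
 * Sketch of the 'which of the two CPUs can run the waking task soonest'
 * idea; the real code is wake_affine_idle() in the patch later in this
 * thread.
 */
static bool can_run_soonest_on_waking_cpu(bool this_cpu_idle,
					  unsigned int this_nr_running,
					  int sync)
{
	/* The waking CPU is idle: the task can run there immediately. */
	if (this_cpu_idle)
		return true;

	/*
	 * Sync wakeup: the waker is about to block, so a runqueue length
	 * of one means the waking CPU is effectively about to go idle.
	 */
	if (sync && this_nr_running == 1)
		return true;

	/* Otherwise leave the task on its previous CPU. */
	return false;
}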

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-09-27 17:58           ` Rik van Riel
                               ` (2 preceding siblings ...)
  2017-09-28 12:37             ` Peter Zijlstra
@ 2017-10-02 22:53             ` Matt Fleming
  2017-10-03  8:39               ` Peter Zijlstra
  3 siblings, 1 reply; 22+ messages in thread
From: Matt Fleming @ 2017-10-02 22:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> 
> I like the simplicity of your approach!  I hope it does not break
> stuff like netperf...
> 
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.

Running netperf for this patch and Peter's patch shows that Peter's
comes out on top, with scores pretty close to v4.12 in most places on
my 2-NUMA node 48-CPU Xeon box.

I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
but I will next week.

netperf-tcp
                                4.12.0                 4.13.0                 4.13.0                 4.13.0
                               default                default                peterz+                  riel+
Min       64        1653.72 (   0.00%)     1554.29 (  -6.01%)     1601.71 (  -3.15%)     1627.01 (  -1.62%)
Min       128       3240.96 (   0.00%)     2858.49 ( -11.80%)     3122.62 (  -3.65%)     3063.67 (  -5.47%)
Min       256       5840.55 (   0.00%)     4077.43 ( -30.19%)     5529.98 (  -5.32%)     5362.18 (  -8.19%)
Min       1024     16812.11 (   0.00%)    11899.72 ( -29.22%)    16335.83 (  -2.83%)    15075.24 ( -10.33%)
Min       2048     26875.79 (   0.00%)    18852.35 ( -29.85%)    25902.85 (  -3.62%)    22804.82 ( -15.15%)
Min       3312     33060.18 (   0.00%)    20984.28 ( -36.53%)    32817.82 (  -0.73%)    29161.49 ( -11.79%)
Min       4096     34513.24 (   0.00%)    23253.94 ( -32.62%)    34167.80 (  -1.00%)    29349.09 ( -14.96%)
Min       8192     39836.88 (   0.00%)    28881.63 ( -27.50%)    39613.28 (  -0.56%)    35307.95 ( -11.37%)
Min       16384    44203.84 (   0.00%)    31616.74 ( -28.48%)    43608.86 (  -1.35%)    38130.44 ( -13.74%)
Hmean     64        1686.58 (   0.00%)     1613.25 (  -4.35%)     1657.04 (  -1.75%)     1655.38 (  -1.85%)
Hmean     128       3361.84 (   0.00%)     2945.34 ( -12.39%)     3173.47 (  -5.60%)     3122.38 (  -7.12%)
Hmean     256       5993.92 (   0.00%)     4423.32 ( -26.20%)     5618.26 (  -6.27%)     5523.72 (  -7.84%)
Hmean     1024     17225.83 (   0.00%)    12314.23 ( -28.51%)    16574.85 (  -3.78%)    15644.71 (  -9.18%)
Hmean     2048     27944.22 (   0.00%)    21301.63 ( -23.77%)    26395.38 (  -5.54%)    24067.57 ( -13.87%)
Hmean     3312     33760.48 (   0.00%)    22361.07 ( -33.77%)    33198.32 (  -1.67%)    30055.64 ( -10.97%)
Hmean     4096     35077.74 (   0.00%)    29153.73 ( -16.89%)    34479.40 (  -1.71%)    31215.64 ( -11.01%)
Hmean     8192     40674.31 (   0.00%)    33493.01 ( -17.66%)    40443.22 (  -0.57%)    37298.58 (  -8.30%)
Hmean     16384    45492.12 (   0.00%)    37177.64 ( -18.28%)    44308.62 (  -2.60%)    40728.33 ( -10.47%)
Max       64        1745.95 (   0.00%)     1649.03 (  -5.55%)     1710.52 (  -2.03%)     1702.65 (  -2.48%)
Max       128       3509.96 (   0.00%)     3082.35 ( -12.18%)     3204.19 (  -8.71%)     3174.41 (  -9.56%)
Max       256       6138.35 (   0.00%)     4687.62 ( -23.63%)     5694.52 (  -7.23%)     5722.08 (  -6.78%)
Max       1024     17732.13 (   0.00%)    13270.42 ( -25.16%)    16838.69 (  -5.04%)    16580.18 (  -6.50%)
Max       2048     28907.99 (   0.00%)    24816.39 ( -14.15%)    26792.86 (  -7.32%)    25003.60 ( -13.51%)
Max       3312     34512.60 (   0.00%)    23510.32 ( -31.88%)    33762.47 (  -2.17%)    31676.54 (  -8.22%)
Max       4096     35918.95 (   0.00%)    35245.77 (  -1.87%)    34866.23 (  -2.93%)    32537.07 (  -9.42%)
Max       8192     41749.55 (   0.00%)    41068.52 (  -1.63%)    42164.53 (   0.99%)    40105.50 (  -3.94%)
Max       16384    47234.74 (   0.00%)    41728.66 ( -11.66%)    45387.40 (  -3.91%)    44107.25 (  -6.62%)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-02 22:53             ` Matt Fleming
@ 2017-10-03  8:39               ` Peter Zijlstra
  2017-10-03 16:02                 ` Rik van Riel
  2017-10-04 16:18                 ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-10-03  8:39 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > 
> > I like the simplicity of your approach!  I hope it does not break
> > stuff like netperf...
> > 
> > I have been working on the patch below, which is much less optimistic
> > about when to do an affine wakeup than before.
> 
> Running netperf for this patch and Peter's patch shows that Peter's
> comes out on top, with scores pretty close to v4.12 in most places on
> my 2-NUMA node 48-CPU Xeon box.
> 
> I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> but I will next week.

So I was waiting for Rik, who promised to run a bunch of NUMA workloads
over the weekend.

The trivial thing regresses a wee bit on the overloaded case, I've not
yet tried to fix it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-03  8:39               ` Peter Zijlstra
@ 2017-10-03 16:02                 ` Rik van Riel
  2017-10-04 16:18                 ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2017-10-03 16:02 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: Eric Farman, 王金浦,
	LKML, jhladky, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]

On Tue, 2017-10-03 at 10:39 +0200, Peter Zijlstra wrote:
> On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> > On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > > 
> > > I like the simplicity of your approach!  I hope it does not break
> > > stuff like netperf...
> > > 
> > > I have been working on the patch below, which is much less
> > > optimistic
> > > about when to do an affine wakeup than before.
> > 
> > Running netperf for this patch and Peter's patch shows that Peter's
> > comes out on top, with scores pretty close to v4.12 in most places
> > on
> > my 2-NUMA node 48-CPU Xeon box.
> > 
> > I haven't dug any further into why v4.13-peterz+ is worse than
> > v4.12,
> > but I will next week.
> 
> So I was waiting for Rik, who promised to run a bunch of NUMA
> workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit on the overloaded case, I've
> not
> yet tried to fix it.

In Jirka's tests, your simple patch also came out
on top.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-03  8:39               ` Peter Zijlstra
  2017-10-03 16:02                 ` Rik van Riel
@ 2017-10-04 16:18                 ` Peter Zijlstra
  2017-10-04 18:02                   ` Rik van Riel
  2017-10-06 10:36                   ` Matt Fleming
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-10-04 16:18 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
> 
> The trivial thing regresses a wee bit on the overloaded case, I've not
> yet tried to fix it.

WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
addition -- based on the old scheme -- that I've tried in order to lift
the overloaded case (including hackbench).

It's not an unconditional win, but I'm tempted to default enable
WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

But let me first write a Changelog for the below and queue that. Then we
can maybe run more things.



On my IVB-EP (2 nodes, 10 cores/node, 2 threads/core):


WA_IDLE && NO_WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       7.391856936 seconds time elapsed                                          ( +-  0.66% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 54524.6
TCP_SENDFILE-10 : Avg: 48185.2
TCP_SENDFILE-20 : Avg: 29031.2
TCP_SENDFILE-40 : Avg: 9819.72
TCP_SENDFILE-80 : Avg: 5355.3
TCP_STREAM-1 : Avg: 41448.3
TCP_STREAM-10 : Avg: 24123.2
TCP_STREAM-20 : Avg: 15834.5
TCP_STREAM-40 : Avg: 5583.91
TCP_STREAM-80 : Avg: 2329.66
TCP_RR-1 : Avg: 80473.5
TCP_RR-10 : Avg: 72660.5
TCP_RR-20 : Avg: 52607.1
TCP_RR-40 : Avg: 57199.2
TCP_RR-80 : Avg: 25330.3
UDP_RR-1 : Avg: 108266
UDP_RR-10 : Avg: 95480
UDP_RR-20 : Avg: 68770.8
UDP_RR-40 : Avg: 76231
UDP_RR-80 : Avg: 34578.3
UDP_STREAM-1 : Avg: 64684.3
UDP_STREAM-10 : Avg: 52701.2
UDP_STREAM-20 : Avg: 30376.4
UDP_STREAM-40 : Avg: 15685.8
UDP_STREAM-80 : Avg: 8415.13
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok
  2: [30 secs]     transactions:                        64057  (2135.17 per sec.)
  5: [30 secs]     transactions:                        144295 (4809.68 per sec.)
 10: [30 secs]     transactions:                        274768 (9158.59 per sec.)
 20: [30 secs]     transactions:                        437140 (14570.70 per sec.)
 40: [30 secs]     transactions:                        663949 (22130.56 per sec.)
 80: [30 secs]     transactions:                        629927 (20995.56 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        50389  (1679.58 per sec.)
  5: [30 secs]     transactions:                        113934 (3797.69 per sec.)
 10: [30 secs]     transactions:                        217606 (7253.22 per sec.)
 20: [30 secs]     transactions:                        335021 (11166.75 per sec.)
 40: [30 secs]     transactions:                        518355 (17277.28 per sec.)
 80: [30 secs]     transactions:                        513424 (17112.44 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 2
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 4
        min=0, max=86
avg worker transfer: 190227.78 ops/sec 743.08KB/s
rps: 1004.94 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.58 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.40 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.27 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.96 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.78 p95 (usec) 9552 p99 (usec) 9552 p95/cputime 31.84% p99/cputime 31.84%
rps: 1220.04 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.82 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1243.88 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 227584 p99 (usec) 239360 p95/cputime 758.61% p99/cputime 797.87%
Latency percentiles (usec)
        50.0000th: 62
        75.0000th: 101
        90.0000th: 108
        95.0000th: 112
        *99.0000th: 119
        99.5000th: 124
        99.9000th: 4920
        min=0, max=12987
Throughput 664.328 MB/sec  2 clients  2 procs  max_latency=0.076 ms
Throughput 1573.72 MB/sec  5 clients  5 procs  max_latency=0.102 ms
Throughput 2948.7 MB/sec  10 clients  10 procs  max_latency=0.198 ms
Throughput 4602.38 MB/sec  20 clients  20 procs  max_latency=1.712 ms
Throughput 9253.17 MB/sec  40 clients  40 procs  max_latency=2.047 ms
Throughput 8056.01 MB/sec  80 clients  80 procs  max_latency=35.819 ms


-----------------------

WA_IDLE && WA_WEIGHT:


 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 10000' (10 runs):

       6.500797532 seconds time elapsed                                          ( +-  0.97% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1 : Avg: 52224.3
TCP_SENDFILE-10 : Avg: 46504.3
TCP_SENDFILE-20 : Avg: 28610.3
TCP_SENDFILE-40 : Avg: 9253.12
TCP_SENDFILE-80 : Avg: 4687.4
TCP_STREAM-1 : Avg: 42254
TCP_STREAM-10 : Avg: 25847.9
TCP_STREAM-20 : Avg: 18374.4
TCP_STREAM-40 : Avg: 5599.57
TCP_STREAM-80 : Avg: 2726.41
TCP_RR-1 : Avg: 82638.8
TCP_RR-10 : Avg: 73265.1
TCP_RR-20 : Avg: 52634.5
TCP_RR-40 : Avg: 56302.3
TCP_RR-80 : Avg: 26867.9
UDP_RR-1 : Avg: 107844
UDP_RR-10 : Avg: 95245.2
UDP_RR-20 : Avg: 68673.7
UDP_RR-40 : Avg: 75419.1
UDP_RR-80 : Avg: 35639.1
UDP_STREAM-1 : Avg: 66606
UDP_STREAM-10 : Avg: 52959.5
UDP_STREAM-20 : Avg: 29704
UDP_STREAM-40 : Avg: 15266.5
UDP_STREAM-80 : Avg: 7388.97
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqldNo directory, logging in with HOME=/. ok 
  2: [30 secs]     transactions:                        64277  (2142.51 per sec.)
  5: [30 secs]     transactions:                        144010 (4800.19 per sec.)
 10: [30 secs]     transactions:                        274722 (9157.05 per sec.)
 20: [30 secs]     transactions:                        436325 (14543.55 per sec.)
 40: [30 secs]     transactions:                        665582 (22184.82 per sec.)
 80: [30 secs]     transactions:                        657185 (21904.18 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]     transactions:                        51153  (1705.06 per sec.)
  5: [30 secs]     transactions:                        116403 (3879.93 per sec.)
 10: [30 secs]     transactions:                        217750 (7258.06 per sec.)
 20: [30 secs]     transactions:                        336619 (11220.00 per sec.)
 40: [30 secs]     transactions:                        520823 (17359.78 per sec.)
 80: [30 secs]     transactions:                        516690 (17221.16 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0000th: 3
        75.0000th: 3
        90.0000th: 3
        95.0000th: 3
        *99.0000th: 3
        99.5000th: 3
        99.9000th: 5
        min=0, max=86
avg worker transfer: 185378.92 ops/sec 724.14KB/s
rps: 1004.82 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.51 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.38 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.23 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.89 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.73 p95 (usec) 9520 p99 (usec) 9552 p95/cputime 31.73% p99/cputime 31.84%
rps: 1220.05 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.85 p95 (usec) 14960 p99 (usec) 14960 p95/cputime 49.87% p99/cputime 49.87%
rps: 1243.86 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 213760 p99 (usec) 225024 p95/cputime 712.53% p99/cputime 750.08%
Latency percentiles (usec)
        50.0000th: 66
        75.0000th: 101
        90.0000th: 107
        95.0000th: 112
        *99.0000th: 120
        99.5000th: 126
        99.9000th: 390
        min=0, max=12964
Throughput 678.413 MB/sec  2 clients  2 procs  max_latency=0.105 ms
Throughput 1589.98 MB/sec  5 clients  5 procs  max_latency=0.084 ms
Throughput 3012.51 MB/sec  10 clients  10 procs  max_latency=0.262 ms
Throughput 4555.93 MB/sec  20 clients  20 procs  max_latency=0.515 ms
Throughput 8496.23 MB/sec  40 clients  40 procs  max_latency=2.040 ms
Throughput 8601.62 MB/sec  80 clients  80 procs  max_latency=2.712 ms


---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 131 ++++++++++++-----------------------------
 kernel/sched/features.h        |   2 +
 3 files changed, 39 insertions(+), 102 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..a1a6b6f52660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5638,91 +5638,60 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
+/*
+ * The purpose of wake_affine() is to quickly determine on which CPU we can run
+ * soonest. For the purpose of speed we only consider the waking and previous
+ * CPU.
+ *
+ * wake_affine_idle() - only considers 'now', it checks if the waking CPU is (or
+ * 			will be) idle.
+ *
+ * wake_affine_weight() - considers the weight to reflect the average
+ * 			  scheduling latency of the CPUs. This seems to work
+ * 			  for the overloaded case.
+ */
 
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
+static bool
+wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
+		 int this_cpu, int prev_cpu, int sync)
 {
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
+	if (idle_cpu(this_cpu))
+		return true;
 
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
+	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		return true;
 
-	return true;
+	return false;
 }
 
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
 static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
+wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
+		   int this_cpu, int prev_cpu, int sync)
 {
-	struct llc_stats prev_stats, this_stats;
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
+	this_eff_load = target_load(this_cpu, sd->wake_idx);
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_eff_load)
 			return true;
 
-		this_stats.load -= current_load;
+		this_eff_load -= current_load;
 	}
 
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
 	task_load = task_h_load(p);
 
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
+	this_eff_load += task_load;
+	this_eff_load *= 100;
+	this_eff_load *= capacity_of(prev_cpu);
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	prev_eff_load -= task_load;
+	prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -5731,22 +5700,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
+	bool affine = false;
 
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	if (sched_feat(WA_IDLE) && !affine)
+		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
 
-	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
-	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && !affine)
+		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7895,7 +7855,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7967,22 +7926,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..d40d33ec935f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(WA_IDLE, true)
+SCHED_FEAT(WA_WEIGHT, false)
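
As a reading aid (not part of the patch itself), the wake_affine_weight()
comparison above boils down to the arithmetic in the sketch below.  The
load, capacity and task-load numbers are made up, the sync special case
is dropped, and imbalance_pct = 117 is only an assumed value; the point
is how the halved imbalance_pct margin biases the decision toward leaving
the task on prev_cpu unless the waking CPU is clearly less loaded.

/*
 * Userspace sketch of the wake_affine_weight() comparison, with made-up
 * numbers.  target_load()/source_load(), task_h_load() and capacity_of()
 * are replaced by plain parameters; the sync handling is omitted.
 */
#include <stdbool.h>
#include <stdio.h>

static bool wake_affine_weight_sketch(long this_load, long prev_load,
				      long task_load, long this_cap,
				      long prev_cap, int imbalance_pct)
{
	long this_eff_load, prev_eff_load;

	/* load the waking CPU would carry, scaled by prev_cpu's capacity */
	this_eff_load = (this_load + task_load) * 100 * prev_cap;

	/* load prev_cpu would keep, inflated by half the imbalance_pct margin */
	prev_eff_load = (prev_load - task_load) *
			(100 + (imbalance_pct - 100) / 2) * this_cap;

	return this_eff_load <= prev_eff_load;	/* true => do the affine wakeup */
}

int main(void)
{
	/* equal capacities, so this reduces to 95000 vs 48600: stay on prev_cpu */
	printf("this=800 prev=600 -> affine=%d\n",
	       wake_affine_weight_sketch(800, 600, 150, 1024, 1024, 117));

	/* 35000 vs 48600: pull the task over to the waking CPU */
	printf("this=200 prev=600 -> affine=%d\n",
	       wake_affine_weight_sketch(200, 600, 150, 1024, 1024, 117));
	return 0;
}

With CONFIG_SCHED_DEBUG the WA_IDLE/WA_WEIGHT switches themselves can be
flipped at runtime through the sched_features debugfs file, which makes it
easy to compare the two heuristics on a live system without rebuilding.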

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-04 16:18                 ` Peter Zijlstra
@ 2017-10-04 18:02                   ` Rik van Riel
  2017-10-06 10:36                   ` Matt Fleming
  1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2017-10-04 18:02 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

[-- Attachment #1: Type: text/plain, Size: 1679 bytes --]

On Wed, 2017-10-04 at 18:18 +0200, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA
> > workloads
> > over the weekend.
> > 
> > The trivial thing regresses a wee bit on the overloaded case, I've
> > not
> > yet tried to fix it.
> 
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to
> lift
> the overloaded case (including hackbench).
> 
> It's not an unconditional win, but I'm tempted to default enable
> WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

Enabling both makes sense to me.

We have four cases to deal with:
- mostly idle system, in that case we don't really care,
  since select_idle_sibling will find an idle core anywhere
- partially loaded system (say 1/2 or 2/3 full), in that case
  WA_IDLE will be a great policy to help locate an idle CPU
- fully loaded system, in this case either policy works well
- overloaded system, in this case WA_WEIGHT seems to do the
  trick, assuming load balancing results in largely similar
  loads between cores inside each LLC

The big danger is affine wakeups disturbing the balance
the load balancer works on, with the two mechanisms
undoing each other's placement.

However, there seems to be very little we can actually
do about that, without the unacceptable overhead of
examining the instantaneous loads on every CPU in an
LLC - otherwise we end up either overshooting, or not
taking advantage of idle CPUs, due to the use of cached
load values.
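
As a reading aid, the ordering of the two checks maps onto these cases
roughly as in the sketch below (a condensed restatement of the
wake_affine() code in Peter's patch; select_idle_sibling(), the sync
handling and the capacity scaling are left out):

/*
 * How the WA_IDLE -> WA_WEIGHT ordering lines up with the four cases
 * above.  Idleness and loads are plain parameters instead of rq reads.
 */
static bool wake_affine_case_sketch(bool this_cpu_idle, long this_load,
				    long prev_load, long task_load,
				    int imbalance_pct)
{
	/*
	 * Case 1 hardly matters (select_idle_sibling() finds an idle core
	 * either way); case 2 (partially loaded) is where WA_IDLE helps:
	 * the waking CPU is idle, so do the affine wakeup straight away.
	 */
	if (this_cpu_idle)
		return true;

	/*
	 * Cases 3 and 4 (fully loaded / overloaded): fall through to
	 * WA_WEIGHT, which compares effective loads, so roughly balanced
	 * runqueues inside the LLC keep placements stable instead of
	 * piling work onto the waker.
	 */
	return (this_load + task_load) * 100 <=
	       (prev_load - task_load) * (100 + (imbalance_pct - 100) / 2);
}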

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-04 16:18                 ` Peter Zijlstra
  2017-10-04 18:02                   ` Rik van Riel
@ 2017-10-06 10:36                   ` Matt Fleming
  2017-10-10 14:51                     ` Matt Fleming
  1 sibling, 1 reply; 22+ messages in thread
From: Matt Fleming @ 2017-10-06 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Wed, 04 Oct, at 06:18:50PM, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> > over the weekend.
> > 
> > The trivial thing regresses a wee bit on the overloaded case, I've not
> > yet tried to fix it.
> 
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to lift
> the overloaded case (including hackbench).

My results (2 nodes, 12 cores/node, 2 threads/core) show that you've
pretty much restored hackbench performance to v4.12. However, it's a
regression against v4.13 for hackbench-process-pipes (I'm guessing the
v4.13 improvement is due to Rik's original patches).

The last two result columns are your latest patch with NO_WA_WEIGHT
and then with WA_WEIGHT enabled.

(I hope you've all got wide terminals)

hackbench-process-pipes
                              4.12.0                 4.13.0                 4.13.0                 4.13.0                 4.13.0                 4.13.0
                             vanilla                vanilla             peterz-fix                rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean     1        1.1600 (   0.00%)      1.6037 ( -38.25%)      1.0727 (   7.53%)      1.0200 (  12.07%)      1.0837 (   6.58%)      1.1110 (   4.22%)
Amean     4        2.4207 (   0.00%)      2.2300 (   7.88%)      2.0520 (  15.23%)      1.9483 (  19.51%)      2.0623 (  14.80%)      2.2807 (   5.78%)
Amean     7        5.4140 (   0.00%)      3.2027 (  40.84%)      3.6100 (  33.32%)      3.5620 (  34.21%)      3.5590 (  34.26%)      4.6573 (  13.98%)
Amean     12       9.7130 (   0.00%)      4.7170 (  51.44%)      6.5280 (  32.79%)      6.2063 (  36.10%)      6.5670 (  32.39%)     10.5440 (  -8.56%)
Amean     21      11.6687 (   0.00%)      8.8073 (  24.52%)     14.4997 ( -24.26%)     10.2700 (  11.99%)     14.3187 ( -22.71%)     11.5417 (   1.09%)
Amean     30      14.6410 (   0.00%)     11.7003 (  20.09%)     23.7660 ( -62.32%)     13.9847 (   4.48%)     21.8957 ( -49.55%)     14.4847 (   1.07%)
Amean     48      19.8780 (   0.00%)     17.0317 (  14.32%)     37.6397 ( -89.35%)     19.7577 (   0.61%)     39.2110 ( -97.26%)     20.3293 (  -2.27%)
Amean     79      46.4200 (   0.00%)     27.1180 (  41.58%)     58.4037 ( -25.82%)     35.5537 (  23.41%)     60.8957 ( -31.18%)     49.7470 (  -7.17%)
Amean     110     57.7550 (   0.00%)     42.7013 (  26.06%)     73.0483 ( -26.48%)     48.8880 (  15.35%)     77.8597 ( -34.81%)     61.9353 (  -7.24%)
Amean     141     61.0490 (   0.00%)     48.0857 (  21.23%)     98.5567 ( -61.44%)     63.2187 (  -3.55%)     90.4857 ( -48.22%)     68.3100 ( -11.89%)
Amean     172     70.5180 (   0.00%)     59.3620 (  15.82%)    122.5423 ( -73.77%)     76.0197 (  -7.80%)    127.4023 ( -80.67%)     75.8233 (  -7.52%)
Amean     192     76.1643 (   0.00%)     65.1613 (  14.45%)    142.1393 ( -86.62%)     91.4923 ( -20.12%)    145.0663 ( -90.46%)     80.5867 (  -5.81%)

But things look pretty good for hackbench-process-sockets:

hackbench-process-sockets
                              4.12.0                 4.13.0                 4.13.0                 4.13.0                 4.13.0                 4.13.0
                             vanilla                vanilla             peterz-fix                rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean     1        0.9657 (   0.00%)      1.0850 ( -12.36%)      1.3737 ( -42.25%)      1.3093 ( -35.59%)      1.3220 ( -36.90%)      1.3937 ( -44.32%)
Amean     4        2.3040 (   0.00%)      3.3840 ( -46.88%)      2.1807 (   5.35%)      2.3010 (   0.13%)      2.2070 (   4.21%)      2.1770 (   5.51%)
Amean     7        4.5467 (   0.00%)      4.0787 (  10.29%)      5.0530 ( -11.14%)      3.7427 (  17.68%)      4.5517 (  -0.11%)      3.8560 (  15.19%)
Amean     12       5.7707 (   0.00%)      5.4440 (   5.66%)     10.5680 ( -83.13%)      7.7240 ( -33.85%)     10.5990 ( -83.67%)      5.9123 (  -2.45%)
Amean     21       8.9387 (   0.00%)      9.5850 (  -7.23%)     18.3103 (-104.84%)     10.9253 ( -22.23%)     18.1540 (-103.10%)      9.2627 (  -3.62%)
Amean     30      13.1243 (   0.00%)     14.0773 (  -7.26%)     25.6563 ( -95.49%)     15.7590 ( -20.07%)     25.6920 ( -95.76%)     14.6523 ( -11.64%)
Amean     48      25.1030 (   0.00%)     22.5233 (  10.28%)     40.5937 ( -61.71%)     24.6727 (   1.71%)     40.6357 ( -61.88%)     22.1807 (  11.64%)
Amean     79      39.9150 (   0.00%)     33.4220 (  16.27%)     66.3343 ( -66.19%)     40.2713 (  -0.89%)     65.8543 ( -64.99%)     35.3360 (  11.47%)
Amean     110     49.1700 (   0.00%)     46.1173 (   6.21%)     92.3153 ( -87.75%)     55.6567 ( -13.19%)     92.0567 ( -87.22%)     46.7280 (   4.97%)
Amean     141     59.3157 (   0.00%)     57.2670 (   3.45%)    118.5863 ( -99.92%)     70.4800 ( -18.82%)    118.6013 ( -99.95%)     57.8247 (   2.51%)
Amean     172     69.8163 (   0.00%)     68.2817 (   2.20%)    145.7583 (-108.77%)     83.0167 ( -18.91%)    144.4477 (-106.90%)     68.4457 (   1.96%)
Amean     192     75.9913 (   0.00%)     76.0503 (  -0.08%)    159.8487 (-110.35%)     91.0133 ( -19.77%)    159.6793 (-110.13%)     76.2690 (  -0.37%)

It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
regress but performance is restored for sockets.

Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
patch. So I've queued that up now and it should be done by tomorrow.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-06 10:36                   ` Matt Fleming
@ 2017-10-10 14:51                     ` Matt Fleming
  2017-10-10 15:16                       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Matt Fleming @ 2017-10-10 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> 
> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> regress but performance is restored for sockets.
> 
> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> patch. So I've queued that up now and it should be done by tomorrow.

Yeah, netperf results look fine for either your NO_WA_WEIGHT or
WA_WEIGHT patch.

Any ETA on when this is going to tip?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-10 14:51                     ` Matt Fleming
@ 2017-10-10 15:16                       ` Peter Zijlstra
  2017-10-10 17:26                         ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2017-10-10 15:16 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato

On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> > 
> > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > regress but performance is restored for sockets.
> > 
> > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > patch. So I've queued that up now and it should be done by tomorrow.
> 
> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> WA_WEIGHT patch.
> 
> Any ETA on when this is going to tip?

Just hit a few hours ago :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-10 15:16                       ` Peter Zijlstra
@ 2017-10-10 17:26                         ` Ingo Molnar
  2017-10-10 17:40                           ` Christian Borntraeger
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2017-10-10 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matt Fleming, Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, Christian Borntraeger,
	KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> > On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> > > 
> > > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > > regress but performance is restored for sockets.
> > > 
> > > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > > patch. So I've queued that up now and it should be done by tomorrow.
> > 
> > Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> > WA_WEIGHT patch.
> > 
> > Any ETA on when this is going to tip?
> 
> Just hit a few hours ago :-)

I admit that time machines are really handy!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: sysbench throughput degradation in 4.13+
  2017-10-10 17:26                         ` Ingo Molnar
@ 2017-10-10 17:40                           ` Christian Borntraeger
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Borntraeger @ 2017-10-10 17:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Matt Fleming, Rik van Riel, Eric Farman, 王金浦,
	LKML, Ingo Molnar, KVM-ML (kvm@vger.kernel.org),
	vcaputo, Matthew Rosato



On 10/10/2017 07:26 PM, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
>>> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
>>>>
>>>> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
>>>> regress but performance is restored for sockets.
>>>>
>>>> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
>>>> patch. So I've queued that up now and it should be done by tomorrow.
>>>
>>> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
>>> WA_WEIGHT patch.
>>>
>>> Any ETA on when this is going to tip?
>>
>> Just hit a few hours ago :-)
> 
> I admit that time machines are really handy!
> 
> Thanks,

Are we going to schedule this for 4.13 stable as well?

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-10-10 17:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-12 14:14 sysbench throughput degradation in 4.13+ Eric Farman
2017-09-13  8:24 ` 王金浦
2017-09-22 15:03   ` Eric Farman
2017-09-22 15:53     ` Peter Zijlstra
2017-09-22 16:12       ` Eric Farman
2017-09-27  9:35         ` Peter Zijlstra
2017-09-27 16:27           ` Eric Farman
2017-09-28  9:08             ` Christian Borntraeger
2017-09-27 17:58           ` Rik van Riel
2017-09-28 11:04             ` Eric Farman
2017-09-28 12:36             ` Peter Zijlstra
2017-09-28 12:37             ` Peter Zijlstra
2017-10-02 22:53             ` Matt Fleming
2017-10-03  8:39               ` Peter Zijlstra
2017-10-03 16:02                 ` Rik van Riel
2017-10-04 16:18                 ` Peter Zijlstra
2017-10-04 18:02                   ` Rik van Riel
2017-10-06 10:36                   ` Matt Fleming
2017-10-10 14:51                     ` Matt Fleming
2017-10-10 15:16                       ` Peter Zijlstra
2017-10-10 17:26                         ` Ingo Molnar
2017-10-10 17:40                           ` Christian Borntraeger
