From: Peter Zijlstra <peterz@infradead.org>
To: Eric Farman <farman@linux.vnet.ibm.com>
Cc: "Rik van Riel" <riel@redhat.com>, 王金浦 <jinpuwang@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Christian Borntraeger" <borntraeger@de.ibm.com>,
	"KVM-ML (kvm@vger.kernel.org)" <kvm@vger.kernel.org>,
	vcaputo@pengaru.com,
	"Matthew Rosato" <mjrosato@linux.vnet.ibm.com>
Subject: Re: sysbench throughput degradation in 4.13+
Date: Wed, 27 Sep 2017 11:35:30 +0200
Message-ID: <20170927093530.s3sgdz2vamc5ka4w@hirez.programming.kicks-ass.net>
In-Reply-To: <754f5a9f-5332-148d-2631-918fc7a7cfe9@linux.vnet.ibm.com>

On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> 
> MySQL.  We've tried a few different configs with both test=oltp and
> test=threads, but both show the same behavior.  What I have settled on for
> my repro is the following:
> 

Right, didn't even need to run it in a guest to observe a regression.

So the below cures native sysbench and the NAS bench for me; does it also
work for your virt thingy?


PRE (current tip/master):

ivb-ex sysbench:

  2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
  5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
 10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
 20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
 40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
 80: [30 secs]     transactions:                        355096 (11834.28 per sec.)

hsw-ex NAS:

OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83


POST (+patch):

ivb-ex sysbench:

  2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
  5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
 10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
 20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
 40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
 80: [30 secs]     transactions:                        631739 (21055.88 per sec.)

hsw-ex NAS:

lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52


This patch takes out all the shiny wake_affine stuff and goes back to
utter basics. Rik, was there another NUMA benchmark that wanted your
fancy stuff? Because NAS isn't it.

(the previous, slightly fancier wake_affine was basically a !idle
extension of this, in that it would pick the 'shortest' of the 2 queues
and thereby run quickest, in approximation)
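
As a purely illustrative sketch (not code from this or the previous
patch: prefer_this_cpu() is a made-up name, it looks at per-rq counts
rather than the LLC-wide stats the removed code used, and the
capacity/imbalance_pct scaling is ignored), that "shortest of the 2
queues" idea boils down to roughly:

static bool prefer_this_cpu(int this_cpu, int prev_cpu, int sync)
{
	unsigned int this_nr = cpu_rq(this_cpu)->nr_running;
	unsigned int prev_nr = cpu_rq(prev_cpu)->nr_running;

	/* a sync wakeup means the waker is about to go to sleep */
	if (sync && this_nr)
		this_nr--;

	/* pull the wakee only if the waking CPU's queue stays the shorter one */
	return this_nr + 1 <= prev_nr;
}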

I'll try and run a number of other benchmarks I have around to see if
there's anything that shows a difference between the below trivial
wake_affine and the old 2-cpu-load one.
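
For reference, the decision the trivial version below makes is just this,
restated outside the diff (the helper name is made up):

static bool wake_affine_trivial(int this_cpu, int sync)
{
	/* if the waking CPU is idle, the wakee can run there right away */
	if (idle_cpu(this_cpu))
		return true;

	/* on a sync wakeup the sole runner (the waker) is about to sleep */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return true;

	return false;
}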

---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 125 ++---------------------------------------
 2 files changed, 6 insertions(+), 127 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70ba32e08a23..66930ce338af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5356,115 +5356,19 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
-
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
-{
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
-
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
-
-	return true;
-}
-
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
-static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
-{
-	struct llc_stats prev_stats, this_stats;
-	s64 this_eff_load, prev_eff_load;
-	unsigned long task_load;
-
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
-
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
-	if (sync) {
-		unsigned long current_load = task_h_load(current);
-
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
-			return true;
-
-		this_stats.load -= current_load;
-	}
-
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
-	task_load = task_h_load(p);
-
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
-
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
-
-	return this_eff_load <= prev_eff_load;
-}
-
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
-
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	bool affine = false;
 
 	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
+	 * If we can run _now_ on the waking CPU, go there, otherwise meh.
 	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (idle_cpu(this_cpu))
+		affine = true;
+	else if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		affine = true;
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7600,7 +7504,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7672,22 +7575,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**

Thread overview: 22+ messages
2017-09-12 14:14 sysbench throughput degradation in 4.13+ Eric Farman
2017-09-13  8:24 ` 王金浦
2017-09-22 15:03   ` Eric Farman
2017-09-22 15:53     ` Peter Zijlstra
2017-09-22 16:12       ` Eric Farman
2017-09-27  9:35         ` Peter Zijlstra [this message]
2017-09-27 16:27           ` Eric Farman
2017-09-28  9:08             ` Christian Borntraeger
2017-09-27 17:58           ` Rik van Riel
2017-09-28 11:04             ` Eric Farman
2017-09-28 12:36             ` Peter Zijlstra
2017-09-28 12:37             ` Peter Zijlstra
2017-10-02 22:53             ` Matt Fleming
2017-10-03  8:39               ` Peter Zijlstra
2017-10-03 16:02                 ` Rik van Riel
2017-10-04 16:18                 ` Peter Zijlstra
2017-10-04 18:02                   ` Rik van Riel
2017-10-06 10:36                   ` Matt Fleming
2017-10-10 14:51                     ` Matt Fleming
2017-10-10 15:16                       ` Peter Zijlstra
2017-10-10 17:26                         ` Ingo Molnar
2017-10-10 17:40                           ` Christian Borntraeger
