linux-kernel.vger.kernel.org archive mirror
From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Josef Bacik <jbacik@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	riel@redhat.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	morten.rasmussen@arm.com, kernel-team <Kernel-team@fb.com>
Subject: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
Date: Tue, 07 Jul 2015 11:43:44 +0200
Message-ID: <1436262224.1836.74.camel@gmail.com>
In-Reply-To: <1436241678.1836.29.camel@gmail.com>

On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
> 
> > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the 
> > baseline with a slight regression at lower RPS and a slight improvement 
> > at high RPS.
> 
> Good.  I can likely drop the rest then (I like dinky, so do CPUs;).  I'm
> not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.

More extensive testing in pedantic-man mode increased my confidence in
that enough to sign off and ship the dirt simple version.  Any further
twiddles should grow their own wings if they want to fly anyway.  The
simplest form helps your real world load as well as the not so real
pgbench; my numbers for that are below.

virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12      tps = 96233.854271     1.000
clients 24      tps = 142234.686166    1.000
clients 36      tps = 148433.534531    1.000
clients 48      tps = 133105.634302    1.000
clients 60      tps = 128903.080371    1.000
clients 72      tps = 128591.821782    1.000
clients 84      tps = 114445.967116    1.000
clients 96      tps = 109803.557524    1.000    avg   125219.017   1.000

V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12      tps = 120793.023637    1.255
clients 24      tps = 144668.961468    1.017
clients 36      tps = 156705.239251    1.055
clients 48      tps = 152004.886893    1.141
clients 60      tps = 138582.113864    1.075
clients 72      tps = 136286.891104    1.059
clients 84      tps = 137420.986043    1.200
clients 96      tps = 135199.060242    1.231   avg   140207.645   1.119   1.000

V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12      tps = 121821.966162    1.265
clients 24      tps = 146446.388366    1.029
clients 36      tps = 151373.362190    1.019
clients 48      tps = 156806.730746    1.178
clients 60      tps = 133933.491567    1.039
clients 72      tps = 131460.489424    1.022
clients 84      tps = 130859.340261    1.143
clients 96      tps = 130787.476584    1.191   avg   137936.155   1.101   0.983

V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12      tps = 121297.791570
clients 24      tps = 145939.488312
clients 36      tps = 155336.090263
clients 48      tps = 149018.245323
clients 60      tps = 136730.079391
clients 72      tps = 134886.116831
clients 84      tps = 130493.283398
clients 96      tps = 126043.336074
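
The right-hand columns in the tables above are not labelled.  The figures
are consistent with them being each run's tps divided by the virgin master
tps for the same client count, with the trailing avg column divided by the
V3 average, but that reading is an inference from the numbers, not something
the mail states.  A minimal sketch of that arithmetic, with hypothetical
helper names mean_tps() and tps_ratio():

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (tps); 0.0 for an empty set. */
double mean_tps(const double *tps, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += tps[i];
	return n ? sum / (double)n : 0.0;
}

/*
 * Throughput relative to a reference run.  1.000 means "same as the
 * reference"; values above 1.000 mean the run was faster.
 */
double tps_ratio(double tps, double ref)
{
	return ref > 0.0 ? tps / ref : 0.0;
}
```

Feeding the eight virgin master samples to mean_tps() gives the avg printed
on the last line of that table, and tps_ratio() of a V3 sample against the
matching virgin master sample gives the ratio shown next to it.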


sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU.  While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 
 	int wake_cpu;
 #endif
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees.  In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other.  With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}
 
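
For anyone who wants to poke at the heuristic outside the kernel, here is a
user-space sketch of the wake_wide() decision before and after the patch.
It is only a model: the function names wake_wide_old() and wake_wide_new()
are made up for illustration, and the flip counts and llc_size are passed in
as plain arguments, where the kernel reads current->wakee_flips,
p->wakee_flips and this_cpu_read(sd_llc_size).

```c
/*
 * Pre-patch logic: returns 1 only when the wakee has flipped more than
 * llc_size times AND the waker is llc_size times hotter still.  When a
 * worker wakes the 1:N dispatcher the roles are reversed, so this
 * returns 0.
 */
int wake_wide_old(unsigned int waker_flips, unsigned int wakee_flips,
		  unsigned int llc_size)
{
	if (wakee_flips > llc_size && waker_flips > llc_size * wakee_flips)
		return 1;
	return 0;
}

/*
 * Post-patch logic: order the two counts first, so the test is the same
 * no matter which side of the 1:N relationship is doing the waking.
 */
int wake_wide_new(unsigned int waker_flips, unsigned int wakee_flips,
		  unsigned int llc_size)
{
	if (waker_flips < wakee_flips) {
		/* open-coded stand-in for the kernel's swap() macro */
		unsigned int tmp = waker_flips;

		waker_flips = wakee_flips;
		wakee_flips = tmp;
	}
	if (wakee_flips < llc_size || waker_flips < wakee_flips * llc_size)
		return 0;
	return 1;
}
```

With made-up numbers, say llc_size 8, a dispatcher at 200 flips and a worker
at 10: both versions return 1 when the dispatcher wakes the worker, but only
the new one returns 1 when the worker wakes the dispatcher.  A 1:1 pair at
50 flips each gets 0 from both.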


