From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Josef Bacik <jbacik@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
riel@redhat.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
morten.rasmussen@arm.com, kernel-team <Kernel-team@fb.com>
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
Date: Sat, 04 Jul 2015 17:57:42 +0200 [thread overview]
Message-ID: <1436025462.17152.37.camel@gmail.com> (raw)
In-Reply-To: <1435905658.6418.52.camel@gmail.com>
On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:
> Hm. Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
>
> Ponder..
Since it was too hot to do outside chores (any excuse will do;)...
If we're (read /me) on track, the bellow should help. Per my tracing,
it may want a wee bit of toning down actually, though when I trace
virgin source I expect to see the same, namely Xorg and friends having
"wide-load" tattooed across their hindquarters earlier than they should.
It doesn't seem to hurt anything, but then demolishing a single llc box
is a tad more difficult than demolishing a NUMA box.
sched: beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when waking one of its N minions.
Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the
the 1:N dispatcher of work, thus worth some extra effort.
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
kernel/sched/fair.c | 89 +++++++++++++++++++++++++-----------------------
kernel/sched/features.h | 6 +++
2 files changed, 54 insertions(+), 41 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c
}
#ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, void *clear);
static unsigned long task_h_load(struct task_struct *p);
static inline void __update_task_entity_contrib(struct sched_entity *se);
@@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas
* Call select_idle_sibling to maybe find a better one.
*/
if (!cur)
- env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL);
assign:
task_numa_assign(env, cur, imp);
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolodating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of lls_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
- int factor = this_cpu_read(sd_llc_size);
-
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
+ unsigned long waker_flips = current->wakee_flips;
+ unsigned long wakee_flips = p->wakee_flips;
+ int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+ if (waker_flips < wakee_flips) {
+ swap(waker_flips, wakee_flips);
+ /* Tell the caller that we're waking a 1:N waker */
+ ret += sched_feat(WAKE_WIDE_BALANCE);
}
-
- return 0;
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return ret;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, void *clear)
{
struct sched_domain *sd;
struct sched_group *sg;
int i = task_cpu(p);
if (idle_cpu(target))
- return target;
+ goto done;
/*
* If the prevous cpu is cache affine and idle, don't be stupid.
*/
- if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
- return i;
+ if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) {
+ target = i;
+ goto done;
+ }
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta
sg = sg->next;
} while (sg != sd->groups);
}
+ return target;
done:
+ if (clear)
+ *(void **)clear = 0;
+
return target;
}
/*
@@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
- int want_affine = 0;
+ int new_cpu = prev_cpu;
+ int want_affine = 0, want_balance = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = wake_wide(p);
+ want_balance = want_affine > 1;
+ want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine && !want_balance)
+ goto select;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct *
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag || want_balance)
sd = tmp;
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+select:
+ new_cpu = select_idle_sibling(p, new_cpu, &sd);
}
while (sd) {
struct sched_group *group;
int weight;
- if (!(sd->flags & sd_flag)) {
+ if (!(sd->flags & sd_flag) && !want_balance) {
sd = sd->child;
continue;
}
@@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_BALANCE, false)
next prev parent reply other threads:[~2015-07-05 10:29 UTC|newest]
Thread overview: 73+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik
2015-05-28 3:46 ` Mike Galbraith
2015-05-28 9:49 ` Morten Rasmussen
2015-05-28 10:57 ` Mike Galbraith
2015-05-28 11:48 ` Morten Rasmussen
2015-05-28 11:49 ` Mike Galbraith
2015-05-28 10:21 ` Peter Zijlstra
2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 14:27 ` Josef Bacik
2015-05-29 21:03 ` Josef Bacik
2015-05-30 3:55 ` Mike Galbraith
2015-06-01 19:38 ` Josef Bacik
2015-06-01 20:42 ` Peter Zijlstra
2015-06-01 21:03 ` Josef Bacik
2015-06-02 17:12 ` Josef Bacik
2015-06-03 14:12 ` Rik van Riel
2015-06-03 14:24 ` Peter Zijlstra
2015-06-03 14:49 ` Josef Bacik
2015-06-03 15:30 ` Mike Galbraith
2015-06-03 15:57 ` Josef Bacik
2015-06-03 16:53 ` Mike Galbraith
2015-06-03 17:16 ` Josef Bacik
2015-06-03 17:43 ` Mike Galbraith
2015-06-03 20:34 ` Josef Bacik
2015-06-04 4:52 ` Mike Galbraith
2015-06-01 22:15 ` Rik van Riel
2015-06-11 20:33 ` Josef Bacik
2015-06-12 3:42 ` Rik van Riel
2015-06-12 5:35 ` Mike Galbraith
2015-06-17 18:06 ` Josef Bacik
2015-06-18 0:55 ` Mike Galbraith
2015-06-18 3:46 ` Josef Bacik
2015-06-18 4:12 ` Mike Galbraith
2015-07-02 17:44 ` Josef Bacik
2015-07-03 6:40 ` Mike Galbraith
2015-07-03 9:29 ` Mike Galbraith
2015-07-04 15:57 ` Mike Galbraith [this message]
2015-07-05 7:17 ` Mike Galbraith
2015-07-06 5:13 ` Mike Galbraith
2015-07-06 14:34 ` Josef Bacik
2015-07-06 18:36 ` Mike Galbraith
2015-07-06 19:41 ` Josef Bacik
2015-07-07 4:01 ` Mike Galbraith
2015-07-07 9:43 ` [patch] " Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
2015-07-07 15:24 ` Mike Galbraith
2015-07-07 17:06 ` Josef Bacik
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
2015-07-09 14:46 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
2015-07-11 3:11 ` Mike Galbraith
2015-07-13 13:53 ` Josef Bacik
2015-07-14 11:19 ` Peter Zijlstra
2015-07-14 13:49 ` Mike Galbraith
2015-07-14 14:07 ` Peter Zijlstra
2015-07-14 14:17 ` Mike Galbraith
2015-07-14 15:04 ` Peter Zijlstra
2015-07-14 15:39 ` Mike Galbraith
2015-07-14 16:01 ` Josef Bacik
2015-07-14 17:59 ` Mike Galbraith
2015-07-15 17:11 ` Josef Bacik
2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith
2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith
2015-05-28 11:49 ` Ingo Molnar
2015-05-28 12:15 ` Mike Galbraith
2015-05-28 12:19 ` Peter Zijlstra
2015-05-28 12:29 ` Ingo Molnar
2015-05-28 15:22 ` David Ahern
2015-05-28 11:55 ` Srikar Dronamraju
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1436025462.17152.37.camel@gmail.com \
--to=umgwanakikbuti@gmail.com \
--cc=Kernel-team@fb.com \
--cc=jbacik@fb.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=morten.rasmussen@arm.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).