From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753308AbbGNOHU (ORCPT <rfc822;w@1wt.eu>);
	Tue, 14 Jul 2015 10:07:20 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:39284 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751777AbbGNOHT (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 14 Jul 2015 10:07:19 -0400
Date: Tue, 14 Jul 2015 16:07:10 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>, riel@redhat.com, mingo@redhat.com,
        linux-kernel@vger.kernel.org, morten.rasmussen@arm.com,
        kernel-team <Kernel-team@fb.com>
Subject: Re: [patch] sched: beef up wake_wide()
Message-ID: <20150714140710.GL19282@twins.programming.kicks-ass.net>
References: <1436241678.1836.29.camel@gmail.com>
 <1436262224.1836.74.camel@gmail.com>
 <559C0700.6090009@fb.com>
 <1436336026.3767.53.camel@gmail.com>
 <20150709132654.GE3644@twins.programming.kicks-ass.net>
 <1436505566.5715.50.camel@gmail.com>
 <55A03232.2090500@fb.com>
 <1436584311.3429.7.camel@gmail.com>
 <20150714111905.GJ3644@twins.programming.kicks-ass.net>
 <1436881757.7983.12.camel@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1436881757.7983.12.camel@gmail.com>
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
> 
> > OK, how about something like the below; it tightens things up by
> > applying two rules:
> > 
> >  - We really should not continue looking for a balancing domain once
> >    SD_LOAD_BALANCE is not set.
> > 
> >  - SD (balance) flags should really be set in a single contiguous range,
> >    always starting at the bottom.
> > 
> > The latter means that if !want_affine and the (first) sd doesn't have
> > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> > junk you didn't like..
> > 
> > Hmm?
> 
> Yeah, that's better.  It's not big hairy deal either way, it just bugged
> me to knowingly toss those cycles out the window ;-)
> 
> select_idle_sibling() looks kinda funny down there, but otoh when the
> day comes (hah) that we can just balance, it's closer to the exit.

Right, not too pretty, does this look better?

---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <umgwanakikbuti@gmail.com>
Date: Fri, 10 Jul 2015 07:19:26 +0200

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU.  While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.

Correct that, letting explicit domain flags override the heuristic.

While at it, adjust task_struct bits, we don't need a 64bit counter.

Cc: mingo@redhat.com
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com
---
 include/linux/sched.h |    4 +-
 kernel/sched/fair.c   |   68 +++++++++++++++++++++++---------------------------
 2 files changed, 34 insertions(+), 38 deletions(-)
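
For anyone wanting to poke at the heuristic outside the kernel, the
wake_wide() logic in the fair.c hunk below can be sketched roughly as
follows.  This is only a user-space approximation, not the patch itself:
wake_wide_sketch() and its parameters are made-up names, with
waker_flips/wakee_flips standing in for current->wakee_flips and
p->wakee_flips, and llc_size for this_cpu_read(sd_llc_size).

```c
#include <stdbool.h>

/*
 * User-space sketch of the wake_wide() heuristic; not kernel code.
 * The function name and all three parameters are hypothetical:
 *   waker_flips ~ current->wakee_flips
 *   wakee_flips ~ p->wakee_flips
 *   llc_size    ~ this_cpu_read(sd_llc_size)
 */
static bool wake_wide_sketch(unsigned int waker_flips,
			     unsigned int wakee_flips,
			     unsigned int llc_size)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	/* Order the pair so that master is the more frequent flipper. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	/*
	 * Stay affine unless the slower partner flips at least llc_size
	 * times and the faster one at least llc_size times more often.
	 */
	if (slave < llc_size || master < slave * llc_size)
		return false;

	return true;
}
```

With llc_size = 4, flip counts of (64, 8) and (8, 64) both come out
"wide", since the result no longer depends on which side is the waker,
while (16, 8) and (100, 3) stay affine.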

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 
 	int wake_cpu;
 #endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees.  In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other.  With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size.  Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int master = current->wakee_flips;
+	unsigned int slave = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (master < slave)
+		swap(master, slave);
+	if (slave < factor || master < slave * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
-	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
 	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
-			continue;
+			break;
 
 		/*
 		 * If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct *
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
+		else if (!want_affine)
+			break;
 	}
 
-	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+	if (affine_sd) { /* Prefer affinity over any other flags */
+		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+			new_cpu = cpu;
 
-	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
-	}
+		new_cpu = select_idle_sibling(p, new_cpu);
 
-	while (sd) {
+	} else while (sd) {
 		struct sched_group *group;
 		int weight;
 
@@ -5085,10 +5081,10 @@ select_task_rq_fair(struct task_struct *
 		}
 		/* while loop will break here if sd == NULL */
 	}
-unlock:
-	rcu_read_unlock();
 
+	rcu_read_unlock();
 	return new_cpu;
+
 }
 
 /*