From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753905AbbE1Jrs (ORCPT <rfc822;w@1wt.eu>);
	Thu, 28 May 2015 05:47:48 -0400
Received: from foss.arm.com ([217.140.101.70]:54307 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753821AbbE1JrY (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 28 May 2015 05:47:24 -0400
Date: Thu, 28 May 2015 10:49:14 +0100
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>, "riel@redhat.com" <riel@redhat.com>,
        "mingo@redhat.com" <mingo@redhat.com>,
        "peterz@infradead.org" <peterz@infradead.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for
 BALANCE_WAKE
Message-ID: <20150528094914.GJ26396@e105550-lin.cambridge.arm.com>
References: <1432761736-22093-1-git-send-email-jbacik@fb.com>
 <1432784798.3237.81.camel@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1432784798.3237.81.camel@gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 28, 2015 at 04:46:38AM +0100, Mike Galbraith wrote:
> On Wed, 2015-05-27 at 17:22 -0400, Josef Bacik wrote:
> > [ sorry if you get this twice, it seems like the first submission got lost ]
> > 
> > At Facebook we have a pretty heavily multi-threaded application that is
> > sensitive to latency.  We have been pulling forward the old SD_WAKE_IDLE code
> > because it gives us a pretty significant performance gain (like 20%).  It turns
> > out this is because there are cases where the scheduler puts our task on a busy
> > CPU when there are idle CPUs in the system.  We verify this by reading the
> > cpu_delay_req_avg_us from the scheduler netlink stuff.  With our crappy patch we
> > get much lower numbers vs baseline.
> > 
> > SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however it is just
> > looking for an idle sibling, preferring affinity over all else.  This is not
> > helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu, not
> > guarantee affinity.  Fix this by first trying to find an idle sibling, and then
> > if the cpu is not idle fall through to the logic to find an idle cpu.  With this
> > patch we get slightly better performance than with our forward port of
> > SD_WAKE_IDLE.  Thanks,
> 
> The job description isn't really find idle. it's find least loaded.

And make sure that the task doesn't migrate away from any data that
might still be in the last-level cache?

IIUC, the goal of SD_BALANCE_WAKE is changed from finding the least
loaded target cpu that shares the last-level cache with the previous
cpu, to finding an idle cpu, preferring ones that share the last-level
cache but extending the search beyond sd_llc if necessary.
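
To make the two readings concrete, here is a toy userspace model of
both policies. It is only a sketch: struct toy_cpu, llc_id, nr_running
and both pick_*() helpers are made up for illustration and are not
kernel code, and neither function claims to match what
select_idle_sibling() does in detail.

```c
/* Hypothetical per-cpu state for the sketch (not a kernel structure). */
struct toy_cpu {
	int llc_id;              /* cpus with equal llc_id share a last-level cache */
	unsigned int nr_running; /* 0 means the cpu is idle */
};

/* Reading 1: least loaded cpu that shares the LLC with prev_cpu. */
static int pick_least_loaded_in_llc(const struct toy_cpu *cpus, int n,
				    int prev_cpu)
{
	int best = prev_cpu;

	for (int i = 0; i < n; i++) {
		if (cpus[i].llc_id != cpus[prev_cpu].llc_id)
			continue;
		if (cpus[i].nr_running < cpus[best].nr_running)
			best = i;
	}
	return best;
}

/*
 * Reading 2: any idle cpu, preferring one inside prev_cpu's LLC but
 * searching beyond it if the LLC has none. Falls back to prev_cpu
 * when nothing is idle.
 */
static int pick_idle_prefer_llc(const struct toy_cpu *cpus, int n,
				int prev_cpu)
{
	int outside = -1;

	if (cpus[prev_cpu].nr_running == 0)
		return prev_cpu;

	for (int i = 0; i < n; i++) {
		if (cpus[i].nr_running != 0)
			continue;
		if (cpus[i].llc_id == cpus[prev_cpu].llc_id)
			return i;	/* idle and cache-affine: best case */
		if (outside < 0)
			outside = i;	/* remember first idle cpu outside the LLC */
	}
	return outside >= 0 ? outside : prev_cpu;
}
```

With a busy LLC and an idle cpu elsewhere, the first policy stays
inside the LLC while the second one leaves it, which is exactly where
the cache-affinity question above comes in.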

> 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 241213b..03dafa3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		new_cpu = select_idle_sibling(p, prev_cpu);
> > -		goto unlock;
> > +		if (idle_cpu(new_cpu))
> > +			goto unlock;
> >  	}
> >  
> >  	while (sd) {
> 
> Instead of doing what for most will be a redundant idle_cpu() call,
> perhaps a couple cycles can be saved if you move the sd assignment above
> affine_sd assignment, and say if (!sd || idle_cpu(new_cpu)) ?

Isn't sd == NULL in most cases if you don't move the sd assignment
before the affine_sd assignment? The break after the affine_sd
assignment means that sd isn't assigned under 'normal' circumstances
(sibling waking cpu, no restrictions in tsk_cpus_allowed(p), and
SD_WAKE_AFFINE set) which causes the while (sd) loop to be bypassed and
we end up returning the result of select_idle_sibling() anyway.
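
The control flow in question can be modelled in a few lines of
userspace C. This is a sketch under assumptions: struct toy_domain,
toy_walk() and the flag values are invented for illustration, and the
loop only mirrors the shape of the domain walk in
select_task_rq_fair() as I read it, not the real code.

```c
#include <stddef.h>

/* Illustrative flag values, not the kernel's. */
#define SD_LOAD_BALANCE  0x1
#define SD_BALANCE_WAKE  0x2
#define SD_WAKE_AFFINE   0x4

/* Hypothetical stand-in for struct sched_domain. */
struct toy_domain {
	unsigned int flags;
	unsigned long span;        /* bitmask of cpus covered by this domain */
	struct toy_domain *parent; /* next level up, NULL at the top */
};

struct walk_result {
	struct toy_domain *affine_sd;
	struct toy_domain *sd;
};

/* Walk the domains bottom-up, in the same order as the affine_sd/sd logic. */
static struct walk_result toy_walk(struct toy_domain *lowest, int prev_cpu,
				   int want_affine, unsigned int sd_flag)
{
	struct walk_result r = { NULL, NULL };
	struct toy_domain *tmp;

	for (tmp = lowest; tmp; tmp = tmp->parent) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			continue;

		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    (tmp->span & (1UL << prev_cpu))) {
			r.affine_sd = tmp;
			break;	/* leaves the loop before sd can be assigned here */
		}

		if (tmp->flags & sd_flag)
			r.sd = tmp;
	}
	return r;
}
```

When prev_cpu is already inside the lowest domain's span, the break
fires on the first iteration and sd stays NULL, so a following
while (sd) loop would never run. sd only ends up non-NULL if some
lower domain is visited before the affine one, or if want_affine is 0.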

I must be missing something?

Morten