From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754036AbcEBOuT (ORCPT );
	Mon, 2 May 2016 10:50:19 -0400
Received: from mx2.suse.de ([195.135.220.15]:44493 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753598AbcEBOuI (ORCPT );
	Mon, 2 May 2016 10:50:08 -0400
Message-ID: <1462200604.3736.42.camel@suse.de>
Subject: Re: sched: tweak select_idle_sibling to look for idle threads
From: Mike Galbraith
To: Peter Zijlstra
Cc: Chris Mason, Ingo Molnar, Matt Fleming, linux-kernel@vger.kernel.org
Date: Mon, 02 May 2016 16:50:04 +0200
In-Reply-To: <20160502084615.GB3430@twins.programming.kicks-ass.net>
References: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com>
	 <20160409190554.honue3gtian2p6vr@floor.thefacebook.com>
	 <20160430124731.GE2975@worktop.cust.blueprintrf.com>
	 <1462086753.9717.29.camel@suse.de>
	 <20160502084615.GB3430@twins.programming.kicks-ass.net>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.16.5
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2016-05-02 at 10:46 +0200, Peter Zijlstra wrote:
> On Sun, May 01, 2016 at 09:12:33AM +0200, Mike Galbraith wrote:
> 
> > Nah, tbench is just variance prone.  It got dinged up at clients=cores
> > on my desktop box, on 4 sockets the high end got seriously dinged up.
> 
> Ha!, check this:
> 
> root@ivb-ep:~# echo OLD_IDLE > /debug/sched_features ; echo
> NO_ORDER_IDLE > /debug/sched_features ; echo IDLE_CORE >
> /debug/sched_features ; echo NO_FORCE_CORE > /debug/sched_features ;
> tbench 20 -t 10
> 
> Throughput 5956.32 MB/sec  20 clients  20 procs  max_latency=0.126 ms
> 
> root@ivb-ep:~# echo OLD_IDLE > /debug/sched_features ; echo ORDER_IDLE
> > /debug/sched_features ; echo IDLE_CORE > /debug/sched_features ; echo
> NO_FORCE_CORE > /debug/sched_features ; tbench 20 -t 10
> 
> Throughput 5011.86 MB/sec  20 clients  20 procs  max_latency=0.116 ms
> 
> That little ORDER_IDLE thing hurts silly.  That's a little patch I had
> lying about because some people complained that tasks hop around the
> cache domain, instead of being stuck to a CPU.
> 
> I suspect what happens is that by all CPUs starting to look for idle at
> the same place (the first cpu in the domain) they all find the same idle
> cpu and things pile up.
> 
> The old behaviour, where they all start iterating from where they were,
> avoids some of that, at the cost of making tasks hop around.
> 
> Let's see if I can get the same behaviour out of the cpumask iteration
> code..

Order is one thing, but what the old behavior does first and foremost is
that when the box starts getting really busy, only looking at target's
sibling shuts select_idle_sibling() down instead of letting it wreck
things.  Once cores are moving, there are no large piles of anything left
to collect other than pain.

We really need a good way to know we're not gonna turn the box into a
shredder.  The wake_wide() thing might help some (it likely wants some
twiddling), and in_interrupt() might be another time to try hard.

Anyway, the has_idle_cores business seems to shut select_idle_sibling()
down rather nicely when the box gets busy.  Forcing either core, target's
sibling, or go fish turned in a top-end win at 48 rq/socket.

Oh btw, did you know single socket boxen have no sd_busy?  That doesn't
look right.
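
To picture the pile-up Peter describes above, here is a tiny standalone C
toy (not the scheduler code; the CPU count, idleness pattern, and helper
names are all made up for illustration).  If every waker scans the domain
from the first CPU, concurrent wakers converge on the same idle CPU; if
each scan starts from the waker's previous CPU, the picks spread out:

/* pileup.c -- toy illustration only, not kernel code. */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS   20	/* pretend LLC domain size */
#define NR_WAKERS  8	/* wakeups happening at roughly the same time */

static bool cpu_is_idle[NR_CPUS];

/* Walk the domain starting at @start, wrapping; return the first idle CPU. */
static int find_idle_cpu(int start)
{
	for (int off = 0; off < NR_CPUS; off++) {
		int cpu = (start + off) % NR_CPUS;
		if (cpu_is_idle[cpu])
			return cpu;
	}
	return -1;
}

static void run(const char *name, bool fixed_origin)
{
	/* Toy state: every even-numbered CPU is idle. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_is_idle[cpu] = !(cpu & 1);

	printf("%-20s:", name);
	/*
	 * All wakers scan "concurrently", i.e. none of them sees the
	 * others' choice before making its own.
	 */
	for (int w = 0; w < NR_WAKERS; w++) {
		int prev_cpu = (2 * w + 1) % NR_CPUS;	/* where the task last ran */
		int start = fixed_origin ? 0 : prev_cpu;
		printf(" %2d", find_idle_cpu(start));
	}
	printf("\n");
}

int main(void)
{
	run("scan from first cpu", true);	/* everyone picks CPU 0 */
	run("scan from prev cpu", false);	/* picks spread over the domain */
	return 0;
}

Built with any C compiler, the first run() prints the same CPU for every
waker and the second spreads the picks across the domain, which is the
pile-up vs. hop-around trade-off being discussed.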
fromm:~/:[0]# for i in 1 2 4 8 16 32 64 128 256; do tbench.sh $i 30 2>&1| grep Throughput; done
Throughput 511.016 MB/sec  1 clients  1 procs  max_latency=0.113 ms
Throughput 1042.03 MB/sec  2 clients  2 procs  max_latency=0.098 ms
Throughput 1953.12 MB/sec  4 clients  4 procs  max_latency=0.236 ms
Throughput 3694.99 MB/sec  8 clients  8 procs  max_latency=0.308 ms
Throughput 7080.95 MB/sec  16 clients  16 procs  max_latency=0.442 ms
Throughput 13444.7 MB/sec  32 clients  32 procs  max_latency=1.417 ms
Throughput 20191.3 MB/sec  64 clients  64 procs  max_latency=4.554 ms
Throughput 41115.4 MB/sec  128 clients  128 procs  max_latency=13.414 ms
Throughput 66844.4 MB/sec  256 clients  256 procs  max_latency=50.069 ms

	/*
	 * If there are idle cores to be had, go find one.
	 */
	if (sched_feat(IDLE_CORE) && test_idle_cores(target)) {
		i = select_idle_core(p, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;

		/*
		 * Failed to find an idle core; stop looking for one.
		 */
		clear_idle_cores(target);
	}
#if 1
	for_each_cpu(i, cpu_smt_mask(target)) {
		if (idle_cpu(i))
			return i;
	}

	return target;
#endif

	if (sched_feat(FORCE_CORE)) {
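
For completeness, here is a minimal userspace sketch of the has_idle_cores
hint the snippet above leans on (all names, the 8-CPU/2-way-SMT layout, and
the idle model are invented for illustration; this is not the kernel
implementation): the idle-core scan only runs while a per-LLC hint is set,
a failed scan clears the hint so a busy box stops paying for the scan, and
a core going fully idle re-arms it:

/* has_idle_cores.c -- toy illustration only, not the kernel implementation. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS   8			/* pretend LLC with 4 cores ... */
#define SMT_WIDTH 2			/* ... of 2 SMT siblings each   */

static bool cpu_is_idle[NR_CPUS];		/* toy per-CPU idle state */
static atomic_bool has_idle_cores = true;	/* the per-LLC hint       */

/* A core counts as idle only if every SMT sibling on it is idle. */
static bool core_is_idle(int core)
{
	for (int t = 0; t < SMT_WIDTH; t++)
		if (!cpu_is_idle[core * SMT_WIDTH + t])
			return false;
	return true;
}

/* Return the first CPU of an idle core, or -1; clear the hint on failure. */
static int select_idle_core(void)
{
	if (!atomic_load(&has_idle_cores))
		return -1;		/* busy box: skip the scan entirely */

	for (int core = 0; core < NR_CPUS / SMT_WIDTH; core++)
		if (core_is_idle(core))
			return core * SMT_WIDTH;

	atomic_store(&has_idle_cores, false);	/* scan failed: stop looking */
	return -1;
}

/* A CPU went idle; re-arm the hint if its whole core is now idle. */
static void cpu_goes_idle(int cpu)
{
	cpu_is_idle[cpu] = true;
	if (core_is_idle(cpu / SMT_WIDTH))
		atomic_store(&has_idle_cores, true);
}

int main(void)
{
	printf("scan on a busy box  -> %d\n", select_idle_core()); /* -1, clears hint */
	printf("second scan         -> %d\n", select_idle_core()); /* -1, no scan done */
	cpu_goes_idle(4);
	cpu_goes_idle(5);					   /* core 2 now fully idle */
	printf("after a core idles  -> %d\n", select_idle_core()); /* 4 */
	return 0;
}

Keeping it down to one hint per LLC rather than exact per-core state is
what makes the saturated case degrade to a single flag check before giving
up, which matches the "shuts select_idle_sibling() down when the box gets
busy" behavior described above.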