Subject: Re: sched: tweak select_idle_sibling to look for idle threads
From: Mike Galbraith
To: Yuyang Du
Cc: Peter Zijlstra, Chris Mason, Ingo Molnar, Matt Fleming, linux-kernel@vger.kernel.org
Date: Wed, 11 May 2016 06:17:51 +0200
Message-ID: <1462940271.3717.57.camel@gmail.com>
In-Reply-To: <20160510191646.GA4870@intel.com>

On Wed, 2016-05-11 at 03:16 +0800, Yuyang Du wrote:
> On Tue, May 10, 2016 at 05:26:05PM +0200, Mike Galbraith wrote:
> > On Tue, 2016-05-10 at 09:49 +0200, Mike Galbraith wrote:
> > 
> > > Only whacking cfs_rq_runnable_load_avg() with a rock makes
> > > schbench -m -t -a work well.  'Course a rock in its gearbox also
> > > rendered load balancing fairly busted for the general case :)
> > 
> > A smaller rock doesn't injure heavy tbench, but more importantly, it
> > still demonstrates the issue when you want full spread.
> > 
> > schbench -m4 -t38 -a
> > 
> > cputime 30000 threads 38 p99 177
> > cputime 30000 threads 39 p99 10160
> > 
> > LB_TIP_AVG_HIGH
> > cputime 30000 threads 38 p99 193
> > cputime 30000 threads 39 p99 184
> > cputime 30000 threads 40 p99 203
> > cputime 30000 threads 41 p99 202
> > cputime 30000 threads 42 p99 205
> > cputime 30000 threads 43 p99 218
> > cputime 30000 threads 44 p99 237
> > cputime 30000 threads 45 p99 245
> > cputime 30000 threads 46 p99 262
> > cputime 30000 threads 47 p99 296
> > cputime 30000 threads 48 p99 3308
> > 
> > 47*4+4=nr_cpus yay
> 
> yay... and haha, "a perfect world"...

Yup.. for this load.

> > ---
> >  kernel/sched/fair.c     | 3 +++
> >  kernel/sched/features.h | 1 +
> >  2 files changed, 4 insertions(+)
> > 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3027,6 +3027,9 @@ void remove_entity_load_avg(struct sched
> >  
> >  static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
> > {
> > +	if (sched_feat(LB_TIP_AVG_HIGH) && cfs_rq->load.weight > cfs_rq->runnable_load_avg*2)
> > +		return cfs_rq->runnable_load_avg + min_t(unsigned long, NICE_0_LOAD,
> > +							 cfs_rq->load.weight/2);
> > 
> >  	return cfs_rq->runnable_load_avg;
> > }
> 
> cfs_rq->runnable_load_avg is for sure no greater than (in this case much
> less than, maybe 1/2 of) load.weight, whereas load_avg is not necessarily
> a rock in the gearbox that only impedes speeding up; it impedes slowing
> down too.

Yeah, just like everything else, it cuts both ways (which is why you can't
win the sched game).  If I can believe tbench, at tasks=cpus, reducing lag
increased utilization and reduced latency a wee bit, as did the reserve
thing once a booboo got fixed up.  Makes sense: robbing Peter to pay Paul
should work out better for Paul.
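(Aside: the features.h hunk of the diffstat isn't shown above.  As a
standalone illustration of what the fair.c hunk does, here's a hedged
sketch; NICE_0_LOAD=1024 and the sample weights are stand-in values for
illustration, not kernel code, and `tipped_load` is a hypothetical name.)

```python
NICE_0_LOAD = 1024  # assumed fixed-point load unit for this sketch

def tipped_load(weight, runnable_avg):
    """Sketch of the patched cfs_rq_runnable_load_avg() logic: when the
    instantaneous weight is more than double the decayed average, tip the
    reported load upward by up to NICE_0_LOAD, so a CPU whose average lags
    its real occupancy looks busier and fresh wakeups get spread."""
    if weight > runnable_avg * 2:
        return runnable_avg + min(NICE_0_LOAD, weight // 2)
    return runnable_avg

# Average lags far behind weight: report is tipped high.
print(tipped_load(4096, 1000))   # 1000 + min(1024, 2048) = 2024
# Average tracks weight: report unchanged.
print(tipped_load(2048, 1500))   # 1500
```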
(avg columns: throughput MB/sec, normalized to first block, max_latency ms,
normalized to first block)

NO_LB_TIP_AVG_HIGH
Throughput 27132.9 MB/sec  96 clients  96 procs  max_latency=7.656 ms
Throughput 28464.1 MB/sec  96 clients  96 procs  max_latency=9.905 ms
Throughput 25369.8 MB/sec  96 clients  96 procs  max_latency=7.192 ms
Throughput 25670.3 MB/sec  96 clients  96 procs  max_latency=5.874 ms
Throughput 29309.3 MB/sec  96 clients  96 procs  max_latency=1.331 ms
avg        27189          1.000                  6.391          1.000

NO_LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 24437.5 MB/sec  96 clients  96 procs  max_latency=1.837 ms
Throughput 29464.7 MB/sec  96 clients  96 procs  max_latency=1.594 ms
Throughput 28023.6 MB/sec  96 clients  96 procs  max_latency=1.494 ms
Throughput 28299.0 MB/sec  96 clients  96 procs  max_latency=10.404 ms
Throughput 29072.1 MB/sec  96 clients  96 procs  max_latency=5.575 ms
avg        27859          1.024                  4.180          0.654

LB_TIP_AVG_HIGH NO_IDLE_RESERVE
Throughput 29068.1 MB/sec  96 clients  96 procs  max_latency=5.599 ms
Throughput 26435.6 MB/sec  96 clients  96 procs  max_latency=3.703 ms
Throughput 23930.0 MB/sec  96 clients  96 procs  max_latency=7.742 ms
Throughput 29464.2 MB/sec  96 clients  96 procs  max_latency=1.549 ms
Throughput 24250.9 MB/sec  96 clients  96 procs  max_latency=1.518 ms
avg        26629          0.979                  4.022          0.629

LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 30340.1 MB/sec  96 clients  96 procs  max_latency=1.465 ms
Throughput 29042.9 MB/sec  96 clients  96 procs  max_latency=4.515 ms
Throughput 26718.7 MB/sec  96 clients  96 procs  max_latency=1.822 ms
Throughput 28694.4 MB/sec  96 clients  96 procs  max_latency=1.503 ms
Throughput 28918.2 MB/sec  96 clients  96 procs  max_latency=7.599 ms
avg        28742          1.057                  3.380          0.528

> But I really don't know what kind of load the references in
> select_task_rq() should be.  So maybe the real issue is that they are a
> mix, i.e., conflated balancing and just wanting an idle cpu?

Depends on the goal.  For both, load lagging reality means the high
frequency component is squelched, meaning less migration cost, but also
higher latency due to stacking.
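For what it's worth, the avg lines can be reproduced from the raw runs; a
small sketch using the first two blocks (numbers copied from above, first
block taken as the baseline for the normalized columns):

```python
def avg(xs):
    return sum(xs) / len(xs)

# NO_LB_TIP_AVG_HIGH (baseline)
base_tput = [27132.9, 28464.1, 25369.8, 25670.3, 29309.3]
base_lat  = [7.656, 9.905, 7.192, 5.874, 1.331]
# NO_LB_TIP_AVG_HIGH IDLE_RESERVE
resv_tput = [24437.5, 29464.7, 28023.6, 28299.0, 29072.1]
resv_lat  = [1.837, 1.594, 1.494, 10.404, 5.575]

print(avg(base_tput), avg(base_lat))      # ~27189 MB/sec, ~6.39 ms
print(avg(resv_tput) / avg(base_tput))    # ~1.02 -> throughput up slightly
print(avg(resv_lat) / avg(base_lat))      # ~0.65 -> max latency down
```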
It's a tradeoff where Chris' "latency is everything" benchmark, and
_maybe_ the real world load it's based upon, is on Peter's end of the
rob-Peter-to-pay-Paul transaction.  The benchmark says it definitely is;
the real world load may have already been fixed up by the
select_idle_sibling() rewrite.

	-Mike