Subject: Re: sched: tweak select_idle_sibling to look for idle threads
From: Mike Galbraith
To: Yuyang Du
Cc: Peter Zijlstra, Chris Mason, Ingo Molnar, Matt Fleming, linux-kernel@vger.kernel.org
Date: Wed, 11 May 2016 06:17:51 +0200
Message-ID: <1462940271.3717.57.camel@gmail.com>
In-Reply-To: <20160510191646.GA4870@intel.com>

On Wed, 2016-05-11 at 03:16 +0800, Yuyang Du wrote:
> On Tue, May 10, 2016 at 05:26:05PM +0200, Mike Galbraith wrote:
> > On Tue, 2016-05-10 at 09:49 +0200, Mike Galbraith wrote:
> > 
> > > Only whacking cfs_rq_runnable_load_avg() with a rock makes
> > > schbench -m -t -a work well.  'Course a rock in its gearbox also
> > > rendered load balancing fairly busted for the general case :)
> > 
> > A smaller rock doesn't injure heavy tbench, but more importantly, it
> > still demonstrates the issue when you want full spread.
> > 
> > schbench -m4 -t38 -a
> > 
> > cputime 30000 threads 38 p99 177
> > cputime 30000 threads 39 p99 10160
> > 
> > LB_TIP_AVG_HIGH
> > cputime 30000 threads 38 p99 193
> > cputime 30000 threads 39 p99 184
> > cputime 30000 threads 40 p99 203
> > cputime 30000 threads 41 p99 202
> > cputime 30000 threads 42 p99 205
> > cputime 30000 threads 43 p99 218
> > cputime 30000 threads 44 p99 237
> > cputime 30000 threads 45 p99 245
> > cputime 30000 threads 46 p99 262
> > cputime 30000 threads 47 p99 296
> > cputime 30000 threads 48 p99 3308
> > 
> > 47*4+4=nr_cpus yay
> 
> yay... and haha, "a perfect world"...

Yup.. for this load.

> > ---
> >  kernel/sched/fair.c     | 3 +++
> >  kernel/sched/features.h | 1 +
> >  2 files changed, 4 insertions(+)
> > 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3027,6 +3027,9 @@ void remove_entity_load_avg(struct sched
> >  
> >  static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
> > {
> > +	if (sched_feat(LB_TIP_AVG_HIGH) && cfs_rq->load.weight > cfs_rq->runnable_load_avg*2)
> > +		return cfs_rq->runnable_load_avg + min_t(unsigned long, NICE_0_LOAD,
> > +							 cfs_rq->load.weight/2);
> > 
> >  	return cfs_rq->runnable_load_avg;
> > }
> 
> cfs_rq->runnable_load_avg is for sure no greater than (in this case much
> less than, maybe 1/2 of) load.weight, whereas load_avg is not necessarily
> a rock in the gearbox that only impedes speeding up; it impedes slowing
> down too.

Yeah, just like everything else, it cuts both ways (which is why you can't
win the sched game).  If I can believe tbench, at tasks=cpus, reducing lag
increased utilization and reduced latency a wee bit, as did the reserve
thing once a booboo got fixed up.  Makes sense: robbing Peter to pay Paul
should work out better for Paul.
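(Aside: the features.h hunk of the diffstat isn't shown above.  As a
standalone illustration of what the fair.c hunk does, here's a hedged
sketch; NICE_0_LOAD=1024 and the sample weights are stand-in values for
illustration, not kernel code, and `tipped_load` is a hypothetical name.)

```python
NICE_0_LOAD = 1024  # assumed fixed-point load unit for this sketch

def tipped_load(weight, runnable_avg):
    """Sketch of the patched cfs_rq_runnable_load_avg() logic: when the
    instantaneous weight is more than double the decayed average, tip the
    reported load upward by up to NICE_0_LOAD, so a CPU whose average lags
    its real occupancy looks busier and fresh wakeups get spread."""
    if weight > runnable_avg * 2:
        return runnable_avg + min(NICE_0_LOAD, weight // 2)
    return runnable_avg

# Average lags far behind weight: report is tipped high.
print(tipped_load(4096, 1000))   # 1000 + min(1024, 2048) = 2024
# Average tracks weight: report unchanged.
print(tipped_load(2048, 1500))   # 1500
```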
(avg columns: throughput MB/sec, normalized to first block, max_latency ms,
normalized to first block)

NO_LB_TIP_AVG_HIGH
Throughput 27132.9 MB/sec  96 clients  96 procs  max_latency=7.656 ms
Throughput 28464.1 MB/sec  96 clients  96 procs  max_latency=9.905 ms
Throughput 25369.8 MB/sec  96 clients  96 procs  max_latency=7.192 ms
Throughput 25670.3 MB/sec  96 clients  96 procs  max_latency=5.874 ms
Throughput 29309.3 MB/sec  96 clients  96 procs  max_latency=1.331 ms
avg        27189          1.000                  6.391          1.000

NO_LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 24437.5 MB/sec  96 clients  96 procs  max_latency=1.837 ms
Throughput 29464.7 MB/sec  96 clients  96 procs  max_latency=1.594 ms
Throughput 28023.6 MB/sec  96 clients  96 procs  max_latency=1.494 ms
Throughput 28299.0 MB/sec  96 clients  96 procs  max_latency=10.404 ms
Throughput 29072.1 MB/sec  96 clients  96 procs  max_latency=5.575 ms
avg        27859          1.024                  4.180          0.654

LB_TIP_AVG_HIGH NO_IDLE_RESERVE
Throughput 29068.1 MB/sec  96 clients  96 procs  max_latency=5.599 ms
Throughput 26435.6 MB/sec  96 clients  96 procs  max_latency=3.703 ms
Throughput 23930.0 MB/sec  96 clients  96 procs  max_latency=7.742 ms
Throughput 29464.2 MB/sec  96 clients  96 procs  max_latency=1.549 ms
Throughput 24250.9 MB/sec  96 clients  96 procs  max_latency=1.518 ms
avg        26629          0.979                  4.022          0.629

LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 30340.1 MB/sec  96 clients  96 procs  max_latency=1.465 ms
Throughput 29042.9 MB/sec  96 clients  96 procs  max_latency=4.515 ms
Throughput 26718.7 MB/sec  96 clients  96 procs  max_latency=1.822 ms
Throughput 28694.4 MB/sec  96 clients  96 procs  max_latency=1.503 ms
Throughput 28918.2 MB/sec  96 clients  96 procs  max_latency=7.599 ms
avg        28742          1.057                  3.380          0.528

> But I really don't know what kind of load the references in
> select_task_rq() should be.  So maybe the real issue is that they are a
> mix, i.e., conflated balancing and just wanting an idle cpu?

Depends on the goal.  For both, load lagging reality means the high
frequency component is squelched, meaning less migration cost, but also
higher latency due to stacking.
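For what it's worth, the avg lines can be reproduced from the raw runs; a
small sketch using the first two blocks (numbers copied from above, first
block taken as the baseline for the normalized columns):

```python
def avg(xs):
    return sum(xs) / len(xs)

# NO_LB_TIP_AVG_HIGH (baseline)
base_tput = [27132.9, 28464.1, 25369.8, 25670.3, 29309.3]
base_lat  = [7.656, 9.905, 7.192, 5.874, 1.331]
# NO_LB_TIP_AVG_HIGH IDLE_RESERVE
resv_tput = [24437.5, 29464.7, 28023.6, 28299.0, 29072.1]
resv_lat  = [1.837, 1.594, 1.494, 10.404, 5.575]

print(avg(base_tput), avg(base_lat))      # ~27189 MB/sec, ~6.39 ms
print(avg(resv_tput) / avg(base_tput))    # ~1.02 -> throughput up slightly
print(avg(resv_lat) / avg(base_lat))      # ~0.65 -> max latency down
```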
It's a tradeoff where Chris' "latency is everything" benchmark, and
_maybe_ the real world load it's based upon, is on Peter's end of the
rob-Peter-to-pay-Paul transaction.  The benchmark says it definitely is;
the real world load may have already been fixed up by the
select_idle_sibling() rewrite.

	-Mike