Date: Wed, 7 Sep 2011 16:30:47 +0530
From: Srivatsa Vaddagiri
Reply-To: vatsa@linux.vnet.ibm.com
To: Paul Turner
Cc: Kamalesh Babulal, Vladimir Davydov, linux-kernel@vger.kernel.org,
 Peter Zijlstra, Bharata B Rao, Dhaval Giani, Vaidyanathan Srinivasan,
 Ingo Molnar, Pavel Emelianov
Subject: Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
Message-ID: <20110907110046.GC7768@linux.vnet.ibm.com>
References: <20110503092846.022272244@google.com>
 <20110607154542.GA2991@linux.vnet.ibm.com>
 <1307529966.4928.8.camel@dhcp-10-30-22-158.sw.ru>
 <20110608163234.GA23031@linux.vnet.ibm.com>
 <20110610181719.GA30330@linux.vnet.ibm.com>
 <20110615053716.GA390@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Tue, Jun 21, 2011 at 12:48:17PM -0700, Paul Turner wrote:
> Hi Kamalesh,
>
> Can you see what things look like under v7?
>
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
>
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.

which is quite a complex problem to solve! I am still surprised that we
can't handle 32 CPU hogs on a 16-CPU system very easily. The tasks seem
to hop around madly rather than settling down as 2 tasks/CPU. Kamalesh,
can you post the exact count of migrations we saw on latest tip over a
20-second window?
Anyway, here's a "hack" to minimize the idle time induced by
load-balance issues. It brings down idle time from 7+% to ~0%. I am not
too happy about this, but I don't see any simpler solution that solves
the idle-time issue completely (other than making the load-balancer
completely fair!).

--

Fix excessive idle time reported when cgroups are capped.

The patch introduces the notion of "steal" (or "grace") time, which is
the surplus time/bandwidth each cgroup is allowed to consume, subject to
a maximum steal time (sched_cfs_max_steal_time_us). Cgroups are allowed
this "steal" or "grace" time when the lone task running on a CPU is
about to be throttled.

Signed-off-by: Srivatsa Vaddagiri

Index: linux-3.1-rc4/include/linux/sched.h
===================================================================
--- linux-3.1-rc4.orig/include/linux/sched.h	2011-09-07 14:57:49.529602231 +0800
+++ linux-3.1-rc4/include/linux/sched.h	2011-09-07 14:58:49.952418107 +0800
@@ -2042,6 +2042,7 @@
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_max_steal_time;
 #endif
 
 #ifdef CONFIG_RT_MUTEXES
Index: linux-3.1-rc4/kernel/sched.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched.c	2011-09-07 14:57:49.532854588 +0800
+++ linux-3.1-rc4/kernel/sched.c	2011-09-07 14:58:49.955453578 +0800
@@ -254,7 +254,7 @@
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota, runtime;
+	u64 quota, runtime, steal_time;
 	s64 hierarchal_quota;
 	u64 runtime_expires;
Index: linux-3.1-rc4/kernel/sched_fair.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched_fair.c	2011-09-07 14:57:49.533644483 +0800
+++ linux-3.1-rc4/kernel/sched_fair.c	2011-09-07 15:16:09.338824132 +0800
@@ -101,6 +101,18 @@
  * default: 5 msec, units: microseconds
  */
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * "Surplus" quota given to a cgroup to prevent a CPU from becoming idle.
+ *
+ * This would have been unnecessary had the load-balancer been "ideal" in
+ * loading tasks uniformly across all CPUs, which would have allowed
+ * all cgroups to claim their "quota" completely. In the absence of an
+ * "ideal" load-balancer, cgroups are unable to utilize their quota, leading
+ * to unexpected idle time. This knob allows a CPU to keep running a
+ * task beyond its throttled point before becoming idle.
+ */
+unsigned int sysctl_sched_cfs_max_steal_time = 100000UL;
 #endif
 
 static const struct sched_class fair_sched_class;
@@ -1288,6 +1300,11 @@
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline u64 sched_cfs_max_steal_time(void)
+{
+	return (u64)sysctl_sched_cfs_max_steal_time * NSEC_PER_USEC;
+}
+
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -1303,6 +1320,7 @@
 		return;
 
 	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->steal_time = 0;
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
@@ -1337,6 +1355,12 @@
 			cfs_b->runtime -= amount;
 			cfs_b->idle = 0;
 		}
+
+		if (!amount && rq_of(cfs_rq)->nr_running == 1 &&
+		    cfs_b->steal_time < sched_cfs_max_steal_time()) {
+			amount = min_amount;
+			cfs_b->steal_time += amount;
+		}
 	}
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
@@ -1378,7 +1402,8 @@
 	 * whether the global deadline has advanced.
 	 */
-	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0 ||
+	    (rq_of(cfs_rq)->nr_running == 1 && cfs_b->steal_time < sched_cfs_max_steal_time())) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
Index: linux-3.1-rc4/kernel/sysctl.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sysctl.c	2011-09-07 14:57:49.534454409 +0800
+++ linux-3.1-rc4/kernel/sysctl.c	2011-09-07 14:58:49.958452846 +0800
@@ -388,6 +388,14 @@
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "sched_cfs_max_steal_time_us",
+		.data		= &sysctl_sched_cfs_max_steal_time,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
 #endif
 #ifdef CONFIG_PROVE_LOCKING
 	{