From: Paul Turner
Date: Wed, 27 Jun 2012 20:36:24 -0700
Subject: sched: per-entity load-tracking
To: linux-kernel@vger.kernel.org

Argh, regenerated mbox and forgot to fix the subject.  This should have
been: sched: per-entity load-tracking.

I won't bother re-posting, but if you find yourself reading this email,
look to the much more interesting "Series short description".

Thanks,

- Paul

On Wed, Jun 27, 2012 at 7:24 PM, Paul Turner wrote:
> Hi all,
>
> Please find attached the latest version for CFS load-tracking.
>
> It implements load-tracking on a per-sched_entity (currently
> SCHED_NORMAL, but could be extended to RT as well) basis.  This
> results in a bottom-up load-computation in which entities contribute
> to their parents' load, as opposed to the current top-down approach in
> which the parent averages its children.  In particular this allows us
> to correctly migrate load with its accompanying entities, and provides
> the necessary inputs for intelligent load-balancing and
> power-management.
>
> Modulo review I think this is now close to a mergeable state.
>
> Notable updates:
> ----------------
> - A few stability issues fixed; in particular on systems with tasks
>   having idle periods spanning several days.  Minor code cleanups.
> - 32-bit performance improved (now reported faster[*] on 32-bit ARM
>   than without, presumably due to better load-balance or reduced
>   overhead.  Thanks to Vincent Guittot and others for looking at this)
> - All configs delinted
> - Config dependencies separated so that load-tracking can be used
>   without FAIR_GROUP_SCHED, so that we may generally use it for power
>   governors and per-entity load-balance (in progress).
>
> Thanks to Peter, Vincent, Morten, Nikunj, and others for testing and
> comments.
>
> Rewinding some context since I've been buried with internal work and
> it's been a while since my last posting:
>
> Background:
> ----------
> We currently track the load average at the parenting cfs_rq level.  We
> divide execution into a series of consecutive windows, w_i.  Within
> each we track:
>
>   \Sum load_i * t_i  where \Sum t_i = w and each load_i is a disjoint
>   load level.
>
> The load average is then \Sum w_j / 2^j.
>
> This works reasonably well, but there are a few problems:
>
> 1) Since decay occurs at window boundaries there are 'skews':
>
> At a window boundary the 'foreground' time has a bias against the
> time immediately preceding it (as a result of the folding division),
> e.g.
>
>   __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).
>
> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side
> of the window boundary your load lands on.
>
> 2) Everything within a window 'w' is accounted equally; we only fold
> at the boundaries.  This actually means you can't set 'w' large enough
> to really have meaningful coverage of the sched period without
> throwing decay out the window.  But then with w/2 << sched_period
> (currently w/2=5ms) the average ends up having fairly drastic swings,
> even with stable loads.
>
> (Note: Before we even looked at migrating to per-entity tracking we
> evaluated deferring the foreground window's inclusion in the
> considered average until it was "complete"; this represented a
> measurable improvement in stability under predictable load.)
>
> 3) Since the load sum average is per-cfs_rq and not per-entity, when a
> task entity migrates we lose its contribution to the load-sum and
> effectively double count it while its former sum decays.
>
> New approach:
> -------------
> Instead of tracking load on a per-cfs_rq basis we do it on a
> per-sched_entity basis.  The load sum for a cfs_rq is then just the
> sum of its children's load averages.  The immediately obvious nice
> property is that when we migrate thread A from cpu1-->cpu2, we carry
> its load with it, leaving the global load sum unmodified.
>
> The 'windows' above are replaced with more fine-grained tracking, that
> is (for an entity j):
>
>   runnable_j = \Sum u_i * y^i ,  load_avg_j = runnable_j * weight_j [*]
>
> Where u_i is the usage in the i'th most recent ~1ms period, and y is
> chosen such that y^k = 1/2 for k ~= sched_period (we currently choose
> k=32).  This means that load tracked one sched_period ago contributes
> about 50% as much as current execution.
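>
> To make this concrete, here is a minimal, self-contained sketch of the
> accumulation.  This is illustrative only: the names, types, and
> fixed-point constant below are simplified stand-ins rather than the
> kernel code, and the actual series replaces the decay loop with a
> precomputed table (cf. the "make __update_entity_runnable_avg() fast"
> patch):
>
>   #include <stdint.h>
>
>   #define LOAD_AVG_PERIOD 32  /* y^LOAD_AVG_PERIOD = 1/2 */
>
>   /* y ~= 0.5^(1/32), expressed in 32.32 fixed point */
>   static const uint64_t Y_FIXED = 4202935003ULL;
>
>   struct sched_avg {
>           uint64_t runnable_sum;  /* decayed steps spent runnable */
>           uint64_t period_sum;    /* decayed steps observed total */
>   };
>
>   /* Decay a sum by y^n: halve once per 32 steps, then multiply by y
>    * for the remainder. */
>   static uint64_t decay_sum(uint64_t sum, unsigned int n)
>   {
>           if (n >= 64 * LOAD_AVG_PERIOD) /* very long idle: fully decayed */
>                   return 0;
>           sum >>= n / LOAD_AVG_PERIOD;
>           for (n %= LOAD_AVG_PERIOD; n; n--)
>                   sum = (sum * Y_FIXED) >> 32;
>           return sum;
>   }
>
>   /* Fold 'steps' newly elapsed ~1ms steps, 'runnable' of which were
>    * spent runnable, into an entity's sums.  (For simplicity, new
>    * usage is credited entirely to the most recent step.)  The
>    * entity's load_avg is then weight * runnable_sum / period_sum. */
>   static void update_entity_avg(struct sched_avg *sa,
>                                 unsigned int steps,
>                                 unsigned int runnable)
>   {
>           sa->runnable_sum = decay_sum(sa->runnable_sum, steps) + runnable;
>           sa->period_sum   = decay_sum(sa->period_sum, steps) + steps;
>   }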
>
> Now, the real challenge in tracking at an entity basis is handling
> blocked entities.  Obviously runnable entities will be updated
> naturally as we iterate through them, but there could be O(many)
> blocked entities, so we can't afford to iterate through them and
> update their tracking.
>
> That our decay for a unit period is exponential introduces a
> particularly nice property here:
>
> We can separate the contributed load on a cfs_rq into blocked and
> runnable.  Runnable load is just the sum of all load_avg_j above,
> maintained by the entities themselves as they run and self-update
> (when they update their load_avg they update the cumulative sum also).
>
> Blocked load then looks like (for each blocked entity j):
>
>   load_avg_j = weight_j * \Sum u[j]_n * y^n
>
> To account an entirely idle period we then only need to multiply the
> whole blocked sum by y.
>
> This ends up being orders of magnitude more accurate than the current
> tracking scheme, even with the old shares_window massively dilated to
> better capture a stable load-state.  It's also obviously stable under
> migration.
>
> As referenced above this also allows us to potentially improve
> decisions within the load-balancer, for both distribution and
> power-management.
>
> Example: consider 1x80% task and 2x40% tasks on a 2-core machine.
> It's currently a bit of a gamble as to whether you get an {AB, B} or
> {A, BB} split since they have equal weight (assume 1024).  With
> per-task tracking we can actually consider them at their contributed
> weight and see a stable ~{800, {400, 400}} load-split.  Likewise
> within balance_tasks we can consider the load migrated to be that
> actually contributed.  We have some patches looking at this and will
> post them as they mature.
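>
> To spell out that arithmetic (a toy sketch with hypothetical names,
> not kernel API): in steady state an entity's contributed load
> converges to weight * duty_cycle, which is what makes the second
> split stable:
>
>   #include <stdio.h>
>
>   struct task_load {
>           const char *name;
>           unsigned long weight;   /* nice-0 weight == 1024 */
>           double duty;            /* fraction of time runnable */
>   };
>
>   /* weight * runnable_sum / period_sum converges to weight * duty */
>   static unsigned long contributed(const struct task_load *t)
>   {
>           return (unsigned long)(t->weight * t->duty);
>   }
>
>   int main(void)
>   {
>           struct task_load a  = { "A",  1024, 0.80 };
>           struct task_load b1 = { "B1", 1024, 0.40 };
>           struct task_load b2 = { "B2", 1024, 0.40 };
>
>           /* By raw weight, {AB, B} and {A, BB} both look like
>            * 2048 vs 1024 -- a coin flip.  By contributed load,
>            * {A, BB} is ~819 vs ~818 while {AB, B} is ~1228 vs
>            * ~409, so the balancer can prefer the stable split. */
>           printf("A alone:  %lu\n", contributed(&a));
>           printf("B1 + B2:  %lu\n",
>                  contributed(&b1) + contributed(&b2));
>           return 0;
>   }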
>
> Thanks,
>
> - Paul
>
> ---
>
> Ben Segall (1):
>       sched: maintain per-rq runnable averages
>
> Paul Turner (15):
>       sched: track the runnable average on a per-task entity basis
>       sched: aggregate load contributed by task entities on parenting cfs_rq
>       sched: maintain the load contribution of blocked entities
>       sched: add an rq migration call-back to sched_class
>       sched: account for blocked load waking back up
>       sched: aggregate total task_group load
>       sched: compute load contribution by a group entity
>       sched: normalize tg load contributions against runnable time
>       sched: maintain runnable averages across throttled periods
>       sched: replace update_shares weight distribution with per-entity computation
>       sched: refactor update_shares_cpu() -> update_blocked_avgs()
>       sched: update_cfs_shares at period edge
>       sched: make __update_entity_runnable_avg() fast
>       sched: implement usage tracking
>       sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
>
>
>  include/linux/sched.h |   18 +
>  kernel/sched/core.c   |    5
>  kernel/sched/debug.c  |   39 ++
>  kernel/sched/fair.c   |  793 +++++++++++++++++++++++++++++++++++++++----------
>  kernel/sched/sched.h  |   60 ++--
>  5 files changed, 720 insertions(+), 195 deletions(-)
>
> --
> Signature