From: Peter Zijlstra <peterz@infradead.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Tejun Heo <tj@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
Date: Mon, 1 May 2017 16:17:33 +0200
Message-ID: <20170501141733.shphf35psasefraj@hirez.programming.kicks-ass.net>
In-Reply-To: <20170426165123.GA17921@linaro.org>

Hi,

sorry for being late to the party, trying to catch up...

On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
> Ok, I see your point and agree that there is an issue when propagating
> load_avg of a task group which has tasks with lower weight than the share,
> but your proposal has an issue because it uses runnable_load_avg instead
> of load_avg, and this makes the propagation of load_avg incorrect; something
> like the below, which keeps using load_avg, solves the problem:
> 
> +	if (gcfs_rq->load.weight) {
> +		long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
> +
> +		load = min(gcfs_rq->avg.load_avg *
> +			   shares / scale_load_down(gcfs_rq->load.weight), shares);
> 


So this here does:

  ( tg->load_avg = \Sum cfs_rq->load_avg )

  load      = cfs_rq->load.weight

  tg_weight = tg->load_avg - cfs_rq->contrib + load


           tg->shares * load
  shares = -----------------
               tg_weight


                        cfs_rq->load_avg
  avg_shares = shares * ----------------
                              load

	       tg->shares * cfs_rq->load_avg
             = -----------------------------
                      tg_weight


  ( se->load.weight = shares )

  se->load_avg = min(shares, avg_shares);


So while shares (and thus se->load.weight) is an upper bound (because it
uses cfs_rq->load.weight, see calc_cfs_shares()), avg_shares is supposedly
a more accurate representation based on our PELT averages.

This looks OK; and I agree with Vincent that we should use
cfs_rq->avg.load_avg, not cfs_rq->runnable_load_avg, since tg->load_avg
is a sum of the former, not the latter.

Also, arguably, computing the above avg_shares directly (using the second
form) might be more precise, but I doubt it makes much of a difference.
However, since we take the min(), we should at least clamp the result
against MIN_SHARES again.

Furthermore, it appears to me we want a different tg_weight value for
the avg_shares, something like:

  tg_weight = tg->load_avg - cfs_rq->contrib + cfs_rq->avg.load_avg

to better match the numerator's units; otherwise it has a tendency to push
avg_shares down further than it needs to be.
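
Continuing the made-up numbers from above: with the weight-based
denominator we got avg_shares = 1024 * 512 / 2048 = 256, whereas with

  tg_weight = 1536 - 512 + 512 = 1536

we get avg_shares = 1024 * 512 / 1536 ~= 341, i.e. the third of the
group's PELT load that this CPU actually carries (512 out of 1536),
rather than a value dragged down by mixing in the instantaneous weight.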


(All assuming it actually works, of course... compile-tested only.)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,7 +2634,9 @@ account_entity_dequeue(struct cfs_rq *cf
 # ifdef CONFIG_SMP
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+
+	tg_shares = READ_ONCE(tg->shares);
 
 	/*
 	 * This really should be: cfs_rq->avg.load_avg, but instead we use
@@ -2665,12 +2667,7 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
 static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
@@ -3075,37 +3072,69 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * If there is load in our group cfs_rq (@gcfs_rq), then update its
+	 * corresponding group se (@se) load value to reflect any change in
+	 * weights.
+	 *
+	 * Also see effective_load(), where we do something similar.
+	 *
+	 *   ( tg->load_avg = \Sum cfs_rq->load_avg )
+	 *   
+	 *   tg_weight = tg->load_avg - cfs_rq->contrib
+	 *   
+	 *   
+	 *            tg->shares * cfs_rq->weight
+	 *   shares = ---------------------------
+	 *            tg_weight  + cfs_rq->weight
+	 *   
+	 *   
+	 *		  tg->shares * cfs_rq->load_avg
+	 *   avg_shares = -----------------------------
+	 *                tg_weight  + cfs_rq->load_avg
+	 *   
+	 *   
+	 *   ( se->load.weight = shares )
+	 *   
+	 *   se->load_avg = min(shares, avg_shares);
+	 *
+	 * Where @shares is an upper bound to @avg_shares, see the comments
+	 * in calc_cfs_shares().
 	 */
-	if (load) {
-		long tg_load;
-
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
+	if (gcfs_rq->load.weight) {
+		long tg_weight1, tg_weight2, tg_shares, shares, avg_shares;
+		struct task_group *tg = gcfs_rq->tg;
 
 		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
+		 * This open-codes calc_cfs_shares(), in order to ensure
+		 * we use consistent values for @shares and @avg_shares,
+		 * as well as make sure we clip the result properly.
 		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+
+		tg_shares = READ_ONCE(tg->shares);
+
+		load = scale_load_down(gcfs_rq->load.weight);
+
+		tg_weight1 = atomic_long_read(&tg->load_avg);
+	tg_weight2 = (tg_weight1 -= gcfs_rq->tg_load_avg_contrib);
+		tg_weight1 += load;
+
+		shares = tg_shares * load;
+		if (tg_weight1)
+			shares /= tg_weight1;
+
+
+		load = READ_ONCE(gcfs_rq->avg.load_avg);
+
+		tg_weight2 += load;
+
+		avg_shares = tg_shares * load;
+		if (tg_weight2)
+			avg_shares /= tg_weight2;
+
+		load = clamp_t(long, min(shares, avg_shares), MIN_SHARES, tg_shares);
 	}
 
 	delta = load - se->avg.load_avg;
