From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
Date: Tue, 2 May 2017 10:19:05 +0200
Message-ID: <20170502081905.GA4626@worktop.programming.kicks-ass.net>
In-Reply-To: <20170501215604.GB19079@htj.duckdns.org>

On Mon, May 01, 2017 at 05:56:04PM -0400, Tejun Heo wrote:

> I still wonder what gcfs_rq se->load_avg.avg is good for tho?  It's
> nice to keep the value in line but is it actually used anywhere?

It's used to manage the boundary conditions of its parent cfs_rq's
sums. When we enqueue the group se onto the parent cfs_rq, we do:

  enqueue_entity()
    enqueue_entity_load_avg()
      // update runnable
      // add se.avg to cfs_rq.avg if 'migrated'

Now obviously group se's never actually migrate between CPUs, but they
still pass through that code. I think the 'migrated' condition also
triggers for new entities, for example: they 'migrate' into the system.

Similarly for dequeue: that should trigger when group entities get
destroyed; they 'migrate' out of the system.
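
Roughly, what that enqueue path boils down to in this era's fair.c
(a trimmed sketch, so take the exact details with a grain of salt):

  static inline void
  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
          struct sched_avg *sa = &se->avg;

          /* update runnable */
          cfs_rq->runnable_load_avg += sa->load_avg;
          cfs_rq->runnable_load_sum += sa->load_sum;

          /* add se.avg to cfs_rq.avg if 'migrated': no update on any rq yet */
          if (!sa->last_update_time) {
                  attach_entity_load_avg(cfs_rq, se);
                  update_tg_load_avg(cfs_rq, 0);
          }
  }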

But I get what you're saying. They're not overly useful as such.

> So, if changing gcfs_rq se->load_avg.avg to match the gcfs_rq's
> runnable_load_avg is icky, and I can see why that would be, we can
> simply introduce a separate channel of propagation ...

Can you have a play with something like the below? I suspect
'shares_runnable' might work for you here.

It's still 'icky', but no more so than calc_cfs_shares() was to begin
with. When many CPUs are involved the global sum (tg_weight) is still
dominated by avg_load and the division will have a downward bias, but
given the purpose that might not be too bad.
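
To make that bias concrete (made-up numbers, purely illustrative), the
computation is effectively:

  shares_i = tg_shares * load_i / (Sum_j contrib_j - contrib_i + load_i)

Take two CPUs, tg_shares = 1024, and each CPU contributing an avg load
of 512 of which only 256 is runnable. shares_runnable then gives each
CPU 1024 * 256 / (1024 - 512 + 256) = 341, so the two together get 682
rather than 1024; the sum errs low, which is the downward bias above.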

---
 kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276fcb62..7d1fb5f421bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2630,26 +2630,56 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 	cfs_rq->nr_running--;
 }
 
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->runnable_load_avg, which does not include the
+		 * blocked load.
+		 */
+		load = cfs_rq->runnable_load_avg;
+		break;
+
+	case shares_avg:
+		load = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->load.weight, which is its upper bound. This helps
+		 * ramp up the shares for small weight interactive tasks.
+		 */
+		load = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
-	/* Ensure tg_weight >= load */
+	/*
+	 * This ensures the sum is up-to-date for this CPU; for the other two
+	 * approximations it biases the sum towards their value, and in the
+	 * (near) UP case it ensures the division ends up <= 1.
+	 */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
 	tg_weight += load;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2665,15 +2695,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static inline long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
 {
 	return tg->shares;
 }
@@ -2715,7 +2741,7 @@ static void update_cfs_shares(struct sched_entity *se)
 	if (likely(se->load.weight == tg->shares))
 		return;
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
+	shares = calc_cfs_shares(cfs_rq, tg, shares_weight);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
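
Not part of the patch, just to illustrate the intended use: a caller on
the propagation path would ask for the runnable variant instead
(hypothetical helper, name made up):

  /* illustration only: estimate the group se's runnable weight from
   * its group cfs_rq's runnable load */
  static long group_runnable_contrib(struct cfs_rq *gcfs_rq)
  {
          return calc_cfs_shares(gcfs_rq, gcfs_rq->tg, shares_runnable);
  }

while update_cfs_shares() keeps passing shares_weight, so existing
behaviour is unchanged.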

Thread overview: 69+ messages
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
2017-05-03 18:00     ` Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra [this message]
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo
