From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S967251AbdDSQyQ (ORCPT ); Wed, 19 Apr 2017 12:54:16 -0400
Received: from mail-wm0-f43.google.com ([74.125.82.43]:38462 "EHLO
	mail-wm0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S966954AbdDSQyL (ORCPT );
	Wed, 19 Apr 2017 12:54:11 -0400
From: Vincent Guittot <vincent.guittot@linaro.org>
To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org
Cc: dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, yuyang.du@intel.com,
	pjt@google.com, bsegall@google.com,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: [PATCH v2] sched/cfs: make util/load_avg more stable
Date: Wed, 19 Apr 2017 18:54:04 +0200
Message-Id: <1492620844-30979-1-git-send-email-vincent.guittot@linaro.org>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1492619370-29246-1-git-send-email-vincent.guittot@linaro.org>
References: <1492619370-29246-1-git-send-email-vincent.guittot@linaro.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

In the current implementation of load/util_avg, we assume that the ongoing
time segment has fully elapsed, and util/load_sum is divided by LOAD_AVG_MAX,
even if part of the time segment still remains to run. As a consequence, this
remaining part is considered as idle time and generates unexpected variations
of util_avg of a busy CPU in the range ]1002..1024[ whereas util_avg should
stay at 1023.

In order to keep the metric stable, we should not consider the ongoing time
segment when computing load/util_avg but only the segments that have already
fully elapsed. However, not accounting for the current time segment adds
unwanted latency in the load/util_avg responsiveness, especially when the
time is scaled instead of the contribution.

Instead of waiting for the current time segment to have fully elapsed before
accounting for it in load/util_avg, we can already account for the elapsed
part but change the range used to compute load/util_avg accordingly.

At the very beginning of a new time segment, the past segments have been
decayed and the max value is LOAD_AVG_MAX*y. At the very end of the current
time segment, the max value becomes 1024(us) + LOAD_AVG_MAX*y, which is equal
to LOAD_AVG_MAX. In fact, the max value is sa->period_contrib + LOAD_AVG_MAX*y
at any time in the time segment.

Taking advantage of the fact that LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024, the
range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].

As the elapsed part is already accounted in load/util_sum, we update the max
value according to the current position in the time segment instead of
removing its contribution.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
Fold both patches into one

 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f83a35..c3b8f0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3017,12 +3017,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	/*
 	 * Step 2: update *_avg.
 	 */
-	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
 	if (cfs_rq) {
 		cfs_rq->runnable_load_avg =
-			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+			div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib);
 	}
-	sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+	sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib);
 
 	return 1;
 }
-- 
2.7.4
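
For illustration (not part of the patch above): a minimal standalone userspace
sketch of the arithmetic in the changelog, assuming the kernel's PELT constants
(y^32 = 0.5, LOAD_AVG_MAX = 47742) and an idealized CPU that has been 100% busy
at full capacity, ignoring the integer rounding of the real implementation. It
checks the identity LOAD_AVG_MAX*y == LOAD_AVG_MAX - 1024 and compares the old
divisor (LOAD_AVG_MAX) with the position-dependent one
(LOAD_AVG_MAX - 1024 + sa->period_contrib).

/*
 * Standalone sketch, not kernel code: compares the old and new util_avg
 * divisors over one 1024us time segment for an idealized always-busy CPU.
 */
#include <stdio.h>
#include <math.h>

#define LOAD_AVG_MAX	47742	/* 1024 * (1 + y + y^2 + ...), with y^32 = 0.5 */

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* decay factor per 1024us segment */
	unsigned int contrib;

	/* Geometric series: LOAD_AVG_MAX*y is LOAD_AVG_MAX - 1024 (up to rounding) */
	printf("LOAD_AVG_MAX*y = %.1f, LOAD_AVG_MAX - 1024 = %d\n\n",
	       LOAD_AVG_MAX * y, LOAD_AVG_MAX - 1024);

	/*
	 * "contrib" us into the current segment, a fully busy CPU has
	 * util_sum ~= (LOAD_AVG_MAX - 1024 + contrib) * 1024: the already
	 * decayed past segments contribute LOAD_AVG_MAX*y and the current
	 * segment contributes contrib, each microsecond weighted by 1024.
	 */
	for (contrib = 0; contrib <= 1024; contrib += 256) {
		double util_sum = (LOAD_AVG_MAX - 1024 + contrib) * 1024.0;

		printf("period_contrib=%4u  old util_avg=%6.1f  new util_avg=%6.1f\n",
		       contrib,
		       util_sum / LOAD_AVG_MAX,
		       util_sum / (LOAD_AVG_MAX - 1024 + contrib));
	}
	return 0;
}

With the old divisor, the printed util_avg drifts from about 1002 right after a
segment rollover up to 1024 at the end of the segment (the ]1002..1024[ range
mentioned in the changelog); with the position-dependent divisor it stays
constant across the whole segment.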