* [PATCH v7 00/11] track CPU utilization
@ 2018-06-28 15:45 Vincent Guittot
  2018-06-28 15:45 ` [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file Vincent Guittot
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot

This patchset initially tracked only the utilization of the RT rq. During
the OSPM summit, we discussed extending it in order to get an estimate of
the utilization of the whole CPU.

- Patch 1 moves the pelt code into a dedicated file and removes some blank
  lines
  
- Patches 2-3 add utilization tracking for rt_rq.

When both cfs and rt tasks compete to run on a CPU, we can see frequency
drops with the schedutil governor. In that case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the remaining part
that is not used by rt tasks. We should monitor the stolen utilization and
take it into account when selecting the OPP. This patchset doesn't change
the OPP selection policy for RT tasks but only for CFS tasks.
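
To illustrate the idea (this is only a sketch, not code from the patches;
the actual change is in sugov_aggregate_util(), see patch 03), the frequency
request we aim for can be summarized as:

/* utilization values and max capacity in the same unit (e.g. 1024) */
static unsigned long requested_util(unsigned long util_cfs,
				    unsigned long util_rt,
				    unsigned long max)
{
	/* account for the time stolen by rt instead of ignoring it */
	unsigned long util = util_cfs + util_rt;

	return util < max ? util : max;
}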

An rt-app use case which creates an always-running cfs thread and an rt
thread that wakes up periodically, with both threads pinned on the same CPU,
shows a lot of frequency switches of the CPU whereas the CPU never goes idle
during the test. I can share the json file that I used for the test if
someone is interested.

For a 15 second long test on a hikey 6220 (octo-core Cortex-A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app with a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
    total time:                          15.0009s
    total number of events:              4903
    total time taken by event execution: 14.9972
    per-request statistics:
         min:                                  1.23ms
         avg:                                  3.06ms
         max:                                 13.16ms
         approx.  95 percentile:              12.73ms

Threads fairness:
    events (avg/stddev):           4903.0000/0.00
    execution time (avg/stddev):   14.9972/0.00

- With patchset:
Test execution summary:
    total time:                          15.0014s
    total number of events:              7694
    total time taken by event execution: 14.9979
    per-request statistics:
         min:                                  1.23ms
         avg:                                  1.95ms
         max:                                 10.49ms
         approx.  95 percentile:              10.39ms

Threads fairness:
    events (avg/stddev):           7694.0000/0.00
    execution time (avg/stddev):   14.9979/0.00

The performance improvement is about 57% for this use case (4903 -> 7694
completed events).

- Patches 4-5 add utilization tracking for dl_rq in order to solve a similar
  problem as with rt_rq. Nevertheless, we keep using the dl bandwidth as the
  default level of requirement for dl tasks. The dl utilization is used to
  check that the CPU is not overloaded, which is not always reflected when
  using the dl bandwidth.

- Patches 6-7 add utilization tracking for interrupt and use it when
  selecting the OPP.
  A test with iperf on hikey 6220 gives: 
    w/o patchset	    w/ patchset
    Tx 276 Mbits/sec        304 Mbits/sec +10%
    Rx 299 Mbits/sec        328 Mbits/sec +09%
    
    8 iterations of iperf -c server_address -r -t 5
    stdev is lower than 1%
    Only the WFI idle state is enabled (shallowest arm idle state)

- Patch 8 merges sugov_aggregate_util and sugov_get_util as proposed by Peter

- Patch 9 uses the rt, dl and interrupt utilization in scale_rt_capacity()
  and removes the use of sched_rt_avg_update.

- Patch 10 removes the unused sched_avg_update code

- Patch 11 removes the unused sched_time_avg_ms

Changes since v6:
- add more comments about the load tracking metrics
- merge sugov_aggregate_util and sugov_get_util

Changes since v4:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Changes since v3:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Changes since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Changes since v1:
- Only a rebase. I have addressed the comments on the previous version in
  patch 1/2


Vincent Guittot (11):
  sched/pelt: Move pelt related code in a dedicated file
  sched/rt: add rt_rq utilization tracking
  cpufreq/schedutil: use rt utilization tracking
  sched/dl: add dl_rq utilization tracking
  cpufreq/schedutil: use dl utilization tracking
  sched/irq: add irq utilization tracking
  cpufreq/schedutil: take into account interrupt
  sched: schedutil: remove sugov_aggregate_util()
  sched: use pelt for scale_rt_capacity()
  sched: remove rt_avg code
  proc/sched: remove unused sched_time_avg_ms

 include/linux/sched/sysctl.h     |   1 -
 kernel/sched/Makefile            |   2 +-
 kernel/sched/core.c              |  38 +---
 kernel/sched/cpufreq_schedutil.c |  65 ++++---
 kernel/sched/deadline.c          |   8 +-
 kernel/sched/fair.c              | 403 +++++----------------------------------
 kernel/sched/pelt.c              | 399 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h              |  72 +++++++
 kernel/sched/rt.c                |  15 +-
 kernel/sched/sched.h             |  68 +++++--
 kernel/sysctl.c                  |   8 -
 11 files changed, 632 insertions(+), 447 deletions(-)
 create mode 100644 kernel/sched/pelt.c
 create mode 100644 kernel/sched/pelt.h

-- 
2.7.4



* [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 23:26   ` [tip:sched/core] sched/pelt: Move PELT " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 02/11] sched/rt: add rt_rq utilization tracking Vincent Guittot
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization from cfs tasks and make them look lighter than they are.
As we want to use the same load tracking mechanism for both and prevent
useless dependencies between cfs and rt code, the pelt code is moved into
a dedicated file.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/Makefile |   2 +-
 kernel/sched/fair.c   | 333 +-------------------------------------------------
 kernel/sched/pelt.c   | 311 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h   |  43 +++++++
 kernel/sched/sched.h  |  19 +++
 5 files changed, 375 insertions(+), 333 deletions(-)
 create mode 100644 kernel/sched/pelt.c
 create mode 100644 kernel/sched/pelt.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b3..7fe1834 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4cc1441..bdab9ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -255,9 +255,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
 	return cfs_rq->rq;
 }
 
-/* An entity is a task if it doesn't "own" a runqueue */
-#define entity_is_task(se)	(!se->my_q)
-
 static inline struct task_struct *task_of(struct sched_entity *se)
 {
 	SCHED_WARN_ON(!entity_is_task(se));
@@ -419,7 +416,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
 	return container_of(cfs_rq, struct rq, cfs);
 }
 
-#define entity_is_task(se)	1
 
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
@@ -692,7 +688,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-
+#include "pelt.h"
 #include "sched-pelt.h"
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
@@ -2749,19 +2745,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 } while (0)
 
 #ifdef CONFIG_SMP
-/*
- * XXX we want to get rid of these helpers and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
-	return scale_load_down(se->load.weight);
-}
-
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
-
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -3062,314 +3045,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
 }
 
 #ifdef CONFIG_SMP
-/*
- * Approximate:
- *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
- */
-static u64 decay_load(u64 val, u64 n)
-{
-	unsigned int local_n;
-
-	if (unlikely(n > LOAD_AVG_PERIOD * 63))
-		return 0;
-
-	/* after bounds checking we can collapse to 32-bit */
-	local_n = n;
-
-	/*
-	 * As y^PERIOD = 1/2, we can combine
-	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
-	 * With a look-up table which covers y^n (n<PERIOD)
-	 *
-	 * To achieve constant time decay_load.
-	 */
-	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
-		val >>= local_n / LOAD_AVG_PERIOD;
-		local_n %= LOAD_AVG_PERIOD;
-	}
-
-	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
-	return val;
-}
-
-static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
-{
-	u32 c1, c2, c3 = d3; /* y^0 == 1 */
-
-	/*
-	 * c1 = d1 y^p
-	 */
-	c1 = decay_load((u64)d1, periods);
-
-	/*
-	 *            p-1
-	 * c2 = 1024 \Sum y^n
-	 *            n=1
-	 *
-	 *              inf        inf
-	 *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
-	 *              n=0        n=p
-	 */
-	c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
-
-	return c1 + c2 + c3;
-}
-
-/*
- * Accumulate the three separate parts of the sum; d1 the remainder
- * of the last (incomplete) period, d2 the span of full periods and d3
- * the remainder of the (incomplete) current period.
- *
- *           d1          d2           d3
- *           ^           ^            ^
- *           |           |            |
- *         |<->|<----------------->|<--->|
- * ... |---x---|------| ... |------|-----x (now)
- *
- *                           p-1
- * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
- *                           n=1
- *
- *    = u y^p +					(Step 1)
- *
- *                     p-1
- *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
- *                     n=1
- */
-static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
-{
-	unsigned long scale_freq, scale_cpu;
-	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
-	u64 periods;
-
-	scale_freq = arch_scale_freq_capacity(cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
-	delta += sa->period_contrib;
-	periods = delta / 1024; /* A period is 1024us (~1ms) */
-
-	/*
-	 * Step 1: decay old *_sum if we crossed period boundaries.
-	 */
-	if (periods) {
-		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
-		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
-
-		/*
-		 * Step 2
-		 */
-		delta %= 1024;
-		contrib = __accumulate_pelt_segments(periods,
-				1024 - sa->period_contrib, delta);
-	}
-	sa->period_contrib = delta;
-
-	contrib = cap_scale(contrib, scale_freq);
-	if (load)
-		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
-	if (running)
-		sa->util_sum += contrib * scale_cpu;
-
-	return periods;
-}
-
-/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series.  To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
- *
- * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
- *      p0            p1           p2
- *     (now)       (~1ms ago)  (~2ms ago)
- *
- * Let u_i denote the fraction of p_i that the entity was runnable.
- *
- * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
- *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
- *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- *   y^32 = 0.5
- *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
- *
- * When a period "rolls over" and we have new u_0`, multiplying the previous
- * sum again by y is sufficient to update:
- *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
- */
-static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
-{
-	u64 delta;
-
-	delta = now - sa->last_update_time;
-	/*
-	 * This should only happen when time goes backwards, which it
-	 * unfortunately does during sched clock init when we swap over to TSC.
-	 */
-	if ((s64)delta < 0) {
-		sa->last_update_time = now;
-		return 0;
-	}
-
-	/*
-	 * Use 1024ns as the unit of measurement since it's a reasonable
-	 * approximation of 1us and fast to compute.
-	 */
-	delta >>= 10;
-	if (!delta)
-		return 0;
-
-	sa->last_update_time += delta << 10;
-
-	/*
-	 * running is a subset of runnable (weight) so running can't be set if
-	 * runnable is clear. But there are some corner cases where the current
-	 * se has been already dequeued but cfs_rq->curr still points to it.
-	 * This means that weight will be 0 but not running for a sched_entity
-	 * but also for a cfs_rq if the latter becomes idle. As an example,
-	 * this happens during idle_balance() which calls
-	 * update_blocked_averages()
-	 */
-	if (!load)
-		runnable = running = 0;
-
-	/*
-	 * Now we know we crossed measurement unit boundaries. The *_avg
-	 * accrues by two steps:
-	 *
-	 * Step 1: accumulate *_sum since last_update_time. If we haven't
-	 * crossed period boundaries, finish.
-	 */
-	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
-		return 0;
-
-	return 1;
-}
-
-static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
-{
-	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
-
-	/*
-	 * Step 2: update *_avg.
-	 */
-	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
-	sa->util_avg = sa->util_sum / divider;
-}
-
-/*
- * When a task is dequeued, its estimated utilization should not be update if
- * its util_avg has not been updated at least once.
- * This flag is used to synchronize util_avg updates with util_est updates.
- * We map this information into the LSB bit of the utilization saved at
- * dequeue time (i.e. util_est.dequeued).
- */
-#define UTIL_AVG_UNCHANGED 0x1
-
-static inline void cfs_se_util_change(struct sched_avg *avg)
-{
-	unsigned int enqueued;
-
-	if (!sched_feat(UTIL_EST))
-		return;
-
-	/* Avoid store if the flag has been already set */
-	enqueued = avg->util_est.enqueued;
-	if (!(enqueued & UTIL_AVG_UNCHANGED))
-		return;
-
-	/* Reset flag to report util_avg has been updated */
-	enqueued &= ~UTIL_AVG_UNCHANGED;
-	WRITE_ONCE(avg->util_est.enqueued, enqueued);
-}
-
-/*
- * sched_entity:
- *
- *   task:
- *     se_runnable() == se_weight()
- *
- *   group: [ see update_cfs_group() ]
- *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
- *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
- * cfq_rs:
- *
- *   load_sum = \Sum se_weight(se) * se->avg.load_sum
- *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
- */
-
-static int
-__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
-{
-	if (entity_is_task(se))
-		se->runnable_weight = se->load.weight;
-
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
-		return 1;
-	}
-
-	return 0;
-}
-
-static int
-__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	if (entity_is_task(se))
-		se->runnable_weight = se->load.weight;
-
-	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
-
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
-		cfs_se_util_change(&se->avg);
-		return 1;
-	}
-
-	return 0;
-}
-
-static int
-__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
-{
-	if (___update_load_sum(now, cpu, &cfs_rq->avg,
-				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
-				cfs_rq->curr != NULL)) {
-
-		___update_load_avg(&cfs_rq->avg, 1, 1);
-		return 1;
-	}
-
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg
@@ -4045,12 +3720,6 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 
 #else /* CONFIG_SMP */
 
-static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
-{
-	return 0;
-}
-
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
new file mode 100644
index 0000000..e6ecbb2
--- /dev/null
+++ b/kernel/sched/pelt.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per Entity Load Tracking
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ *  Interactivity improvements by Mike Galbraith
+ *  (C) 2007 Mike Galbraith <efault@gmx.de>
+ *
+ *  Various enhancements by Dmitry Adamushko.
+ *  (C) 2007 Dmitry Adamushko <dmitry.adamushko@gmail.com>
+ *
+ *  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  Copyright IBM Corporation, 2007
+ *  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
+ *
+ *  Scaled math optimizations by Thomas Gleixner
+ *  Copyright (C) 2007, Thomas Gleixner <tglx@linutronix.de>
+ *
+ *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
+ *
+ *  Move PELT related code from fair.c into this pelt.c file
+ *  Author: Vincent Guittot <vincent.guittot@linaro.org>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "sched-pelt.h"
+#include "pelt.h"
+
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static u64 decay_load(u64 val, u64 n)
+{
+	unsigned int local_n;
+
+	if (unlikely(n > LOAD_AVG_PERIOD * 63))
+		return 0;
+
+	/* after bounds checking we can collapse to 32-bit */
+	local_n = n;
+
+	/*
+	 * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+	 * With a look-up table which covers y^n (n<PERIOD)
+	 *
+	 * To achieve constant time decay_load.
+	 */
+	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+		val >>= local_n / LOAD_AVG_PERIOD;
+		local_n %= LOAD_AVG_PERIOD;
+	}
+
+	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+	return val;
+}
+
+static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
+{
+	u32 c1, c2, c3 = d3; /* y^0 == 1 */
+
+	/*
+	 * c1 = d1 y^p
+	 */
+	c1 = decay_load((u64)d1, periods);
+
+	/*
+	 *            p-1
+	 * c2 = 1024 \Sum y^n
+	 *            n=1
+	 *
+	 *              inf        inf
+	 *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
+	 *              n=0        n=p
+	 */
+	c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
+
+	return c1 + c2 + c3;
+}
+
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
+/*
+ * Accumulate the three separate parts of the sum; d1 the remainder
+ * of the last (incomplete) period, d2 the span of full periods and d3
+ * the remainder of the (incomplete) current period.
+ *
+ *           d1          d2           d3
+ *           ^           ^            ^
+ *           |           |            |
+ *         |<->|<----------------->|<--->|
+ * ... |---x---|------| ... |------|-----x (now)
+ *
+ *                           p-1
+ * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
+ *                           n=1
+ *
+ *    = u y^p +					(Step 1)
+ *
+ *                     p-1
+ *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
+ *                     n=1
+ */
+static __always_inline u32
+accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+	       unsigned long load, unsigned long runnable, int running)
+{
+	unsigned long scale_freq, scale_cpu;
+	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
+	u64 periods;
+
+	scale_freq = arch_scale_freq_capacity(cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	delta += sa->period_contrib;
+	periods = delta / 1024; /* A period is 1024us (~1ms) */
+
+	/*
+	 * Step 1: decay old *_sum if we crossed period boundaries.
+	 */
+	if (periods) {
+		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_load_sum =
+			decay_load(sa->runnable_load_sum, periods);
+		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+
+		/*
+		 * Step 2
+		 */
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods,
+				1024 - sa->period_contrib, delta);
+	}
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (load)
+		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_load_sum += runnable * contrib;
+	if (running)
+		sa->util_sum += contrib * scale_cpu;
+
+	return periods;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p2
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ *   y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+		  unsigned long load, unsigned long runnable, int running)
+{
+	u64 delta;
+
+	delta = now - sa->last_update_time;
+	/*
+	 * This should only happen when time goes backwards, which it
+	 * unfortunately does during sched clock init when we swap over to TSC.
+	 */
+	if ((s64)delta < 0) {
+		sa->last_update_time = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024ns as the unit of measurement since it's a reasonable
+	 * approximation of 1us and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+
+	sa->last_update_time += delta << 10;
+
+	/*
+	 * running is a subset of runnable (weight) so running can't be set if
+	 * runnable is clear. But there are some corner cases where the current
+	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * This means that weight will be 0 but not running for a sched_entity
+	 * but also for a cfs_rq if the latter becomes idle. As an example,
+	 * this happens during idle_balance() which calls
+	 * update_blocked_averages()
+	 */
+	if (!load)
+		runnable = running = 0;
+
+	/*
+	 * Now we know we crossed measurement unit boundaries. The *_avg
+	 * accrues by two steps:
+	 *
+	 * Step 1: accumulate *_sum since last_update_time. If we haven't
+	 * crossed period boundaries, finish.
+	 */
+	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+		return 0;
+
+	return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+{
+	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
+	/*
+	 * Step 2: update *_avg.
+	 */
+	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
+	sa->util_avg = sa->util_sum / divider;
+}
+
+/*
+ * sched_entity:
+ *
+ *   task:
+ *     se_runnable() == se_weight()
+ *
+ *   group: [ see update_cfs_group() ]
+ *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
+ *   load_sum := runnable_sum
+ *   load_avg = se_weight(se) * runnable_avg
+ *
+ *   runnable_load_sum := runnable_sum
+ *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
+ * cfq_rq:
+ *
+ *   load_sum = \Sum se_weight(se) * se->avg.load_sum
+ *   load_avg = \Sum se->avg.load_avg
+ *
+ *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ *   runnable_load_avg = \Sum se->avg.runable_load_avg
+ */
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		return 1;
+	}
+
+	return 0;
+}
+
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+				cfs_rq->curr == se)) {
+
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		cfs_se_util_change(&se->avg);
+		return 1;
+	}
+
+	return 0;
+}
+
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+	if (___update_load_sum(now, cpu, &cfs_rq->avg,
+				scale_load_down(cfs_rq->load.weight),
+				scale_load_down(cfs_rq->runnable_weight),
+				cfs_rq->curr != NULL)) {
+
+		___update_load_avg(&cfs_rq->avg, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
new file mode 100644
index 0000000..9cac73e
--- /dev/null
+++ b/kernel/sched/pelt.h
@@ -0,0 +1,43 @@
+#ifdef CONFIG_SMP
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+
+/*
+ * When a task is dequeued, its estimated utilization should not be update if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+	unsigned int enqueued;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Avoid store if the flag has been already set */
+	enqueued = avg->util_est.enqueued;
+	if (!(enqueued & UTIL_AVG_UNCHANGED))
+		return;
+
+	/* Reset flag to report util_avg has been updated */
+	enqueued &= ~UTIL_AVG_UNCHANGED;
+	WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
+#else
+
+static inline int
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+#endif
+
+
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf..db6878e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -666,7 +666,26 @@ struct dl_rq {
 	u64			bw_ratio;
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se)	(!se->my_q)
+#else
+#define entity_is_task(se)	1
+#endif
+
 #ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return scale_load_down(se->runnable_weight);
+}
 
 static inline bool sched_asym_prefer(int a, int b)
 {
-- 
2.7.4



* [PATCH 02/11] sched/rt: add rt_rq utilization tracking
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
  2018-06-28 15:45 ` [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 23:27   ` [tip:sched/core] sched/rt: Add " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 03/11] cpufreq/schedutil: use rt " Vincent Guittot
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

The schedutil governor relies on the cfs_rq's util_avg to choose the OPP
when cfs tasks are running. When the CPU is overloaded by cfs and rt tasks,
cfs tasks are preempted by rt tasks and in this case util_avg reflects the
remaining capacity but not what the cfs tasks want to use. In such a case,
schedutil can select a lower OPP whereas the CPU is overloaded. In order to
have a more accurate view of the utilization of the CPU, we also track the
utilization of rt tasks. Only util_avg is tracked; load_avg and
runnable_load_avg are not, as they are useless for the rt_rq.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task, but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 15 ++++++++++++++-
 kernel/sched/pelt.c  | 25 +++++++++++++++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/rt.c    | 13 +++++++++++++
 kernel/sched/sched.h |  7 +++++++
 5 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdab9ed..328bedc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,6 +7289,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+	if (READ_ONCE(rq->avg_rt.util_avg))
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7348,6 +7356,10 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	/* Don't need periodic decay once load/util_avg are null */
+	if (rt_rq_has_blocked(rq))
+		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
@@ -7413,9 +7425,10 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index e6ecbb2..a00b1ba 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -309,3 +309,28 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 
 	return 0;
 }
+
+/*
+ * rt_rq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ *   load_avg and runnable_load_avg are not supported and meaningless.
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+				running,
+				running,
+				running)) {
+
+		___update_load_avg(&rq->avg_rt, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73e..b2983b7 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@
 int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
 int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	return 0;
+}
+
 #endif
 
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 47556b0..0e3e57a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@
  */
 #include "sched.h"
 
+#include "pelt.h"
+
 int sched_rr_timeslice = RR_TIMESLICE;
 int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
 
@@ -1572,6 +1574,14 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	rt_queue_push_tasks(rq);
 
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+
 	return p;
 }
 
@@ -1579,6 +1589,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
@@ -2308,6 +2320,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db6878e..f2b12b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -592,6 +592,7 @@ struct rt_rq {
 	unsigned long		rt_nr_total;
 	int			overloaded;
 	struct plist_head	pushable_tasks;
+
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
@@ -847,6 +848,7 @@ struct rq {
 
 	u64			rt_avg;
 	u64			age_stamp;
+	struct sched_avg	avg_rt;
 	u64			idle_stamp;
 	u64			avg_idle;
 
@@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 	return util;
 }
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+	return rq->avg_rt.util_avg;
+}
 #endif
-- 
2.7.4



* [PATCH 03/11] cpufreq/schedutil: use rt utilization tracking
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
  2018-06-28 15:45 ` [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file Vincent Guittot
  2018-06-28 15:45 ` [PATCH 02/11] sched/rt: add rt_rq utilization tracking Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-06  5:56   ` Viresh Kumar
  2018-07-15 23:27   ` [tip:sched/core] cpufreq/schedutil: Use RT " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 04/11] sched/dl: add dl_rq " Vincent Guittot
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

Add both cfs and rt utilization when selecting an OPP for cfs tasks, as rt
tasks can preempt and steal cfs's running time.

rt util_avg is used to take into account the utilization of rt tasks on the
CPU when selecting the OPP. If an rt task migrates, the rt utilization will
not migrate but will decay over time. On an overloaded CPU, the cfs
utilization reflects the remaining utilization available on the CPU. When
the rt task migrates away, the cfs utilization will increase as tasks start
to use the newly available capacity. At the same pace, the rt utilization
will decay and both variations will compensate each other to keep the
overall utilization unchanged and prevent any OPP drop.
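
This compensation can be illustrated with a small standalone program (not
kernel code; it uses a continuous-time approximation of PELT with decay
factor y, y^32 = 0.5, and assumes an rt and a cfs task that each used half
of the capacity before the rt task migrated away):

/* toy example: gcc pelt_toy.c -lm (file name is arbitrary) */
#include <stdio.h>
#include <math.h>

int main(void)
{
	const double max = 1024.0;		/* CPU capacity */
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT decay per 1024us period */
	const double rt0 = 512.0, cfs0 = 512.0;	/* utilization at migration time */
	int ms;

	for (ms = 0; ms <= 64; ms += 8) {
		double rt = rt0 * pow(y, ms);			/* rt util decays */
		double cfs = max - (max - cfs0) * pow(y, ms);	/* cfs util ramps up */

		printf("t=%2dms rt=%6.1f cfs=%6.1f sum=%6.1f\n",
		       ms, rt, cfs, rt + cfs);
	}
	return 0;
}

The sum column stays at 1024 all along, which is why schedutil does not see
a utilization drop (and thus an OPP drop) during the transition.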

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3cde464..9c5e92e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
 	/* The fields below are only needed when sharing a policy: */
 	unsigned long		util_cfs;
 	unsigned long		util_dl;
+	unsigned long		util_rt;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -186,15 +187,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
 	sg_cpu->util_cfs = cpu_util_cfs(rq);
 	sg_cpu->util_dl  = cpu_util_dl(rq);
+	sg_cpu->util_rt  = cpu_util_rt(rq);
 }
 
 static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util;
 
 	if (rq->rt.rt_nr_running)
 		return sg_cpu->max;
 
+	util = sg_cpu->util_dl;
+	util += sg_cpu->util_cfs;
+	util += sg_cpu->util_rt;
+
 	/*
 	 * Utilization required by DEADLINE must always be granted while, for
 	 * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
@@ -205,7 +212,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
-	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+	return min(sg_cpu->max, util);
 }
 
 /**
-- 
2.7.4



* [PATCH 04/11] sched/dl: add dl_rq utilization tracking
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (2 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 03/11] cpufreq/schedutil: use rt " Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 23:28   ` [tip:sched/core] sched/dl: Add " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 05/11] cpufreq/schedutil: use dl " Vincent Guittot
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

Similarly to what happens with rt tasks, cfs tasks can be preempted by dl
tasks and the cfs utilization might no longer describe the real utilization
level.
The current dl bandwidth reflects the requirements to meet the deadlines
when tasks are enqueued, but not the current utilization of the dl sched
class. We track the dl class utilization to estimate the system utilization.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/deadline.c |  6 ++++++
 kernel/sched/fair.c     | 11 ++++++++---
 kernel/sched/pelt.c     | 23 +++++++++++++++++++++++
 kernel/sched/pelt.h     |  6 ++++++
 kernel/sched/sched.h    |  1 +
 5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fbfc3f1..f4de2698 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,7 @@
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
 #include "sched.h"
+#include "pelt.h"
 
 struct dl_bandwidth def_dl_bandwidth;
 
@@ -1761,6 +1762,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	deadline_queue_push_tasks(rq);
 
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+
 	return p;
 }
 
@@ -1768,6 +1772,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
@@ -1784,6 +1789,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 {
 	update_curr_dl(rq);
 
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
 	/*
 	 * Even when we have runtime, update_curr_dl() might have resulted in us
 	 * not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 328bedc..ffce4b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,11 +7289,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
-static inline bool rt_rq_has_blocked(struct rq *rq)
+static inline bool others_rqs_have_blocked(struct rq *rq)
 {
 	if (READ_ONCE(rq->avg_rt.util_avg))
 		return true;
 
+	if (READ_ONCE(rq->avg_dl.util_avg))
+		return true;
+
 	return false;
 }
 
@@ -7357,8 +7360,9 @@ static void update_blocked_averages(int cpu)
 			done = false;
 	}
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
-	if (rt_rq_has_blocked(rq))
+	if (others_rqs_have_blocked(rq))
 		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -7426,9 +7430,10 @@ static inline void update_blocked_averages(int cpu)
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a00b1ba..8b78b63 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -334,3 +334,26 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 
 	return 0;
 }
+
+/*
+ * dl_rq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+				running,
+				running,
+				running)) {
+
+		___update_load_avg(&rq->avg_dl, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index b2983b7..0e4f912 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -4,6 +4,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
 int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -45,6 +46,11 @@ update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+static inline int
+update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2b12b0..7d7d4f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -849,6 +849,7 @@ struct rq {
 	u64			rt_avg;
 	u64			age_stamp;
 	struct sched_avg	avg_rt;
+	struct sched_avg	avg_dl;
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.7.4



* [PATCH 05/11] cpufreq/schedutil: use dl utilization tracking
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (3 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 04/11] sched/dl: add dl_rq " Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-06  5:59   ` Viresh Kumar
  2018-07-15 23:28   ` [tip:sched/core] cpufreq/schedutil: Use DL " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 06/11] sched/irq: add irq " Vincent Guittot
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can detect when the CPU is fully used, in which case we
should run at the max OPP. Otherwise, we keep using the dl bandwidth
requirement to define the utilization of the CPU.
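
As a rough sketch of the decision (simplified and with hypothetical names;
the real code is in sugov_aggregate_util() in the diff below):

static unsigned long sugov_util_sketch(unsigned long util_cfs,
				       unsigned long util_rt,
				       unsigned long util_dl,
				       unsigned long bw_dl,
				       unsigned long max)
{
	unsigned long util = util_cfs + util_rt;

	/* no idle time left on the CPU: request the max OPP */
	if (util + util_dl >= max)
		return max;

	/* otherwise grant the dl bandwidth requirement, not dl util_avg */
	util += bw_dl;

	return util < max ? util : max;
}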

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
 kernel/sched/sched.h             |  7 ++++++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9c5e92e..edfbfc1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
 	/* The fields below are only needed when sharing a policy: */
 	unsigned long		util_cfs;
 	unsigned long		util_dl;
+	unsigned long		bw_dl;
 	unsigned long		util_rt;
 	unsigned long		max;
 
@@ -187,6 +188,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
 	sg_cpu->util_cfs = cpu_util_cfs(rq);
 	sg_cpu->util_dl  = cpu_util_dl(rq);
+	sg_cpu->bw_dl    = cpu_bw_dl(rq);
 	sg_cpu->util_rt  = cpu_util_rt(rq);
 }
 
@@ -198,20 +200,29 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	if (rq->rt.rt_nr_running)
 		return sg_cpu->max;
 
-	util = sg_cpu->util_dl;
-	util += sg_cpu->util_cfs;
+	util = sg_cpu->util_cfs;
 	util += sg_cpu->util_rt;
 
+	if ((util + sg_cpu->util_dl) >= sg_cpu->max)
+		return sg_cpu->max;
+
 	/*
-	 * Utilization required by DEADLINE must always be granted while, for
-	 * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
-	 * gracefully reduce the frequency when no tasks show up for longer
+	 * As there is still idle time on the CPU, we need to compute the
+	 * utilization level of the CPU.
+	 *
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
 	 * periods of time.
 	 *
 	 * Ideally we would like to set util_dl as min/guaranteed freq and
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
+
+	/* Add DL bandwidth requirement */
+	util += sg_cpu->bw_dl;
+
 	return min(sg_cpu->max, util);
 }
 
@@ -367,7 +378,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
  */
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
 {
-	if (cpu_util_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->util_dl)
+	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
 		sg_policy->need_freq_update = true;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d7d4f4..ef5d6aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2192,11 +2192,16 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-static inline unsigned long cpu_util_dl(struct rq *rq)
+static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
 }
 
+static inline unsigned long cpu_util_dl(struct rq *rq)
+{
+	return READ_ONCE(rq->avg_dl.util_avg);
+}
+
 static inline unsigned long cpu_util_cfs(struct rq *rq)
 {
 	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
-- 
2.7.4



* [PATCH 06/11] sched/irq: add irq utilization tracking
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (4 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 05/11] cpufreq/schedutil: use dl " Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 23:29   ` [tip:sched/core] sched/irq: Add IRQ " tip-bot for Vincent Guittot
  2018-07-26  3:09   ` [PATCH 06/11] sched/irq: add irq " Wanpeng Li
  2018-06-28 15:45 ` [PATCH 07/11] cpufreq/schedutil: take into account interrupt Vincent Guittot
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

Interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for the sched classes, we can use PELT to track their average
utilization of the CPU. But unlike the sched classes, we don't track when
entering/leaving interrupt; instead, we take into account the time spent
under interrupt context when we update the rq's clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for the interrupt time during the update.

It is also important to note that because
  rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives a more accurate level of utilization of the CPU.
The CPU utilization is:
  avg_irq + (1 - avg_irq / max capacity) * \Sum avg_rq

Most of the time, avg_irq is small and negligible, so the approximation
CPU utilization = \Sum avg_rq was good enough.
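
A standalone sketch of this aggregation (illustrative only; the names are
not the kernel's):

/* utilization values are in capacity units, e.g. max = 1024 */
static unsigned long cpu_total_util(unsigned long util_irq,
				    unsigned long util_rqs, /* cfs + rt + dl */
				    unsigned long max)
{
	/*
	 * The rq utilizations are tracked with rq_clock_task, i.e. in the
	 * time left once irq time has been removed, so scale them back to
	 * the full CPU time before adding the irq utilization.
	 */
	unsigned long util = util_rqs * (max - util_irq) / max;

	util += util_irq;

	return util < max ? util : max;
}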

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  4 +++-
 kernel/sched/fair.c  | 13 ++++++++++---
 kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  | 16 ++++++++++++++++
 kernel/sched/sched.h |  3 +++
 5 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..e5263a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -18,6 +18,8 @@
 #include "../workqueue_internal.h"
 #include "../smpboot.h"
 
+#include "pelt.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
@@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-		sched_rt_avg_update(rq, irq_delta + steal);
+		update_irq_load_avg(rq, irq_delta + steal);
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffce4b2..d2758e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
-static inline bool others_rqs_have_blocked(struct rq *rq)
+static inline bool others_have_blocked(struct rq *rq)
 {
 	if (READ_ONCE(rq->avg_rt.util_avg))
 		return true;
@@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
 	if (READ_ONCE(rq->avg_dl.util_avg))
 		return true;
 
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	if (READ_ONCE(rq->avg_irq.util_avg))
+		return true;
+#endif
+
 	return false;
 }
 
@@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
 	}
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
-	if (others_rqs_have_blocked(rq))
+	if (others_have_blocked(rq))
 		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 8b78b63..ead6d8b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 
 	return 0;
 }
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+/*
+ * irq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+	int ret = 0;
+	/*
+	 * We know the time that has been used by interrupt since last update
+	 * but we don't when. Let be pessimistic and assume that interrupt has
+	 * happened just before the update. This is not so far from reality
+	 * because interrupt will most probably wake up task and trig an update
+	 * of rq clock during which the metric si updated.
+	 * We start to decay with normal context time and then we add the
+	 * interrupt context time.
+	 * We can safely remove running from rq->clock because
+	 * rq->clock += delta with delta >= running
+	 */
+	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+				0,
+				0,
+				0);
+	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+				1,
+				1,
+				1);
+
+	if (ret)
+		___update_load_avg(&rq->avg_irq, 1, 1);
+
+	return ret;
+}
+#endif
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912..d2894db 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+int update_irq_load_avg(struct rq *rq, u64 running);
+#else
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+	return 0;
+}
+#endif
+
 /*
  * When a task is dequeued, its estimated utilization should not be update if
  * its util_avg has not been updated at least once.
@@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	return 0;
 }
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef5d6aa..377be2b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,6 +850,9 @@ struct rq {
 	u64			age_stamp;
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	struct sched_avg	avg_irq;
+#endif
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 07/11] cpufreq/schedutil: take into account interrupt
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (5 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 06/11] sched/irq: add irq " Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-06  6:00   ` Viresh Kumar
  2018-07-15 23:29   ` [tip:sched/core] cpufreq/schedutil: Take time spent in interrupts into account tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util() Vincent Guittot
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

The time spent under interrupt can be significant but it is not reflected
in the utilization of the CPU when choosing an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.
The rq utilization signals don't see the time spent in interrupt context and
report their value over the normal context time window. We need to compensate
for this when adding the interrupt utilization.

The CPU utilization is :
  irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
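
For illustration only, here is a minimal standalone sketch of that aggregation
(the helper name and the values passed to it are made up; this is not the
kernel implementation):

static unsigned long cpu_total_util(unsigned long util_cfs, unsigned long util_rt,
				    unsigned long util_irq, unsigned long max)
{
	/* rq signals are tracked in the normal (task) context time window */
	unsigned long util = util_cfs + util_rt;

	if (util_irq >= max)
		return max;

	/* scale rq utilization to the whole window, then add the irq part */
	util = util * (max - util_irq) / max;
	util += util_irq;

	return util < max ? util : max;
}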

A test with iperf on hikey (octo arm64) gives:
iperf -c server_address -r -t 5

w/o patch		w/ patch
Tx 276 Mbits/sec        304 Mbits/sec +10%
Rx 299 Mbits/sec        328 Mbits/sec +09%

8 iterations
stdev is lower than 1%
Only WFI idle state is enable (shallowest diel state)

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
 kernel/sched/sched.h             | 13 +++++++++++++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index edfbfc1..b77bfef 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -58,6 +58,7 @@ struct sugov_cpu {
 	unsigned long		util_dl;
 	unsigned long		bw_dl;
 	unsigned long		util_rt;
+	unsigned long		util_irq;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->util_dl  = cpu_util_dl(rq);
 	sg_cpu->bw_dl    = cpu_bw_dl(rq);
 	sg_cpu->util_rt  = cpu_util_rt(rq);
+	sg_cpu->util_irq = cpu_util_irq(rq);
 }
 
 static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util;
+	unsigned long util, max = sg_cpu->max;
 
 	if (rq->rt.rt_nr_running)
 		return sg_cpu->max;
 
+	if (unlikely(sg_cpu->util_irq >= max))
+		return max;
+
+	/* Sum rq utilization */
 	util = sg_cpu->util_cfs;
 	util += sg_cpu->util_rt;
 
-	if ((util + sg_cpu->util_dl) >= sg_cpu->max)
-		return sg_cpu->max;
+	/*
+	 * Interrupt time is not seen by rqs utilization nso we can compare
+	 * them with the CPU capacity
+	 */
+	if ((util + sg_cpu->util_dl) >= max)
+		return max;
 
 	/*
 	 * As there is still idle time on the CPU, we need to compute the
@@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
 
+	/* Weight rqs utilization to normal context window */
+	util *= (max - sg_cpu->util_irq);
+	util /= max;
+
+	/* Add interrupt utilization */
+	util += sg_cpu->util_irq;
+
 	/* Add DL bandwidth requirement */
 	util += sg_cpu->bw_dl;
 
-	return min(sg_cpu->max, util);
+	return min(max, util);
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 377be2b..9438e68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2221,4 +2221,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 {
 	return rq->avg_rt.util_avg;
 }
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+	return rq->avg_irq.util_avg;
+}
+#else
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+	return 0;
+}
+
+#endif
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util()
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (6 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 07/11] cpufreq/schedutil: take into account interrupt Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-06  6:02   ` Viresh Kumar
  2018-07-15 23:30   ` [tip:sched/core] sched/cpufreq: Remove sugov_aggregate_util() tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 09/11] sched: use pelt for scale_rt_capacity() Vincent Guittot
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebased after adding irq tracking and fixed some compilation errors]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
 kernel/sched/sched.h             |  2 +-
 2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index b77bfef..d04f941 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -53,12 +53,7 @@ struct sugov_cpu {
 	unsigned int		iowait_boost_max;
 	u64			last_update;
 
-	/* The fields below are only needed when sharing a policy: */
-	unsigned long		util_cfs;
-	unsigned long		util_dl;
 	unsigned long		bw_dl;
-	unsigned long		util_rt;
-	unsigned long		util_irq;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-static void sugov_get_util(struct sugov_cpu *sg_cpu)
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util, irq, max;
 
-	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
-	sg_cpu->util_cfs = cpu_util_cfs(rq);
-	sg_cpu->util_dl  = cpu_util_dl(rq);
-	sg_cpu->bw_dl    = cpu_bw_dl(rq);
-	sg_cpu->util_rt  = cpu_util_rt(rq);
-	sg_cpu->util_irq = cpu_util_irq(rq);
-}
-
-static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
-{
-	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util, max = sg_cpu->max;
+	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
 
 	if (rq->rt.rt_nr_running)
-		return sg_cpu->max;
+		return max;
+
+	irq = cpu_util_irq(rq);
 
-	if (unlikely(sg_cpu->util_irq >= max))
+	if (unlikely(irq >= max))
 		return max;
 
 	/* Sum rq utilization */
-	util = sg_cpu->util_cfs;
-	util += sg_cpu->util_rt;
+	util = cpu_util_cfs(rq);
+	util += cpu_util_rt(rq);
 
 	/*
 	 * Interrupt time is not seen by rqs utilization nso we can compare
 	 * them with the CPU capacity
 	 */
-	if ((util + sg_cpu->util_dl) >= max)
+	if ((util + cpu_util_dl(rq)) >= max)
 		return max;
 
 	/*
@@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 */
 
 	/* Weight rqs utilization to normal context window */
-	util *= (max - sg_cpu->util_irq);
+	util *= (max - irq);
 	util /= max;
 
 	/* Add interrupt utilization */
-	util += sg_cpu->util_irq;
+	util += irq;
 
 	/* Add DL bandwidth requirement */
 	util += sg_cpu->bw_dl;
@@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 
 	busy = sugov_cpu_is_busy(sg_cpu);
 
-	sugov_get_util(sg_cpu);
+	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;
-	util = sugov_aggregate_util(sg_cpu);
 	sugov_iowait_apply(sg_cpu, time, &util, &max);
 	next_f = get_next_freq(sg_policy, util, max);
 	/*
@@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long j_util, j_max;
 
-		sugov_get_util(j_sg_cpu);
+		j_util = sugov_get_util(j_sg_cpu);
 		j_max = j_sg_cpu->max;
-		j_util = sugov_aggregate_util(j_sg_cpu);
 		sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);
 
 		if (j_util * max > j_max * util) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9438e68..59a633d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2219,7 +2219,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 static inline unsigned long cpu_util_rt(struct rq *rq)
 {
-	return rq->avg_rt.util_avg;
+	return READ_ONCE(rq->avg_rt.util_avg);
 }
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 09/11] sched: use pelt for scale_rt_capacity()
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (7 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util() Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 22:15   ` Ingo Molnar
  2018-07-15 23:32   ` [tip:sched/core] sched/core: Use PELT " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 10/11] sched: remove rt_avg code Vincent Guittot
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

The utilization of the CPU by rt, dl and interrupts is now tracked with
PELT, so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for the cfs class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for cfs instead of a scaling factor, because rt, dl and
interrupt now provide absolute utilization values.

The same formula as schedutil is used:
  irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit from the same optimization.
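
As an illustration with made-up numbers (max capacity 1024, rt+dl utilization
256, irq utilization 64), the remaining capacity for cfs would be computed
roughly as in this sketch:

	unsigned long max = 1024, used = 256, irq = 64, free;

	free = max - used;			/* 1024 - 256 = 768 */
	free = free * (max - irq) / max;	/* 768 * 960 / 1024 = 720 */

so cfs is left with about 70% of the original capacity in this example.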

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/deadline.c |  2 --
 kernel/sched/fair.c     | 41 +++++++++++++++++++----------------------
 kernel/sched/pelt.c     |  2 +-
 kernel/sched/rt.c       |  2 --
 4 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4de2698..68b8a9f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2758e3..ce0dcbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, used, age_stamp, avg;
-	s64 delta;
-
-	/*
-	 * Since we're reading these variables without serialization make sure
-	 * we read them once before doing sanity checks on them.
-	 */
-	age_stamp = READ_ONCE(rq->age_stamp);
-	avg = READ_ONCE(rq->rt_avg);
-	delta = __rq_clock_broken(rq) - age_stamp;
+	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long used, irq, free;
 
-	if (unlikely(delta < 0))
-		delta = 0;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	irq = READ_ONCE(rq->avg_irq.util_avg);
 
-	total = sched_avg_period() + delta;
+	if (unlikely(irq >= max))
+		return 1;
+#endif
 
-	used = div_u64(avg, total);
+	used = READ_ONCE(rq->avg_rt.util_avg);
+	used += READ_ONCE(rq->avg_dl.util_avg);
 
-	if (likely(used < SCHED_CAPACITY_SCALE))
-		return SCHED_CAPACITY_SCALE - used;
+	if (unlikely(used >= max))
+		return 1;
 
-	return 1;
+	free = max - used;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	free *= (max - irq);
+	free /= max;
+#endif
+	return free;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
+	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
 
-	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-
-	capacity *= scale_rt_capacity(cpu);
-	capacity >>= SCHED_CAPACITY_SHIFT;
+	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
 
 	if (!capacity)
 		capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index ead6d8b..35475c0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
 	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
-	sa->util_avg = sa->util_sum / divider;
+	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0e3e57a..2a881bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -970,8 +970,6 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 10/11] sched: remove rt_avg code
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (8 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 09/11] sched: use pelt for scale_rt_capacity() Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-07-15 23:33   ` [tip:sched/core] sched/core: Remove the " tip-bot for Vincent Guittot
  2018-06-28 15:45 ` [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms Vincent Guittot
  2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
  11 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar

rt_avg is no longer used anywhere, so we can remove all the related code.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 26 --------------------------
 kernel/sched/fair.c  |  2 --
 kernel/sched/sched.h | 17 -----------------
 3 files changed, 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5263a4..e9aae7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -652,23 +652,6 @@ bool sched_can_stop_tick(struct rq *rq)
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
-	s64 period = sched_avg_period();
-
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (rq->age_stamp));
-		rq->age_stamp += period;
-		rq->rt_avg /= 2;
-	}
-}
-
 #endif /* CONFIG_SMP */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5719,13 +5702,6 @@ void set_rq_offline(struct rq *rq)
 	}
 }
 
-static void set_cpu_rq_start_time(unsigned int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-
-	rq->age_stamp = sched_clock_cpu(cpu);
-}
-
 /*
  * used to mark begin/end of suspend/resume:
  */
@@ -5843,7 +5819,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
-	set_cpu_rq_start_time(cpu);
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -6111,7 +6086,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP
 	idle_thread_set_boot_cpu();
-	set_cpu_rq_start_time(smp_processor_id());
 #endif
 	init_sched_fair_class();
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce0dcbf..7ddb13a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5322,8 +5322,6 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
-
-	sched_avg_update(this_rq);
 }
 
 /* Used instead of source_load when we know the type == 0 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59a633d..c71ea81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -846,8 +846,6 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
-	u64			rt_avg;
-	u64			age_stamp;
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
@@ -1712,11 +1710,6 @@ extern const_debug unsigned int sysctl_sched_time_avg;
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-static inline u64 sched_avg_period(void)
-{
-	return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
@@ -1753,8 +1746,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
 #endif
 
 #ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
 #ifndef arch_scale_cpu_capacity
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1765,12 +1756,6 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 	return SCHED_CAPACITY_SCALE;
 }
 #endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
-	rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
-	sched_avg_update(rq);
-}
 #else
 #ifndef arch_scale_cpu_capacity
 static __always_inline
@@ -1779,8 +1764,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
 	return SCHED_CAPACITY_SCALE;
 }
 #endif
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
 struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (9 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 10/11] sched: remove rt_avg code Vincent Guittot
@ 2018-06-28 15:45 ` Vincent Guittot
  2018-06-28 15:51   ` Luis R. Rodriguez
  2018-07-15 23:33   ` [tip:sched/core] sched/sysctl: Remove unused sched_time_avg_ms sysctl tip-bot for Vincent Guittot
  2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
  11 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-28 15:45 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: rjw, juri.lelli, dietmar.eggemann, Morten.Rasmussen,
	viresh.kumar, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio,
	Vincent Guittot, Ingo Molnar, Kees Cook, Luis R. Rodriguez

/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
Remove it

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/sysctl.h | 1 -
 kernel/sched/core.c          | 8 --------
 kernel/sched/sched.h         | 1 -
 kernel/sysctl.c              | 8 --------
 4 files changed, 18 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..913488d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,7 +40,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
-extern __read_mostly unsigned int sysctl_sched_time_avg;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9aae7f..6935691 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,14 +48,6 @@ const_debug unsigned int sysctl_sched_features =
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
-/*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c71ea81..47b9175 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1706,7 +1706,6 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
 extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 
-extern const_debug unsigned int sysctl_sched_time_avg;
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2d9837c..f22f76b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,14 +368,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "sched_time_avg_ms",
-		.data		= &sysctl_sched_time_avg,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms
  2018-06-28 15:45 ` [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms Vincent Guittot
@ 2018-06-28 15:51   ` Luis R. Rodriguez
  2018-06-29  5:49     ` Vincent Guittot
  2018-07-15 23:33   ` [tip:sched/core] sched/sysctl: Remove unused sched_time_avg_ms sysctl tip-bot for Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Luis R. Rodriguez @ 2018-06-28 15:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, viresh.kumar, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio, Ingo Molnar, Kees Cook, Luis R. Rodriguez

On Thu, Jun 28, 2018 at 05:45:14PM +0200, Vincent Guittot wrote:
> /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
> Remove it
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms
  2018-06-28 15:51   ` Luis R. Rodriguez
@ 2018-06-29  5:49     ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-06-29  5:49 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar, Kees Cook

On Thu, 28 Jun 2018 at 21:03, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
>
> On Thu, Jun 28, 2018 at 05:45:14PM +0200, Vincent Guittot wrote:
> > /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
> > Remove it
> >
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>

Thanks

Vincent

>
>   Luis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 00/11] track CPU utilization
  2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
                   ` (10 preceding siblings ...)
  2018-06-28 15:45 ` [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms Vincent Guittot
@ 2018-07-05 12:36 ` Peter Zijlstra
  2018-07-05 13:32   ` Vincent Guittot
                     ` (2 more replies)
  11 siblings, 3 replies; 44+ messages in thread
From: Peter Zijlstra @ 2018-07-05 12:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, viresh.kumar, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio

On Thu, Jun 28, 2018 at 05:45:03PM +0200, Vincent Guittot wrote:
> Vincent Guittot (11):
>   sched/pelt: Move pelt related code in a dedicated file
>   sched/rt: add rt_rq utilization tracking
>   cpufreq/schedutil: use rt utilization tracking
>   sched/dl: add dl_rq utilization tracking
>   cpufreq/schedutil: use dl utilization tracking
>   sched/irq: add irq utilization tracking
>   cpufreq/schedutil: take into account interrupt
>   sched: schedutil: remove sugov_aggregate_util()
>   sched: use pelt for scale_rt_capacity()
>   sched: remove rt_avg code
>   proc/sched: remove unused sched_time_avg_ms
> 
>  include/linux/sched/sysctl.h     |   1 -
>  kernel/sched/Makefile            |   2 +-
>  kernel/sched/core.c              |  38 +---
>  kernel/sched/cpufreq_schedutil.c |  65 ++++---
>  kernel/sched/deadline.c          |   8 +-
>  kernel/sched/fair.c              | 403 +++++----------------------------------
>  kernel/sched/pelt.c              | 399 ++++++++++++++++++++++++++++++++++++++
>  kernel/sched/pelt.h              |  72 +++++++
>  kernel/sched/rt.c                |  15 +-
>  kernel/sched/sched.h             |  68 +++++--
>  kernel/sysctl.c                  |   8 -
>  11 files changed, 632 insertions(+), 447 deletions(-)
>  create mode 100644 kernel/sched/pelt.c
>  create mode 100644 kernel/sched/pelt.h

OK, this looks good I suppose. Rafael, are you OK with me taking these?

I have the below on top because I once again forgot how it all worked;
does this work for you Vincent?

---
Subject: sched/cpufreq: Clarify sugov_get_util()

Add a few comments (hopefully) clarifying some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 18 deletions(-)

--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ *   cpu_util_{cfs,rt,dl,irq}()
+ *   cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued in
+ * the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured meric but a value computed
+ * based on the task model parameters and gives the minimal u required to meet
+ * deadlines.
+ */
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
@@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
 	if (rt_rq_is_runnable(&rq->rt))
 		return max;
 
+	/*
+	 * Early check to see if IRQ/steal time saturates the CPU, can be
+	 * because of inaccuracies in how we track these -- see
+	 * update_irq_load_avg().
+	 */
 	irq = cpu_util_irq(rq);
-
 	if (unlikely(irq >= max))
 		return max;
 
-	/* Sum rq utilization */
+	/*
+	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
+	 * CFS tasks and we use the same metric to track the effective
+	 * utilization (PELT windows are synchronized) we can directly add them
+	 * to obtain the CPU's actual utilization.
+	 */
 	util = cpu_util_cfs(rq);
 	util += cpu_util_rt(rq);
 
 	/*
-	 * Interrupt time is not seen by rqs utilization nso we can compare
-	 * them with the CPU capacity
+	 * We do not make cpu_util_dl() a permanent part of this sum because we
+	 * want to use cpu_bw_dl() later on, but we need to check if the
+	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
+	 * f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
 	if ((util + cpu_util_dl(rq)) >= max)
 		return max;
 
 	/*
-	 * As there is still idle time on the CPU, we need to compute the
-	 * utilization level of the CPU.
+	 * There is still idle time; further improve the number by using the
+	 * irq metric. Because IRQ/steal time is hidden from the task clock we
+	 * need to scale the task numbers:
 	 *
+	 *              1 - irq
+	 *   U' = irq + ------- * U
+	 *                max
+	 */
+	util *= (max - irq);
+	util /= max;
+	util += irq;
+
+	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
 	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
 	 * to gracefully reduce the frequency when no tasks show up for longer
@@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
-
-	/* Weight rqs utilization to normal context window */
-	util *= (max - irq);
-	util /= max;
-
-	/* Add interrupt utilization */
-	util += irq;
-
-	/* Add DL bandwidth requirement */
-	util += sg_cpu->bw_dl;
-
-	return min(max, util);
+	return min(max, util + sg_cpu->bw_dl);
 }
 
 /**

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 00/11] track CPU utilization
  2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
@ 2018-07-05 13:32   ` Vincent Guittot
  2018-07-06  6:05   ` Viresh Kumar
  2018-07-15 23:34   ` [tip:sched/core] sched/cpufreq: Clarify sugov_get_util() tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-07-05 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Rafael J. Wysocki, Juri Lelli,
	Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino

Hi Peter

On Thu, 5 Jul 2018 at 14:36, Peter Zijlstra <peterz@infradead.org> wrote:
>

>
> OK, this looks good I suppose. Rafael, are you OK with me taking these?
>
> I have the below on top because I once again forgot how it all worked;
> does this work for you Vincent?

Yes looks good to me

Thanks

>
> ---
> Subject: sched/cpufreq: Clarify sugov_get_util()
>
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 18 deletions(-)
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 03/11] cpufreq/schedutil: use rt utilization tracking
  2018-06-28 15:45 ` [PATCH 03/11] cpufreq/schedutil: use rt " Vincent Guittot
@ 2018-07-06  5:56   ` Viresh Kumar
  2018-07-15 23:27   ` [tip:sched/core] cpufreq/schedutil: Use RT " tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: Viresh Kumar @ 2018-07-06  5:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio, Ingo Molnar

On 28-06-18, 17:45, Vincent Guittot wrote:
> Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
> can preempt and steal cfs's running time.
> 
> rt util_avg is used to take into account the utilization of rt tasks
> on the CPU when selecting OPP. If a rt task migrate, the rt utilization
> will not migrate but will decay over time. On an overloaded CPU, cfs
> utilization reflects the remaining utilization avialable on CPU. When rt
> task migrates, the cfs utilization will increase when tasks will start to
> use the newly available capacity. At the same pace, rt utilization will
> decay and both variations will compensate each other  to keep unchanged
> overall utilization and will prevent any OPP drop.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 05/11] cpufreq/schedutil: use dl utilization tracking
  2018-06-28 15:45 ` [PATCH 05/11] cpufreq/schedutil: use dl " Vincent Guittot
@ 2018-07-06  5:59   ` Viresh Kumar
  2018-07-15 23:28   ` [tip:sched/core] cpufreq/schedutil: Use DL " tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: Viresh Kumar @ 2018-07-06  5:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio, Ingo Molnar

On 28-06-18, 17:45, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can detect when CPU is fully used so we should run at max.
> Otherwise, we keep using the dl bandwidth requirement to define the
> utilization of the CPU
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
>  kernel/sched/sched.h             |  7 ++++++-
>  2 files changed, 23 insertions(+), 7 deletions(-)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt
  2018-06-28 15:45 ` [PATCH 07/11] cpufreq/schedutil: take into account interrupt Vincent Guittot
@ 2018-07-06  6:00   ` Viresh Kumar
  2018-07-06  9:14     ` Peter Zijlstra
  2018-07-15 23:29   ` [tip:sched/core] cpufreq/schedutil: Take time spent in interrupts into account tip-bot for Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Viresh Kumar @ 2018-07-06  6:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio, Ingo Molnar

On 28-06-18, 17:45, Vincent Guittot wrote:
> The time spent under interrupt can be significant but it is not reflected
> in the utilization of CPU when deciding to choose an OPP. Now that we have
> access to this metric, schedutil can take it into account when selecting
> the OPP for a CPU.
> rqs utilization don't see the time spend under interrupt context and report
> their value in the normal context time window. We need to compensate this when
> adding interrupt utilization
> 
> The CPU utilization is :
>   irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> 
> A test with iperf on hikey (octo arm64) gives:
> iperf -c server_address -r -t 5
> 
> w/o patch		w/ patch
> Tx 276 Mbits/sec        304 Mbits/sec +10%
> Rx 299 Mbits/sec        328 Mbits/sec +09%
> 
> 8 iterations
> stdev is lower than 1%
> Only WFI idle state is enable (shallowest diel state)
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
>  kernel/sched/sched.h             | 13 +++++++++++++
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index edfbfc1..b77bfef 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -58,6 +58,7 @@ struct sugov_cpu {
>  	unsigned long		util_dl;
>  	unsigned long		bw_dl;
>  	unsigned long		util_rt;
> +	unsigned long		util_irq;
>  	unsigned long		max;
>  
>  	/* The field below is for single-CPU policies only: */
> @@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>  	sg_cpu->util_dl  = cpu_util_dl(rq);
>  	sg_cpu->bw_dl    = cpu_bw_dl(rq);
>  	sg_cpu->util_rt  = cpu_util_rt(rq);
> +	sg_cpu->util_irq = cpu_util_irq(rq);
>  }
>  
>  static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>  {
>  	struct rq *rq = cpu_rq(sg_cpu->cpu);
> -	unsigned long util;
> +	unsigned long util, max = sg_cpu->max;
>  
>  	if (rq->rt.rt_nr_running)
>  		return sg_cpu->max;
>  
> +	if (unlikely(sg_cpu->util_irq >= max))
> +		return max;
> +
> +	/* Sum rq utilization */
>  	util = sg_cpu->util_cfs;
>  	util += sg_cpu->util_rt;
>  
> -	if ((util + sg_cpu->util_dl) >= sg_cpu->max)
> -		return sg_cpu->max;
> +	/*
> +	 * Interrupt time is not seen by rqs utilization nso we can compare

                                                         nso ?

> +	 * them with the CPU capacity
> +	 */
> +	if ((util + sg_cpu->util_dl) >= max)
> +		return max;
>  
>  	/*
>  	 * As there is still idle time on the CPU, we need to compute the
> @@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>  	 * ready for such an interface. So, we only do the latter for now.
>  	 */
>  
> +	/* Weight rqs utilization to normal context window */
> +	util *= (max - sg_cpu->util_irq);
> +	util /= max;
> +
> +	/* Add interrupt utilization */
> +	util += sg_cpu->util_irq;
> +
>  	/* Add DL bandwidth requirement */
>  	util += sg_cpu->bw_dl;
>  
> -	return min(sg_cpu->max, util);
> +	return min(max, util);
>  }
>  
>  /**
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 377be2b..9438e68 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2221,4 +2221,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
>  	return rq->avg_rt.util_avg;
>  }
> +
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +static inline unsigned long cpu_util_irq(struct rq *rq)
> +{
> +	return rq->avg_irq.util_avg;
> +}
> +#else
> +static inline unsigned long cpu_util_irq(struct rq *rq)
> +{
> +	return 0;
> +}
> +
> +#endif
>  #endif

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util()
  2018-06-28 15:45 ` [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util() Vincent Guittot
@ 2018-07-06  6:02   ` Viresh Kumar
  2018-07-15 23:30   ` [tip:sched/core] sched/cpufreq: Remove sugov_aggregate_util() tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: Viresh Kumar @ 2018-07-06  6:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, valentin.schneider, patrick.bellasi, joel,
	daniel.lezcano, quentin.perret, luca.abeni, claudio, Ingo Molnar

On 28-06-18, 17:45, Vincent Guittot wrote:
> There is no reason why sugov_get_util() and sugov_aggregate_util()
> were in fact separate functions.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [rebased after adding irq tracking and fixed some compilation errors]
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
>  kernel/sched/sched.h             |  2 +-
>  2 files changed, 16 insertions(+), 30 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index b77bfef..d04f941 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -53,12 +53,7 @@ struct sugov_cpu {
>  	unsigned int		iowait_boost_max;
>  	u64			last_update;
>  
> -	/* The fields below are only needed when sharing a policy: */
> -	unsigned long		util_cfs;
> -	unsigned long		util_dl;
>  	unsigned long		bw_dl;
> -	unsigned long		util_rt;
> -	unsigned long		util_irq;
>  	unsigned long		max;
>  
>  	/* The field below is for single-CPU policies only: */
> @@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	return cpufreq_driver_resolve_freq(policy, freq);
>  }
>  
> -static void sugov_get_util(struct sugov_cpu *sg_cpu)
> +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>  	struct rq *rq = cpu_rq(sg_cpu->cpu);
> +	unsigned long util, irq, max;
>  
> -	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> -	sg_cpu->util_cfs = cpu_util_cfs(rq);
> -	sg_cpu->util_dl  = cpu_util_dl(rq);
> -	sg_cpu->bw_dl    = cpu_bw_dl(rq);
> -	sg_cpu->util_rt  = cpu_util_rt(rq);
> -	sg_cpu->util_irq = cpu_util_irq(rq);
> -}
> -
> -static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> -{
> -	struct rq *rq = cpu_rq(sg_cpu->cpu);
> -	unsigned long util, max = sg_cpu->max;
> +	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> +	sg_cpu->bw_dl = cpu_bw_dl(rq);
>  
>  	if (rq->rt.rt_nr_running)
> -		return sg_cpu->max;
> +		return max;
> +
> +	irq = cpu_util_irq(rq);
>  
> -	if (unlikely(sg_cpu->util_irq >= max))
> +	if (unlikely(irq >= max))
>  		return max;
>  
>  	/* Sum rq utilization */
> -	util = sg_cpu->util_cfs;
> -	util += sg_cpu->util_rt;
> +	util = cpu_util_cfs(rq);
> +	util += cpu_util_rt(rq);
>  
>  	/*
>  	 * Interrupt time is not seen by rqs utilization nso we can compare
>  	 * them with the CPU capacity
>  	 */
> -	if ((util + sg_cpu->util_dl) >= max)
> +	if ((util + cpu_util_dl(rq)) >= max)
>  		return max;
>  
>  	/*
> @@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>  	 */
>  
>  	/* Weight rqs utilization to normal context window */
> -	util *= (max - sg_cpu->util_irq);
> +	util *= (max - irq);
>  	util /= max;
>  
>  	/* Add interrupt utilization */
> -	util += sg_cpu->util_irq;
> +	util += irq;
>  
>  	/* Add DL bandwidth requirement */
>  	util += sg_cpu->bw_dl;
> @@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  
>  	busy = sugov_cpu_is_busy(sg_cpu);
>  
> -	sugov_get_util(sg_cpu);
> +	util = sugov_get_util(sg_cpu);
>  	max = sg_cpu->max;
> -	util = sugov_aggregate_util(sg_cpu);
>  	sugov_iowait_apply(sg_cpu, time, &util, &max);
>  	next_f = get_next_freq(sg_policy, util, max);
>  	/*
> @@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>  		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
>  		unsigned long j_util, j_max;
>  
> -		sugov_get_util(j_sg_cpu);
> +		j_util = sugov_get_util(j_sg_cpu);
>  		j_max = j_sg_cpu->max;
> -		j_util = sugov_aggregate_util(j_sg_cpu);
>  		sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);
>  
>  		if (j_util * max > j_max * util) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9438e68..59a633d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2219,7 +2219,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
>  
>  static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
> -	return rq->avg_rt.util_avg;
> +	return READ_ONCE(rq->avg_rt.util_avg);
>  }
>  
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 00/11] track CPU utilization
  2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
  2018-07-05 13:32   ` Vincent Guittot
@ 2018-07-06  6:05   ` Viresh Kumar
  2018-07-06  9:18     ` Peter Zijlstra
  2018-07-15 23:34   ` [tip:sched/core] sched/cpufreq: Clarify sugov_get_util() tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Viresh Kumar @ 2018-07-06  6:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, rjw, juri.lelli,
	dietmar.eggemann, Morten.Rasmussen, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio

On 05-07-18, 14:36, Peter Zijlstra wrote:
> Subject: sched/cpufreq: Clarify sugov_get_util()
> 
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  cpufreq_schedutil.c |   69 ++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 18 deletions(-)
> 
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
>  	return cpufreq_driver_resolve_freq(policy, freq);
>  }
>  
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + *   cpu_util_{cfs,rt,dl,irq}()
> + *   cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued in
> + * the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured meric but a value computed

                                                     metric

> + * based on the task model parameters and gives the minimal u required to meet

                                                               u ?

> + * deadlines.
> + */
>  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>  	struct rq *rq = cpu_rq(sg_cpu->cpu);
> @@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
>  	if (rt_rq_is_runnable(&rq->rt))
>  		return max;
>  
> +	/*
> +	 * Early check to see if IRQ/steal time saturates the CPU, can be
> +	 * because of inaccuracies in how we track these -- see
> +	 * update_irq_load_avg().
> +	 */
>  	irq = cpu_util_irq(rq);
> -
>  	if (unlikely(irq >= max))
>  		return max;
>  
> -	/* Sum rq utilization */
> +	/*
> +	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
> +	 * CFS tasks and we use the same metric to track the effective
> +	 * utilization (PELT windows are synchronized) we can directly add them
> +	 * to obtain the CPU's actual utilization.
> +	 */
>  	util = cpu_util_cfs(rq);
>  	util += cpu_util_rt(rq);
>  
>  	/*
> -	 * Interrupt time is not seen by rqs utilization nso we can compare
> -	 * them with the CPU capacity
> +	 * We do not make cpu_util_dl() a permanent part of this sum because we
> +	 * want to use cpu_bw_dl() later on, but we need to check if the
> +	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
> +	 * f_max when there is no idle time.
> +	 *
> +	 * NOTE: numerical errors or stop class might cause us to not quite hit
> +	 * saturation when we should -- something for later.
>  	 */
>  	if ((util + cpu_util_dl(rq)) >= max)
>  		return max;
>  
>  	/*
> -	 * As there is still idle time on the CPU, we need to compute the
> -	 * utilization level of the CPU.
> +	 * There is still idle time; further improve the number by using the
> +	 * irq metric. Because IRQ/steal time is hidden from the task clock we
> +	 * need to scale the task numbers:
>  	 *
> +	 *              1 - irq
> +	 *   U' = irq + ------- * U
> +	 *                max
> +	 */
> +	util *= (max - irq);
> +	util /= max;
> +	util += irq;
> +
> +	/*
>  	 * Bandwidth required by DEADLINE must always be granted while, for
>  	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
>  	 * to gracefully reduce the frequency when no tasks show up for longer
> @@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
>  	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>  	 * ready for such an interface. So, we only do the latter for now.
>  	 */
> -
> -	/* Weight rqs utilization to normal context window */
> -	util *= (max - irq);
> -	util /= max;
> -
> -	/* Add interrupt utilization */
> -	util += irq;
> -
> -	/* Add DL bandwidth requirement */
> -	util += sg_cpu->bw_dl;
> -
> -	return min(max, util);
> +	return min(max, util + sg_cpu->bw_dl);
>  }
>  

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt
  2018-07-06  6:00   ` Viresh Kumar
@ 2018-07-06  9:14     ` Peter Zijlstra
  2018-07-06  9:21       ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2018-07-06  9:14 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Vincent Guittot, mingo, linux-kernel, rjw, juri.lelli,
	dietmar.eggemann, Morten.Rasmussen, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio, Ingo Molnar

On Fri, Jul 06, 2018 at 11:30:33AM +0530, Viresh Kumar wrote:
> On 28-06-18, 17:45, Vincent Guittot wrote:
> > The time spent under interrupt can be significant but it is not reflected
> > in the utilization of CPU when deciding to choose an OPP. Now that we have
> > access to this metric, schedutil can take it into account when selecting
> > the OPP for a CPU.
> > rqs utilization don't see the time spend under interrupt context and report
> > their value in the normal context time window. We need to compensate this when
> > adding interrupt utilization
> > 
> > The CPU utilization is :
> >   irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> > 
> > A test with iperf on hikey (octo arm64) gives:
> > iperf -c server_address -r -t 5
> > 
> > w/o patch		w/ patch
> > Tx 276 Mbits/sec        304 Mbits/sec +10%
> > Rx 299 Mbits/sec        328 Mbits/sec +09%
> > 
> > 8 iterations
> > stdev is lower than 1%
> > Only WFI idle state is enable (shallowest diel state)

Also s/diel/idle/

> > +	/*
> > +	 * Interrupt time is not seen by rqs utilization nso we can compare
> 
>                                                          nso ?
> 
> > +	 * them with the CPU capacity
> > +	 */

Already fixed ;-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 00/11] track CPU utilization
  2018-07-06  6:05   ` Viresh Kumar
@ 2018-07-06  9:18     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2018-07-06  9:18 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Vincent Guittot, mingo, linux-kernel, rjw, juri.lelli,
	dietmar.eggemann, Morten.Rasmussen, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio

On Fri, Jul 06, 2018 at 11:35:22AM +0530, Viresh Kumar wrote:
> On 05-07-18, 14:36, Peter Zijlstra wrote:
> > +/*
> > + * This function computes an effective utilization for the given CPU, to be
> > + * used for frequency selection given the linear relation: f = u * f_max.
> > + *
> > + * The scheduler tracks the following metrics:
> > + *
> > + *   cpu_util_{cfs,rt,dl,irq}()
> > + *   cpu_bw_dl()
> > + *
> > + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> > + * synchronized windows and are thus directly comparable.
> > + *
> > + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> > + * which excludes things like IRQ and steal-time. These latter are then accrued in
> > + * the irq utilization.
> > + *
> > + * The DL bandwidth number otoh is not a measured meric but a value computed
> 
>                                                      metric

Indeed, fixed.

> > + * based on the task model parameters and gives the minimal u required to meet
> 
>                                                                u ?

utilization, but for lazy people :-) I'll use the whole word.

> > + * deadlines.
> > + */



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt
  2018-07-06  9:14     ` Peter Zijlstra
@ 2018-07-06  9:21       ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-07-06  9:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: viresh kumar, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Dietmar Eggemann, Morten Rasmussen,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar

On Fri, 6 Jul 2018 at 11:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 06, 2018 at 11:30:33AM +0530, Viresh Kumar wrote:
> > On 28-06-18, 17:45, Vincent Guittot wrote:
> > > The time spent under interrupt can be significant but it is not reflected
> > > in the utilization of CPU when deciding to choose an OPP. Now that we have
> > > access to this metric, schedutil can take it into account when selecting
> > > the OPP for a CPU.
> > > rqs utilization don't see the time spend under interrupt context and report
> > > their value in the normal context time window. We need to compensate this when
> > > adding interrupt utilization
> > >
> > > The CPU utilization is :
> > >   irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> > >
> > > A test with iperf on hikey (octo arm64) gives:
> > > iperf -c server_address -r -t 5
> > >
> > > w/o patch           w/ patch
> > > Tx 276 Mbits/sec        304 Mbits/sec +10%
> > > Rx 299 Mbits/sec        328 Mbits/sec +09%
> > >
> > > 8 iterations
> > > stdev is lower than 1%
> > > Only WFI idle state is enable (shallowest diel state)
>
> Also s/diel/idle/
>
> > > +   /*
> > > +    * Interrupt time is not seen by rqs utilization nso we can compare
> >
> >                                                          nso ?
> >
> > > +    * them with the CPU capacity
> > > +    */
>
> Already fixed ;-)

Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()
  2018-06-28 15:45 ` [PATCH 09/11] sched: use pelt for scale_rt_capacity() Vincent Guittot
@ 2018-07-15 22:15   ` Ingo Molnar
  2018-07-15 22:46     ` Joe Perches
  2018-07-16 11:24     ` Vincent Guittot
  2018-07-15 23:32   ` [tip:sched/core] sched/core: Use PELT " tip-bot for Vincent Guittot
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2018-07-15 22:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, viresh.kumar, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio, Ingo Molnar


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> The utilization of the CPU by rt, dl and interrupts are now tracked with
> PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> capacity available for cfs class.
> 
> scale_rt_capacity() behavior has been changed and now returns the remaining
> capacity available for cfs instead of a scaling factor because rt, dl and
> interrupt provide now absolute utilization value.
> 
> The same formula as schedutil is used:
>   irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> but the implementation is different because it doesn't return the same value
> and doesn't benefit of the same optimization
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/deadline.c |  2 --
>  kernel/sched/fair.c     | 41 +++++++++++++++++++----------------------
>  kernel/sched/pelt.c     |  2 +-
>  kernel/sched/rt.c       |  2 --
>  4 files changed, 20 insertions(+), 27 deletions(-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2758e3..ce0dcbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	u64 total, used, age_stamp, avg;
> -	s64 delta;
> -
> -	/*
> -	 * Since we're reading these variables without serialization make sure
> -	 * we read them once before doing sanity checks on them.
> -	 */
> -	age_stamp = READ_ONCE(rq->age_stamp);
> -	avg = READ_ONCE(rq->rt_avg);
> -	delta = __rq_clock_broken(rq) - age_stamp;
> +	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> +	unsigned long used, irq, free;
>  
> -	if (unlikely(delta < 0))
> -		delta = 0;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +	irq = READ_ONCE(rq->avg_irq.util_avg);
>  
> -	total = sched_avg_period() + delta;
> +	if (unlikely(irq >= max))
> +		return 1;
> +#endif

Note that 'irq' is unused outside that macro block, resulting in a new warning on 
defconfig builds:

 CC      kernel/sched/fair.o
 kernel/sched/fair.c: In function ‘scale_rt_capacity’:
 kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
   unsigned long used, irq, free;
                       ^~~

I have applied the delta fix below for simplicity, but what we really want is a 
cleanup of that function to eliminate the #ifdefs. One solution would be to factor 
out the 'irq' utilization value into a helper inline, and double check that if the 
configs are off the compiler does the right thing and eliminates this identity 
transformation for the irq==0 case:

        free *= (max - irq);
        free /= max;

If the compiler refuses to optimize this away (due to the zero and overflow 
cases), try to find something more clever?

Thanks,

	Ingo

 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3221db0511a..d5f7d521e448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
-	unsigned long used, irq, free;
+	unsigned long used, free;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	unsigned long irq;
+#endif
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	irq = READ_ONCE(rq->avg_irq.util_avg);
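
For illustration, the helper asked for above could take roughly the following
shape (an editorial sketch with assumed names, not a patch from this thread).
The !CONFIG variant returns a constant 0, so the (max - irq) / max scaling
becomes the identity and should be folded away by the compiler:

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
static inline unsigned long cpu_util_irq(struct rq *rq)
{
	return READ_ONCE(rq->avg_irq.util_avg);
}
#else
static inline unsigned long cpu_util_irq(struct rq *rq)
{
	return 0;
}
#endif

static unsigned long scale_rt_capacity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
	unsigned long used, free, irq;

	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return 1;

	/* rt and dl utilization is what is stolen from cfs */
	used = READ_ONCE(rq->avg_rt.util_avg);
	used += READ_ONCE(rq->avg_dl.util_avg);
	if (unlikely(used >= max))
		return 1;

	free = max - used;

	/* identity transformation when cpu_util_irq() is constant 0 */
	free *= (max - irq);
	free /= max;

	return free;
}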

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()
  2018-07-15 22:15   ` Ingo Molnar
@ 2018-07-15 22:46     ` Joe Perches
  2018-07-16 11:24     ` Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: Joe Perches @ 2018-07-15 22:46 UTC (permalink / raw)
  To: Ingo Molnar, Vincent Guittot
  Cc: peterz, linux-kernel, rjw, juri.lelli, dietmar.eggemann,
	Morten.Rasmussen, viresh.kumar, valentin.schneider,
	patrick.bellasi, joel, daniel.lezcano, quentin.perret,
	luca.abeni, claudio, Ingo Molnar

On Mon, 2018-07-16 at 00:15 +0200, Ingo Molnar wrote:
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> > The utilization of the CPU by rt, dl and interrupts are now tracked with
> > PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> > capacity available for cfs class.
> > 
> > scale_rt_capacity() behavior has been changed and now returns the remaining
> > capacity available for cfs instead of a scaling factor because rt, dl and
> > interrupt provide now absolute utilization value.
> > 
> > The same formula as schedutil is used:
> >   irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> > but the implementation is different because it doesn't return the same value
> > and doesn't benefit of the same optimization
[]
> I have applied the delta fix below for simplicity, but what we really want is a 
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor 
> out the 'irq' utilization value into a helper inline, and double check that if the 
> configs are off the compiler does the right thing and eliminates this identity 
> transformation for the irq==0 case:
[]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> -	unsigned long used, irq, free;
> +	unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +	unsigned long irq;
> +#endif
>  
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Perhaps combine these two #if defined blocks into
a single block

>  	irq = READ_ONCE(rq->avg_irq.util_avg);
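
Spelled out, the single combined block suggested here would look roughly like
this (sketch only, not an actual patch). Since #if/#endif is a preprocessor
conditional rather than a C scope, 'irq' declared this way remains visible to
any later code guarded by the same condition:

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
	unsigned long irq = READ_ONCE(rq->avg_irq.util_avg);

	if (unlikely(irq >= max))
		return 1;
#endif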

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/pelt: Move PELT related code in a dedicated file
  2018-06-28 15:45 ` [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file Vincent Guittot
@ 2018-07-15 23:26   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, torvalds, vincent.guittot, peterz, hpa, linux-kernel

Commit-ID:  c079629862b20c101e8336362a8b042ec7d942fe
Gitweb:     https://git.kernel.org/tip/c079629862b20c101e8336362a8b042ec7d942fe
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:04 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/pelt: Move PELT related code in a dedicated file

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization from cfs tasks and make them appear lighter than they are.
As we want to use the same load tracking mechanism for both and avoid
useless dependencies between cfs and rt code, the PELT code is moved into
a dedicated file.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/Makefile |   2 +-
 kernel/sched/fair.c   | 333 +-------------------------------------------------
 kernel/sched/pelt.c   | 311 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h   |  43 +++++++
 kernel/sched/sched.h  |  19 +++
 5 files changed, 375 insertions(+), 333 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..7fe183404c38 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08b89ae34233..39ab46cea6c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -255,9 +255,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
 	return cfs_rq->rq;
 }
 
-/* An entity is a task if it doesn't "own" a runqueue */
-#define entity_is_task(se)	(!se->my_q)
-
 static inline struct task_struct *task_of(struct sched_entity *se)
 {
 	SCHED_WARN_ON(!entity_is_task(se));
@@ -419,7 +416,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
 	return container_of(cfs_rq, struct rq, cfs);
 }
 
-#define entity_is_task(se)	1
 
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
@@ -692,7 +688,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-
+#include "pelt.h"
 #include "sched-pelt.h"
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
@@ -2751,19 +2747,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 } while (0)
 
 #ifdef CONFIG_SMP
-/*
- * XXX we want to get rid of these helpers and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
-	return scale_load_down(se->load.weight);
-}
-
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
-
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -3064,314 +3047,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
 }
 
 #ifdef CONFIG_SMP
-/*
- * Approximate:
- *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
- */
-static u64 decay_load(u64 val, u64 n)
-{
-	unsigned int local_n;
-
-	if (unlikely(n > LOAD_AVG_PERIOD * 63))
-		return 0;
-
-	/* after bounds checking we can collapse to 32-bit */
-	local_n = n;
-
-	/*
-	 * As y^PERIOD = 1/2, we can combine
-	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
-	 * With a look-up table which covers y^n (n<PERIOD)
-	 *
-	 * To achieve constant time decay_load.
-	 */
-	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
-		val >>= local_n / LOAD_AVG_PERIOD;
-		local_n %= LOAD_AVG_PERIOD;
-	}
-
-	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
-	return val;
-}
-
-static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
-{
-	u32 c1, c2, c3 = d3; /* y^0 == 1 */
-
-	/*
-	 * c1 = d1 y^p
-	 */
-	c1 = decay_load((u64)d1, periods);
-
-	/*
-	 *            p-1
-	 * c2 = 1024 \Sum y^n
-	 *            n=1
-	 *
-	 *              inf        inf
-	 *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
-	 *              n=0        n=p
-	 */
-	c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
-
-	return c1 + c2 + c3;
-}
-
-/*
- * Accumulate the three separate parts of the sum; d1 the remainder
- * of the last (incomplete) period, d2 the span of full periods and d3
- * the remainder of the (incomplete) current period.
- *
- *           d1          d2           d3
- *           ^           ^            ^
- *           |           |            |
- *         |<->|<----------------->|<--->|
- * ... |---x---|------| ... |------|-----x (now)
- *
- *                           p-1
- * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
- *                           n=1
- *
- *    = u y^p +					(Step 1)
- *
- *                     p-1
- *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
- *                     n=1
- */
-static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
-{
-	unsigned long scale_freq, scale_cpu;
-	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
-	u64 periods;
-
-	scale_freq = arch_scale_freq_capacity(cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
-	delta += sa->period_contrib;
-	periods = delta / 1024; /* A period is 1024us (~1ms) */
-
-	/*
-	 * Step 1: decay old *_sum if we crossed period boundaries.
-	 */
-	if (periods) {
-		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
-		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
-
-		/*
-		 * Step 2
-		 */
-		delta %= 1024;
-		contrib = __accumulate_pelt_segments(periods,
-				1024 - sa->period_contrib, delta);
-	}
-	sa->period_contrib = delta;
-
-	contrib = cap_scale(contrib, scale_freq);
-	if (load)
-		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
-	if (running)
-		sa->util_sum += contrib * scale_cpu;
-
-	return periods;
-}
-
-/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series.  To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
- *
- * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
- *      p0            p1           p2
- *     (now)       (~1ms ago)  (~2ms ago)
- *
- * Let u_i denote the fraction of p_i that the entity was runnable.
- *
- * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
- *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
- *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- *   y^32 = 0.5
- *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
- *
- * When a period "rolls over" and we have new u_0`, multiplying the previous
- * sum again by y is sufficient to update:
- *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
- */
-static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
-{
-	u64 delta;
-
-	delta = now - sa->last_update_time;
-	/*
-	 * This should only happen when time goes backwards, which it
-	 * unfortunately does during sched clock init when we swap over to TSC.
-	 */
-	if ((s64)delta < 0) {
-		sa->last_update_time = now;
-		return 0;
-	}
-
-	/*
-	 * Use 1024ns as the unit of measurement since it's a reasonable
-	 * approximation of 1us and fast to compute.
-	 */
-	delta >>= 10;
-	if (!delta)
-		return 0;
-
-	sa->last_update_time += delta << 10;
-
-	/*
-	 * running is a subset of runnable (weight) so running can't be set if
-	 * runnable is clear. But there are some corner cases where the current
-	 * se has been already dequeued but cfs_rq->curr still points to it.
-	 * This means that weight will be 0 but not running for a sched_entity
-	 * but also for a cfs_rq if the latter becomes idle. As an example,
-	 * this happens during idle_balance() which calls
-	 * update_blocked_averages()
-	 */
-	if (!load)
-		runnable = running = 0;
-
-	/*
-	 * Now we know we crossed measurement unit boundaries. The *_avg
-	 * accrues by two steps:
-	 *
-	 * Step 1: accumulate *_sum since last_update_time. If we haven't
-	 * crossed period boundaries, finish.
-	 */
-	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
-		return 0;
-
-	return 1;
-}
-
-static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
-{
-	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
-
-	/*
-	 * Step 2: update *_avg.
-	 */
-	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
-	sa->util_avg = sa->util_sum / divider;
-}
-
-/*
- * When a task is dequeued, its estimated utilization should not be update if
- * its util_avg has not been updated at least once.
- * This flag is used to synchronize util_avg updates with util_est updates.
- * We map this information into the LSB bit of the utilization saved at
- * dequeue time (i.e. util_est.dequeued).
- */
-#define UTIL_AVG_UNCHANGED 0x1
-
-static inline void cfs_se_util_change(struct sched_avg *avg)
-{
-	unsigned int enqueued;
-
-	if (!sched_feat(UTIL_EST))
-		return;
-
-	/* Avoid store if the flag has been already set */
-	enqueued = avg->util_est.enqueued;
-	if (!(enqueued & UTIL_AVG_UNCHANGED))
-		return;
-
-	/* Reset flag to report util_avg has been updated */
-	enqueued &= ~UTIL_AVG_UNCHANGED;
-	WRITE_ONCE(avg->util_est.enqueued, enqueued);
-}
-
-/*
- * sched_entity:
- *
- *   task:
- *     se_runnable() == se_weight()
- *
- *   group: [ see update_cfs_group() ]
- *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
- *   load_sum := runnable_sum
- *   load_avg = se_weight(se) * runnable_avg
- *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
- * cfq_rs:
- *
- *   load_sum = \Sum se_weight(se) * se->avg.load_sum
- *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
- */
-
-static int
-__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
-{
-	if (entity_is_task(se))
-		se->runnable_weight = se->load.weight;
-
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
-		return 1;
-	}
-
-	return 0;
-}
-
-static int
-__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	if (entity_is_task(se))
-		se->runnable_weight = se->load.weight;
-
-	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
-
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
-		cfs_se_util_change(&se->avg);
-		return 1;
-	}
-
-	return 0;
-}
-
-static int
-__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
-{
-	if (___update_load_sum(now, cpu, &cfs_rq->avg,
-				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
-				cfs_rq->curr != NULL)) {
-
-		___update_load_avg(&cfs_rq->avg, 1, 1);
-		return 1;
-	}
-
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /**
  * update_tg_load_avg - update the tg's load avg
@@ -4039,12 +3714,6 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 
 #else /* CONFIG_SMP */
 
-static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
-{
-	return 0;
-}
-
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
new file mode 100644
index 000000000000..e6ecbb2b8698
--- /dev/null
+++ b/kernel/sched/pelt.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per Entity Load Tracking
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
+ *
+ *  Interactivity improvements by Mike Galbraith
+ *  (C) 2007 Mike Galbraith <efault@gmx.de>
+ *
+ *  Various enhancements by Dmitry Adamushko.
+ *  (C) 2007 Dmitry Adamushko <dmitry.adamushko@gmail.com>
+ *
+ *  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  Copyright IBM Corporation, 2007
+ *  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
+ *
+ *  Scaled math optimizations by Thomas Gleixner
+ *  Copyright (C) 2007, Thomas Gleixner <tglx@linutronix.de>
+ *
+ *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
+ *
+ *  Move PELT related code from fair.c into this pelt.c file
+ *  Author: Vincent Guittot <vincent.guittot@linaro.org>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "sched-pelt.h"
+#include "pelt.h"
+
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static u64 decay_load(u64 val, u64 n)
+{
+	unsigned int local_n;
+
+	if (unlikely(n > LOAD_AVG_PERIOD * 63))
+		return 0;
+
+	/* after bounds checking we can collapse to 32-bit */
+	local_n = n;
+
+	/*
+	 * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+	 * With a look-up table which covers y^n (n<PERIOD)
+	 *
+	 * To achieve constant time decay_load.
+	 */
+	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+		val >>= local_n / LOAD_AVG_PERIOD;
+		local_n %= LOAD_AVG_PERIOD;
+	}
+
+	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+	return val;
+}
+
+static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
+{
+	u32 c1, c2, c3 = d3; /* y^0 == 1 */
+
+	/*
+	 * c1 = d1 y^p
+	 */
+	c1 = decay_load((u64)d1, periods);
+
+	/*
+	 *            p-1
+	 * c2 = 1024 \Sum y^n
+	 *            n=1
+	 *
+	 *              inf        inf
+	 *    = 1024 ( \Sum y^n - \Sum y^n - y^0 )
+	 *              n=0        n=p
+	 */
+	c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
+
+	return c1 + c2 + c3;
+}
+
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
+/*
+ * Accumulate the three separate parts of the sum; d1 the remainder
+ * of the last (incomplete) period, d2 the span of full periods and d3
+ * the remainder of the (incomplete) current period.
+ *
+ *           d1          d2           d3
+ *           ^           ^            ^
+ *           |           |            |
+ *         |<->|<----------------->|<--->|
+ * ... |---x---|------| ... |------|-----x (now)
+ *
+ *                           p-1
+ * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
+ *                           n=1
+ *
+ *    = u y^p +					(Step 1)
+ *
+ *                     p-1
+ *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
+ *                     n=1
+ */
+static __always_inline u32
+accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+	       unsigned long load, unsigned long runnable, int running)
+{
+	unsigned long scale_freq, scale_cpu;
+	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
+	u64 periods;
+
+	scale_freq = arch_scale_freq_capacity(cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	delta += sa->period_contrib;
+	periods = delta / 1024; /* A period is 1024us (~1ms) */
+
+	/*
+	 * Step 1: decay old *_sum if we crossed period boundaries.
+	 */
+	if (periods) {
+		sa->load_sum = decay_load(sa->load_sum, periods);
+		sa->runnable_load_sum =
+			decay_load(sa->runnable_load_sum, periods);
+		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+
+		/*
+		 * Step 2
+		 */
+		delta %= 1024;
+		contrib = __accumulate_pelt_segments(periods,
+				1024 - sa->period_contrib, delta);
+	}
+	sa->period_contrib = delta;
+
+	contrib = cap_scale(contrib, scale_freq);
+	if (load)
+		sa->load_sum += load * contrib;
+	if (runnable)
+		sa->runnable_load_sum += runnable * contrib;
+	if (running)
+		sa->util_sum += contrib * scale_cpu;
+
+	return periods;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p2
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ *   y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+		  unsigned long load, unsigned long runnable, int running)
+{
+	u64 delta;
+
+	delta = now - sa->last_update_time;
+	/*
+	 * This should only happen when time goes backwards, which it
+	 * unfortunately does during sched clock init when we swap over to TSC.
+	 */
+	if ((s64)delta < 0) {
+		sa->last_update_time = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024ns as the unit of measurement since it's a reasonable
+	 * approximation of 1us and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+
+	sa->last_update_time += delta << 10;
+
+	/*
+	 * running is a subset of runnable (weight) so running can't be set if
+	 * runnable is clear. But there are some corner cases where the current
+	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * This means that weight will be 0 but not running for a sched_entity
+	 * but also for a cfs_rq if the latter becomes idle. As an example,
+	 * this happens during idle_balance() which calls
+	 * update_blocked_averages()
+	 */
+	if (!load)
+		runnable = running = 0;
+
+	/*
+	 * Now we know we crossed measurement unit boundaries. The *_avg
+	 * accrues by two steps:
+	 *
+	 * Step 1: accumulate *_sum since last_update_time. If we haven't
+	 * crossed period boundaries, finish.
+	 */
+	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+		return 0;
+
+	return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+{
+	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
+	/*
+	 * Step 2: update *_avg.
+	 */
+	sa->load_avg = div_u64(load * sa->load_sum, divider);
+	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
+	sa->util_avg = sa->util_sum / divider;
+}
+
+/*
+ * sched_entity:
+ *
+ *   task:
+ *     se_runnable() == se_weight()
+ *
+ *   group: [ see update_cfs_group() ]
+ *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
+ *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
+ *   load_sum := runnable_sum
+ *   load_avg = se_weight(se) * runnable_avg
+ *
+ *   runnable_load_sum := runnable_sum
+ *   runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
+ * cfq_rq:
+ *
+ *   load_sum = \Sum se_weight(se) * se->avg.load_sum
+ *   load_avg = \Sum se->avg.load_avg
+ *
+ *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ *   runnable_load_avg = \Sum se->avg.runable_load_avg
+ */
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		return 1;
+	}
+
+	return 0;
+}
+
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		se->runnable_weight = se->load.weight;
+
+	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+				cfs_rq->curr == se)) {
+
+		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		cfs_se_util_change(&se->avg);
+		return 1;
+	}
+
+	return 0;
+}
+
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+	if (___update_load_sum(now, cpu, &cfs_rq->avg,
+				scale_load_down(cfs_rq->load.weight),
+				scale_load_down(cfs_rq->runnable_weight),
+				cfs_rq->curr != NULL)) {
+
+		___update_load_avg(&cfs_rq->avg, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
new file mode 100644
index 000000000000..9cac73efd64a
--- /dev/null
+++ b/kernel/sched/pelt.h
@@ -0,0 +1,43 @@
+#ifdef CONFIG_SMP
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+
+/*
+ * When a task is dequeued, its estimated utilization should not be update if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+	unsigned int enqueued;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Avoid store if the flag has been already set */
+	enqueued = avg->util_est.enqueued;
+	if (!(enqueued & UTIL_AVG_UNCHANGED))
+		return;
+
+	/* Reset flag to report util_avg has been updated */
+	enqueued &= ~UTIL_AVG_UNCHANGED;
+	WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
+#else
+
+static inline int
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+#endif
+
+
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..00d6f2594c4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -673,7 +673,26 @@ struct dl_rq {
 	u64			bw_ratio;
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se)	(!se->my_q)
+#else
+#define entity_is_task(se)	1
+#endif
+
 #ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+	return scale_load_down(se->load.weight);
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+	return scale_load_down(se->runnable_weight);
+}
 
 static inline bool sched_asym_prefer(int a, int b)
 {

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/rt: Add rt_rq utilization tracking
  2018-06-28 15:45 ` [PATCH 02/11] sched/rt: add rt_rq utilization tracking Vincent Guittot
@ 2018-07-15 23:27   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, hpa, torvalds, mingo, linux-kernel, vincent.guittot

Commit-ID:  371bf42732694d142b0de026e152266c039b97d3
Gitweb:     https://git.kernel.org/tip/371bf42732694d142b0de026e152266c039b97d3
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/rt: Add rt_rq utilization tracking

The schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS
tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks
are preempted by RT tasks and in this case util_avg reflects the remaining
capacity but not what CFS wants to use. In such a case, schedutil can select a
lower OPP even though the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of RT tasks.
Only util_avg is tracked; load_avg and runnable_load_avg are not, as they
are useless for rt_rq.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task, but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 15 ++++++++++++++-
 kernel/sched/pelt.c  | 25 +++++++++++++++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/rt.c    | 13 +++++++++++++
 kernel/sched/sched.h |  7 +++++++
 5 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39ab46cea6c5..5b453213cd18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+	if (READ_ONCE(rq->avg_rt.util_avg))
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	/* Don't need periodic decay once load/util_avg are null */
+	if (rt_rq_has_blocked(rq))
+		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
@@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index e6ecbb2b8698..a00b1ba3dd5b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -309,3 +309,28 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 
 	return 0;
 }
+
+/*
+ * rt_rq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ *   load_avg and runnable_load_avg are not supported and meaningless.
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+				running,
+				running,
+				running)) {
+
+		___update_load_avg(&rq->avg_rt, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73efd64a..b2983b741d57 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@
 int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
 int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	return 0;
+}
+
 #endif
 
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 572567078b60..0dc8ad1915e6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@
  */
 #include "sched.h"
 
+#include "pelt.h"
+
 int sched_rr_timeslice = RR_TIMESLICE;
 int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
 
@@ -1576,6 +1578,14 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	rt_queue_push_tasks(rq);
 
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+
 	return p;
 }
 
@@ -1583,6 +1593,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
@@ -2312,6 +2324,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
+	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 00d6f2594c4e..405dd9ba6b39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct rt_rq {
 	unsigned long		rt_nr_total;
 	int			overloaded;
 	struct plist_head	pushable_tasks;
+
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
@@ -854,6 +855,7 @@ struct rq {
 
 	u64			rt_avg;
 	u64			age_stamp;
+	struct sched_avg	avg_rt;
 	u64			idle_stamp;
 	u64			avg_idle;
 
@@ -2212,4 +2214,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 	return util;
 }
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+	return rq->avg_rt.util_avg;
+}
 #endif

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] cpufreq/schedutil: Use RT utilization tracking
  2018-06-28 15:45 ` [PATCH 03/11] cpufreq/schedutil: use rt " Vincent Guittot
  2018-07-06  5:56   ` Viresh Kumar
@ 2018-07-15 23:27   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, mingo, viresh.kumar, tglx,
	vincent.guittot, hpa, peterz

Commit-ID:  3ae117c6cd7c4783819a0766aa97b9493a8a0f62
Gitweb:     https://git.kernel.org/tip/3ae117c6cd7c4783819a0766aa97b9493a8a0f62
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:06 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

cpufreq/schedutil: Use RT utilization tracking

Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT
can preempt and steal CFS's running time.

RT util_avg is used to take into account the utilization of RT tasks
on the CPU when selecting the OPP. If an RT task migrates, the RT utilization
will not migrate but will decay over time. On an overloaded CPU, the CFS
utilization reflects the remaining utilization available on the CPU. When an
RT task migrates away, the CFS utilization increases as tasks start to use
the newly available capacity. At the same pace, the RT utilization decays,
and both variations compensate each other to keep the overall utilization
unchanged and prevent any OPP drop.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c907fde01eaa..da29b5a33adb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
 	/* The fields below are only needed when sharing a policy: */
 	unsigned long		util_cfs;
 	unsigned long		util_dl;
+	unsigned long		util_rt;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -186,15 +187,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
 	sg_cpu->util_cfs = cpu_util_cfs(rq);
 	sg_cpu->util_dl  = cpu_util_dl(rq);
+	sg_cpu->util_rt  = cpu_util_rt(rq);
 }
 
 static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util;
 
 	if (rt_rq_is_runnable(&rq->rt))
 		return sg_cpu->max;
 
+	util = sg_cpu->util_dl;
+	util += sg_cpu->util_cfs;
+	util += sg_cpu->util_rt;
+
 	/*
 	 * Utilization required by DEADLINE must always be granted while, for
 	 * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
@@ -205,7 +212,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
-	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+	return min(sg_cpu->max, util);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/dl: Add dl_rq utilization tracking
  2018-06-28 15:45 ` [PATCH 04/11] sched/dl: add dl_rq " Vincent Guittot
@ 2018-07-15 23:28   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, peterz, vincent.guittot, torvalds, tglx, linux-kernel, mingo

Commit-ID:  3727e0e16340cbdf83818f5bf0113505c6876057
Gitweb:     https://git.kernel.org/tip/3727e0e16340cbdf83818f5bf0113505c6876057
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:07 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/dl: Add dl_rq utilization tracking

Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
tasks and the CFS utilization might no longer describe the real
utilization level.

The current DL bandwidth reflects the requirements to meet deadlines when
tasks are enqueued, but not the current utilization of the DL sched class.
We track the DL class utilization to estimate the system utilization.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c |  6 ++++++
 kernel/sched/fair.c     | 11 ++++++++---
 kernel/sched/pelt.c     | 23 +++++++++++++++++++++++
 kernel/sched/pelt.h     |  6 ++++++
 kernel/sched/sched.h    |  1 +
 5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fbfc3f1d368a..f4de26982d80 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,7 @@
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
 #include "sched.h"
+#include "pelt.h"
 
 struct dl_bandwidth def_dl_bandwidth;
 
@@ -1761,6 +1762,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	deadline_queue_push_tasks(rq);
 
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+
 	return p;
 }
 
@@ -1768,6 +1772,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
@@ -1784,6 +1789,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 {
 	update_curr_dl(rq);
 
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
 	/*
 	 * Even when we have runtime, update_curr_dl() might have resulted in us
 	 * not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b453213cd18..f096275c7df2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,11 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
-static inline bool rt_rq_has_blocked(struct rq *rq)
+static inline bool others_rqs_have_blocked(struct rq *rq)
 {
 	if (READ_ONCE(rq->avg_rt.util_avg))
 		return true;
 
+	if (READ_ONCE(rq->avg_dl.util_avg))
+		return true;
+
 	return false;
 }
 
@@ -7358,8 +7361,9 @@ static void update_blocked_averages(int cpu)
 			done = false;
 	}
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
-	if (rt_rq_has_blocked(rq))
+	if (others_rqs_have_blocked(rq))
 		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -7427,9 +7431,10 @@ static inline void update_blocked_averages(int cpu)
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a00b1ba3dd5b..8b78b6320cda 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -334,3 +334,26 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 
 	return 0;
 }
+
+/*
+ * dl_rq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+				running,
+				running,
+				running)) {
+
+		___update_load_avg(&rq->avg_dl, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index b2983b741d57..0e4f912461ad 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -4,6 +4,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
 int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
 /*
  * When a task is dequeued, its estimated utilization should not be update if
@@ -45,6 +46,11 @@ update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+static inline int
+update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 405dd9ba6b39..ab8b5296b5f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,6 +856,7 @@ struct rq {
 	u64			rt_avg;
 	u64			age_stamp;
 	struct sched_avg	avg_rt;
+	struct sched_avg	avg_dl;
 	u64			idle_stamp;
 	u64			avg_idle;
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] cpufreq/schedutil: Use DL utilization tracking
  2018-06-28 15:45 ` [PATCH 05/11] cpufreq/schedutil: use dl " Vincent Guittot
  2018-07-06  5:59   ` Viresh Kumar
@ 2018-07-15 23:28   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, vincent.guittot, linux-kernel, torvalds, tglx, mingo,
	peterz, viresh.kumar

Commit-ID:  8cc90515a4fa419ccfc4703ff127699cdcb96839
Gitweb:     https://git.kernel.org/tip/8cc90515a4fa419ccfc4703ff127699cdcb96839
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:08 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

cpufreq/schedutil: Use DL utilization tracking

Now that we have both the DL class bandwidth requirement and the DL class
utilization, we can detect when the CPU is fully used and should run at max.
Otherwise, we keep using the DL bandwidth requirement to define the
utilization of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
 kernel/sched/sched.h             |  7 ++++++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index da29b5a33adb..07760bc7f69a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
 	/* The fields below are only needed when sharing a policy: */
 	unsigned long		util_cfs;
 	unsigned long		util_dl;
+	unsigned long		bw_dl;
 	unsigned long		util_rt;
 	unsigned long		max;
 
@@ -187,6 +188,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
 	sg_cpu->util_cfs = cpu_util_cfs(rq);
 	sg_cpu->util_dl  = cpu_util_dl(rq);
+	sg_cpu->bw_dl    = cpu_bw_dl(rq);
 	sg_cpu->util_rt  = cpu_util_rt(rq);
 }
 
@@ -198,20 +200,29 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	if (rt_rq_is_runnable(&rq->rt))
 		return sg_cpu->max;
 
-	util = sg_cpu->util_dl;
-	util += sg_cpu->util_cfs;
+	util = sg_cpu->util_cfs;
 	util += sg_cpu->util_rt;
 
+	if ((util + sg_cpu->util_dl) >= sg_cpu->max)
+		return sg_cpu->max;
+
 	/*
-	 * Utilization required by DEADLINE must always be granted while, for
-	 * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
-	 * gracefully reduce the frequency when no tasks show up for longer
+	 * As there is still idle time on the CPU, we need to compute the
+	 * utilization level of the CPU.
+	 *
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
 	 * periods of time.
 	 *
 	 * Ideally we would like to set util_dl as min/guaranteed freq and
 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
+
+	/* Add DL bandwidth requirement */
+	util += sg_cpu->bw_dl;
+
 	return min(sg_cpu->max, util);
 }
 
@@ -367,7 +378,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
  */
 static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
 {
-	if (cpu_util_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->util_dl)
+	if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
 		sg_policy->need_freq_update = true;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab8b5296b5f6..9028f268f867 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2199,11 +2199,16 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-static inline unsigned long cpu_util_dl(struct rq *rq)
+static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
 }
 
+static inline unsigned long cpu_util_dl(struct rq *rq)
+{
+	return READ_ONCE(rq->avg_dl.util_avg);
+}
+
 static inline unsigned long cpu_util_cfs(struct rq *rq)
 {
 	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/irq: Add IRQ utilization tracking
  2018-06-28 15:45 ` [PATCH 06/11] sched/irq: add irq " Vincent Guittot
@ 2018-07-15 23:29   ` tip-bot for Vincent Guittot
  2018-07-26  3:09   ` [PATCH 06/11] sched/irq: add irq " Wanpeng Li
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.guittot, mingo, torvalds, linux-kernel, hpa, tglx, peterz

Commit-ID:  91c27493e78df6849baaa21a9d66e26de8b875c0
Gitweb:     https://git.kernel.org/tip/91c27493e78df6849baaa21a9d66e26de8b875c0
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:09 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

sched/irq: Add IRQ utilization tracking

Interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for the sched classes, we can use PELT to track their average
utilization of the CPU. But unlike the sched classes, we don't track when
entering/leaving interrupt; instead, we take into account the time spent
under interrupt context when we update the rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for the interrupt time during the update.

It's also important to note that, because:

  rq_clock == rq_clock_task + interrupt time

and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The interrupt
utilization gives a more accurate level of utilization of the CPU.

The CPU utilization is:

  avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and negligible, so using the
approximation CPU utilization = /Sum avg_rq was enough.
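
As a quick worked example with illustrative numbers: for a max capacity of
1024, avg_irq = 128 and /Sum avg_rq = 512, the formula gives

  128 + (1 - 128/1024) * 512 = 128 + 448 = 576

i.e. the rq-side utilization is scaled down by the 12.5% of CPU time consumed
in interrupt context before the irq contribution is added back.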

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  4 +++-
 kernel/sched/fair.c  | 13 ++++++++++---
 kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h  | 16 ++++++++++++++++
 kernel/sched/sched.h |  3 +++
 5 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..38107a95baca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -17,6 +17,8 @@
 #include "../workqueue_internal.h"
 #include "../smpboot.h"
 
+#include "pelt.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
@@ -185,7 +187,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-		sched_rt_avg_update(rq, irq_delta + steal);
+		update_irq_load_avg(rq, irq_delta + steal);
 #endif
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f096275c7df2..c2782b29c79f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,7 +7290,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
 	return false;
 }
 
-static inline bool others_rqs_have_blocked(struct rq *rq)
+static inline bool others_have_blocked(struct rq *rq)
 {
 	if (READ_ONCE(rq->avg_rt.util_avg))
 		return true;
@@ -7298,6 +7298,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
 	if (READ_ONCE(rq->avg_dl.util_avg))
 		return true;
 
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	if (READ_ONCE(rq->avg_irq.util_avg))
+		return true;
+#endif
+
 	return false;
 }
 
@@ -7362,8 +7367,9 @@ static void update_blocked_averages(int cpu)
 	}
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
-	if (others_rqs_have_blocked(rq))
+	if (others_have_blocked(rq))
 		done = false;
 
 #ifdef CONFIG_NO_HZ_COMMON
@@ -7432,9 +7438,10 @@ static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
 	update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+	update_irq_load_avg(rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
+	if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 8b78b6320cda..ead6d8b4a8b8 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 
 	return 0;
 }
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+/*
+ * irq:
+ *
+ *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ *   util_sum = cpu_scale * load_sum
+ *   runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+	int ret = 0;
+	/*
+	 * We know the time that has been used by interrupts since the last
+	 * update, but we don't know when. Let's be pessimistic and assume the
+	 * interrupts happened just before the update. This is not so far from
+	 * reality because an interrupt will most probably wake up a task and
+	 * trigger an update of the rq clock, during which the metric is updated.
+	 * We start by decaying with the normal context time and then we add
+	 * the interrupt context time.
+	 * We can safely remove running from rq->clock because
+	 * rq->clock += delta with delta >= running
+	 */
+	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+				0,
+				0,
+				0);
+	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+				1,
+				1,
+				1);
+
+	if (ret)
+		___update_load_avg(&rq->avg_irq, 1, 1);
+
+	return ret;
+}
+#endif
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912461ad..d2894db28955 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+int update_irq_load_avg(struct rq *rq, u64 running);
+#else
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+	return 0;
+}
+#endif
+
 /*
  * When a task is dequeued, its estimated utilization should not be update if
  * its util_avg has not been updated at least once.
@@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	return 0;
 }
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9028f268f867..b26d0c9948dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -857,6 +857,9 @@ struct rq {
 	u64			age_stamp;
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	struct sched_avg	avg_irq;
+#endif
 	u64			idle_stamp;
 	u64			avg_idle;
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] cpufreq/schedutil: Take time spent in interrupts into account
  2018-06-28 15:45 ` [PATCH 07/11] cpufreq/schedutil: take into account interrupt Vincent Guittot
  2018-07-06  6:00   ` Viresh Kumar
@ 2018-07-15 23:29   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: viresh.kumar, peterz, tglx, torvalds, hpa, linux-kernel,
	vincent.guittot, mingo

Commit-ID:  9033ea11889f88f243445495f72441e22256d5e9
Gitweb:     https://git.kernel.org/tip/9033ea11889f88f243445495f72441e22256d5e9
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:10 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

cpufreq/schedutil: Take time spent in interrupts into account

The time spent executing IRQ handlers can be significant, but it is not
reflected in the utilization of the CPU when deciding to choose an OPP. Now
that we have access to this metric, schedutil can take it into account when
selecting the OPP for a CPU.

The rqs' utilization doesn't see the time spent under interrupt context and
reports its value over the normal context time window. We need to compensate
for this when adding the interrupt utilization.

The CPU utilization is:

  IRQ util_avg + (1 - IRQ util_avg / max capacity) * \Sum rq util_avg

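As a rough worked example (numbers purely illustrative): with a max
capacity of 1024, an IRQ util_avg of 205 (~20%) and rq util_avg summing
to 409 (~40%), the compensated value is 205 + (1024 - 205) * 409 / 1024,
i.e. about 532 (~52%), instead of the ~40% the rqs alone report.
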
A test with iperf on hikey (octo arm64) gives the following speedup:

 iperf -c server_address -r -t 5

 w/o patch		w/ patch
 Tx 276 Mbits/sec	304 Mbits/sec +10%
 Rx 299 Mbits/sec	328 Mbits/sec  +9%

 8 iterations
 stdev is lower than 1%

Only WFI idle state is enabled (shallowest idle state).

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
 kernel/sched/sched.h             | 13 +++++++++++++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 07760bc7f69a..7016bde9d194 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -58,6 +58,7 @@ struct sugov_cpu {
 	unsigned long		util_dl;
 	unsigned long		bw_dl;
 	unsigned long		util_rt;
+	unsigned long		util_irq;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	sg_cpu->util_dl  = cpu_util_dl(rq);
 	sg_cpu->bw_dl    = cpu_bw_dl(rq);
 	sg_cpu->util_rt  = cpu_util_rt(rq);
+	sg_cpu->util_irq = cpu_util_irq(rq);
 }
 
 static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util;
+	unsigned long util, max = sg_cpu->max;
 
 	if (rt_rq_is_runnable(&rq->rt))
 		return sg_cpu->max;
 
+	if (unlikely(sg_cpu->util_irq >= max))
+		return max;
+
+	/* Sum rq utilization */
 	util = sg_cpu->util_cfs;
 	util += sg_cpu->util_rt;
 
-	if ((util + sg_cpu->util_dl) >= sg_cpu->max)
-		return sg_cpu->max;
+	/*
+	 * Interrupt time is not seen by RQS utilization so we can compare
+	 * them with the CPU capacity
+	 */
+	if ((util + sg_cpu->util_dl) >= max)
+		return max;
 
 	/*
 	 * As there is still idle time on the CPU, we need to compute the
@@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 * ready for such an interface. So, we only do the latter for now.
 	 */
 
+	/* Weight RQS utilization to normal context window */
+	util *= (max - sg_cpu->util_irq);
+	util /= max;
+
+	/* Add interrupt utilization */
+	util += sg_cpu->util_irq;
+
 	/* Add DL bandwidth requirement */
 	util += sg_cpu->bw_dl;
 
-	return min(sg_cpu->max, util);
+	return min(max, util);
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b26d0c9948dd..b2833e2b4b6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2228,4 +2228,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 {
 	return rq->avg_rt.util_avg;
 }
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+	return rq->avg_irq.util_avg;
+}
+#else
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+	return 0;
+}
+
+#endif
 #endif

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/cpufreq: Remove sugov_aggregate_util()
  2018-06-28 15:45 ` [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util() Vincent Guittot
  2018-07-06  6:02   ` Viresh Kumar
@ 2018-07-15 23:30   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, tglx, torvalds, linux-kernel, vincent.guittot,
	viresh.kumar, hpa

Commit-ID:  dfa444dc2ff62edbaf1ff95ed22dd2ce8a5715da
Gitweb:     https://git.kernel.org/tip/dfa444dc2ff62edbaf1ff95ed22dd2ce8a5715da
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:11 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

sched/cpufreq: Remove sugov_aggregate_util()

There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Rebased after adding irq tracking and fixed some compilation errors. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
 kernel/sched/sched.h             |  2 +-
 2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7016bde9d194..c9622b3f183d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -53,12 +53,7 @@ struct sugov_cpu {
 	unsigned int		iowait_boost_max;
 	u64			last_update;
 
-	/* The fields below are only needed when sharing a policy: */
-	unsigned long		util_cfs;
-	unsigned long		util_dl;
 	unsigned long		bw_dl;
-	unsigned long		util_rt;
-	unsigned long		util_irq;
 	unsigned long		max;
 
 	/* The field below is for single-CPU policies only: */
@@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-static void sugov_get_util(struct sugov_cpu *sg_cpu)
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util, irq, max;
 
-	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
-	sg_cpu->util_cfs = cpu_util_cfs(rq);
-	sg_cpu->util_dl  = cpu_util_dl(rq);
-	sg_cpu->bw_dl    = cpu_bw_dl(rq);
-	sg_cpu->util_rt  = cpu_util_rt(rq);
-	sg_cpu->util_irq = cpu_util_irq(rq);
-}
-
-static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
-{
-	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util, max = sg_cpu->max;
+	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
 
 	if (rt_rq_is_runnable(&rq->rt))
-		return sg_cpu->max;
+		return max;
+
+	irq = cpu_util_irq(rq);
 
-	if (unlikely(sg_cpu->util_irq >= max))
+	if (unlikely(irq >= max))
 		return max;
 
 	/* Sum rq utilization */
-	util = sg_cpu->util_cfs;
-	util += sg_cpu->util_rt;
+	util = cpu_util_cfs(rq);
+	util += cpu_util_rt(rq);
 
 	/*
 	 * Interrupt time is not seen by RQS utilization so we can compare
 	 * them with the CPU capacity
 	 */
-	if ((util + sg_cpu->util_dl) >= max)
+	if ((util + cpu_util_dl(rq)) >= max)
 		return max;
 
 	/*
@@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
 	 */
 
 	/* Weight RQS utilization to normal context window */
-	util *= (max - sg_cpu->util_irq);
+	util *= (max - irq);
 	util /= max;
 
 	/* Add interrupt utilization */
-	util += sg_cpu->util_irq;
+	util += irq;
 
 	/* Add DL bandwidth requirement */
 	util += sg_cpu->bw_dl;
@@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 
 	busy = sugov_cpu_is_busy(sg_cpu);
 
-	sugov_get_util(sg_cpu);
+	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;
-	util = sugov_aggregate_util(sg_cpu);
 	sugov_iowait_apply(sg_cpu, time, &util, &max);
 	next_f = get_next_freq(sg_policy, util, max);
 	/*
@@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
 		unsigned long j_util, j_max;
 
-		sugov_get_util(j_sg_cpu);
+		j_util = sugov_get_util(j_sg_cpu);
 		j_max = j_sg_cpu->max;
-		j_util = sugov_aggregate_util(j_sg_cpu);
 		sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);
 
 		if (j_util * max > j_max * util) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2833e2b4b6a..061d51fb5b44 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2226,7 +2226,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
 
 static inline unsigned long cpu_util_rt(struct rq *rq)
 {
-	return rq->avg_rt.util_avg;
+	return READ_ONCE(rq->avg_rt.util_avg);
 }
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/core: Use PELT for scale_rt_capacity()
  2018-06-28 15:45 ` [PATCH 09/11] sched: use pelt for scale_rt_capacity() Vincent Guittot
  2018-07-15 22:15   ` Ingo Molnar
@ 2018-07-15 23:32   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, vincent.guittot, linux-kernel, peterz, tglx, hpa, mingo

Commit-ID:  523e979d31648112bad07f427c183525c0258c75
Gitweb:     https://git.kernel.org/tip/523e979d31648112bad07f427c183525c0258c75
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:12 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 16 Jul 2018 00:16:25 +0200

sched/core: Use PELT for scale_rt_capacity()

The utilization of the CPU by RT, DL and IRQs is now tracked with PELT, so
we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for the CFS class.

The behavior of scale_rt_capacity() has changed: it now returns the
remaining capacity available for CFS instead of a scaling factor, because
RT, DL and IRQ now provide absolute utilization values.

The same formula as in schedutil is used:

  IRQ util_avg + (1 - IRQ util_avg / max capacity) * \Sum rq util_avg

but the implementation is different because it doesn't return the same
value and doesn't benefit from the same optimizations.
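
A minimal standalone sketch of the new computation (illustrative only, not
the kernel code; rt, dl and irq are the PELT util_avg values and max the
CPU capacity):

static unsigned long remaining_cfs_capacity(unsigned long max,
					    unsigned long rt,
					    unsigned long dl,
					    unsigned long irq)
{
	unsigned long used = rt + dl;
	unsigned long free;

	/* saturated by irq or by rt+dl: (almost) nothing left for CFS */
	if (irq >= max || used >= max)
		return 1;

	free = max - used;
	/* rt/dl run on the task clock, so compress by the irq share */
	free = free * (max - irq) / max;

	/* the kernel clamps a zero result to 1 in update_cpu_capacity() */
	return free;
}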

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c |  2 --
 kernel/sched/fair.c     | 44 ++++++++++++++++++++++----------------------
 kernel/sched/pelt.c     |  2 +-
 kernel/sched/rt.c       |  2 --
 4 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4de26982d80..68b8a9f1c9ca 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2782b29c79f..d265fa9756a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7551,39 +7551,39 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, used, age_stamp, avg;
-	s64 delta;
-
-	/*
-	 * Since we're reading these variables without serialization make sure
-	 * we read them once before doing sanity checks on them.
-	 */
-	age_stamp = READ_ONCE(rq->age_stamp);
-	avg = READ_ONCE(rq->rt_avg);
-	delta = __rq_clock_broken(rq) - age_stamp;
+	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long used, free;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	unsigned long irq;
+#endif
 
-	if (unlikely(delta < 0))
-		delta = 0;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	irq = READ_ONCE(rq->avg_irq.util_avg);
 
-	total = sched_avg_period() + delta;
+	if (unlikely(irq >= max))
+		return 1;
+#endif
 
-	used = div_u64(avg, total);
+	used = READ_ONCE(rq->avg_rt.util_avg);
+	used += READ_ONCE(rq->avg_dl.util_avg);
 
-	if (likely(used < SCHED_CAPACITY_SCALE))
-		return SCHED_CAPACITY_SCALE - used;
+	if (unlikely(used >= max))
+		return 1;
 
-	return 1;
+	free = max - used;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+	free *= (max - irq);
+	free /= max;
+#endif
+	return free;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
+	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
 
-	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-
-	capacity *= scale_rt_capacity(cpu);
-	capacity >>= SCHED_CAPACITY_SHIFT;
+	cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
 
 	if (!capacity)
 		capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index ead6d8b4a8b8..35475c0c5419 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
 	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
-	sa->util_avg = sa->util_sum / divider;
+	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0dc8ad1915e6..2df72abfa24a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -973,8 +973,6 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/core: Remove the rt_avg code
  2018-06-28 15:45 ` [PATCH 10/11] sched: remove rt_avg code Vincent Guittot
@ 2018-07-15 23:33   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, mingo, linux-kernel, hpa, torvalds, vincent.guittot

Commit-ID:  bbb62c0b024a1c721232667fa1d625cf6b3a555b
Gitweb:     https://git.kernel.org/tip/bbb62c0b024a1c721232667fa1d625cf6b3a555b
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:13 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/core: Remove the rt_avg code

rt_avg is not used anywhere anymore, so we can remove all related code.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 26 --------------------------
 kernel/sched/fair.c  |  2 --
 kernel/sched/sched.h | 17 -----------------
 3 files changed, 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 38107a95baca..a691b07390ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -651,23 +651,6 @@ bool sched_can_stop_tick(struct rq *rq)
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
-	s64 period = sched_avg_period();
-
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (rq->age_stamp));
-		rq->age_stamp += period;
-		rq->rt_avg /= 2;
-	}
-}
-
 #endif /* CONFIG_SMP */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5716,13 +5699,6 @@ void set_rq_offline(struct rq *rq)
 	}
 }
 
-static void set_cpu_rq_start_time(unsigned int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-
-	rq->age_stamp = sched_clock_cpu(cpu);
-}
-
 /*
  * used to mark begin/end of suspend/resume:
  */
@@ -5840,7 +5816,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
-	set_cpu_rq_start_time(cpu);
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -6108,7 +6083,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP
 	idle_thread_set_boot_cpu();
-	set_cpu_rq_start_time(smp_processor_id());
 #endif
 	init_sched_fair_class();
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d265fa9756a2..d5f7d521e448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,8 +5323,6 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
 
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
-
-	sched_avg_update(this_rq);
 }
 
 /* Used instead of source_load when we know the type == 0 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 061d51fb5b44..14aac2d2de80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -853,8 +853,6 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
-	u64			rt_avg;
-	u64			age_stamp;
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
@@ -1719,11 +1717,6 @@ extern const_debug unsigned int sysctl_sched_time_avg;
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-static inline u64 sched_avg_period(void)
-{
-	return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
@@ -1760,8 +1753,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
 #endif
 
 #ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
 #ifndef arch_scale_cpu_capacity
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1772,12 +1763,6 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 	return SCHED_CAPACITY_SCALE;
 }
 #endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
-	rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
-	sched_avg_update(rq);
-}
 #else
 #ifndef arch_scale_cpu_capacity
 static __always_inline
@@ -1786,8 +1771,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
 	return SCHED_CAPACITY_SCALE;
 }
 #endif
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
 struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/sysctl: Remove unused sched_time_avg_ms sysctl
  2018-06-28 15:45 ` [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms Vincent Guittot
  2018-06-28 15:51   ` Luis R. Rodriguez
@ 2018-07-15 23:33   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 44+ messages in thread
From: tip-bot for Vincent Guittot @ 2018-07-15 23:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, keescook, hpa, linux-kernel, peterz, mcgrof, mingo,
	tglx, vincent.guittot

Commit-ID:  5fd778915ad29184a5ff8eb82d1118f6916b79e4
Gitweb:     https://git.kernel.org/tip/5fd778915ad29184a5ff8eb82d1118f6916b79e4
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 28 Jun 2018 17:45:14 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/sysctl: Remove unused sched_time_avg_ms sysctl

/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere,
remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched/sysctl.h | 1 -
 kernel/sched/core.c          | 8 --------
 kernel/sched/sched.h         | 1 -
 kernel/sysctl.c              | 8 --------
 4 files changed, 18 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a1512ec55..913488d828cb 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,7 +40,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
-extern __read_mostly unsigned int sysctl_sched_time_avg;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a691b07390ab..ba6bb805693a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -46,14 +46,6 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
-/*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14aac2d2de80..ebb4b3c3ece7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,7 +1713,6 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
 extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 
-extern const_debug unsigned int sysctl_sched_time_avg;
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2d9837c0aff4..f22f76b7a138 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,14 +368,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "sched_time_avg_ms",
-		.data		= &sysctl_sched_time_avg,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [tip:sched/core] sched/cpufreq: Clarify sugov_get_util()
  2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
  2018-07-05 13:32   ` Vincent Guittot
  2018-07-06  6:05   ` Viresh Kumar
@ 2018-07-15 23:34   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-07-15 23:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, tglx, linux-kernel, torvalds, mingo, peterz,
	vincent.guittot, viresh.kumar

Commit-ID:  45f5519ec55e75af3565dd737586d3b041834f71
Gitweb:     https://git.kernel.org/tip/45f5519ec55e75af3565dd737586d3b041834f71
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 5 Jul 2018 14:36:17 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/cpufreq: Clarify sugov_get_util()

Add a few comments to (hopefully) clarify some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/20180705123617.GM2458@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cpufreq_schedutil.c | 75 +++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c9622b3f183d..97dcd4472a0e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ *   cpu_util_{cfs,rt,dl,irq}()
+ *   cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued
+ * in the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured metric but a value computed
+ * based on the task model parameters and gives the minimal utilization
+ * required to meet deadlines.
+ */
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
@@ -188,47 +208,60 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 	if (rt_rq_is_runnable(&rq->rt))
 		return max;
 
+	/*
+	 * Early check to see if IRQ/steal time saturates the CPU, can be
+	 * because of inaccuracies in how we track these -- see
+	 * update_irq_load_avg().
+	 */
 	irq = cpu_util_irq(rq);
-
 	if (unlikely(irq >= max))
 		return max;
 
-	/* Sum rq utilization */
+	/*
+	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
+	 * CFS tasks and we use the same metric to track the effective
+	 * utilization (PELT windows are synchronized) we can directly add them
+	 * to obtain the CPU's actual utilization.
+	 */
 	util = cpu_util_cfs(rq);
 	util += cpu_util_rt(rq);
 
 	/*
-	 * Interrupt time is not seen by RQS utilization so we can compare
-	 * them with the CPU capacity
+	 * We do not make cpu_util_dl() a permanent part of this sum because we
+	 * want to use cpu_bw_dl() later on, but we need to check if the
+	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
+	 * f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
 	 */
 	if ((util + cpu_util_dl(rq)) >= max)
 		return max;
 
 	/*
-	 * As there is still idle time on the CPU, we need to compute the
-	 * utilization level of the CPU.
+	 * There is still idle time; further improve the number by using the
+	 * irq metric. Because IRQ/steal time is hidden from the task clock we
+	 * need to scale the task numbers:
 	 *
+	 *              1 - irq
+	 *   U' = irq + ------- * U
+	 *                max
+	 */
+	util *= (max - irq);
+	util /= max;
+	util += irq;
+
+	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
 	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
 	 * to gracefully reduce the frequency when no tasks show up for longer
 	 * periods of time.
 	 *
-	 * Ideally we would like to set util_dl as min/guaranteed freq and
-	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
-	 * ready for such an interface. So, we only do the latter for now.
+	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
+	 * an interface. So, we only do the latter for now.
 	 */
-
-	/* Weight RQS utilization to normal context window */
-	util *= (max - irq);
-	util /= max;
-
-	/* Add interrupt utilization */
-	util += irq;
-
-	/* Add DL bandwidth requirement */
-	util += sg_cpu->bw_dl;
-
-	return min(max, util);
+	return min(max, util + sg_cpu->bw_dl);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()
  2018-07-15 22:15   ` Ingo Molnar
  2018-07-15 22:46     ` Joe Perches
@ 2018-07-16 11:24     ` Vincent Guittot
  2018-07-16 11:39       ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-07-16 11:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Rafael J. Wysocki, Juri Lelli,
	Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar

Hi Ingo,

On Mon, 16 Jul 2018 at 00:15, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>

> > @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> >  static unsigned long scale_rt_capacity(int cpu)
> >  {
> >       struct rq *rq = cpu_rq(cpu);
> > -     u64 total, used, age_stamp, avg;
> > -     s64 delta;
> > -
> > -     /*
> > -      * Since we're reading these variables without serialization make sure
> > -      * we read them once before doing sanity checks on them.
> > -      */
> > -     age_stamp = READ_ONCE(rq->age_stamp);
> > -     avg = READ_ONCE(rq->rt_avg);
> > -     delta = __rq_clock_broken(rq) - age_stamp;
> > +     unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> > +     unsigned long used, irq, free;
> >
> > -     if (unlikely(delta < 0))
> > -             delta = 0;
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +     irq = READ_ONCE(rq->avg_irq.util_avg);
> >
> > -     total = sched_avg_period() + delta;
> > +     if (unlikely(irq >= max))
> > +             return 1;
> > +#endif
>
> Note that 'irq' is unused outside that macro block, resulting in a new warning on
> defconfig builds:
>
>  CC      kernel/sched/fair.o
>  kernel/sched/fair.c: In function ‘scale_rt_capacity’:
>  kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
>    unsigned long used, irq, free;
>                        ^~~
>
> I have applied the delta fix below for simplicity, but what we really want is a
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor
> out the 'irq' utilization value into a helper inline, and double check that if the
> configs are off the compiler does the right thing and eliminates this identity
> transformation for the irq==0 case:
>
>         free *= (max - irq);
>         free /= max;
>
> If the compiler refuses to optimize this away (due to the zero and overflow
> cases), try to find something more clever?

Thanks for the fix.
I'm off for now and will look at your proposal above once I'm back.

Regards,
Vincent

>
> Thanks,
>
>         Ingo
>
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e3221db0511a..d5f7d521e448 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> -       unsigned long used, irq, free;
> +       unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +       unsigned long irq;
> +#endif
>
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
>         irq = READ_ONCE(rq->avg_irq.util_avg);

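One possible shape for the helper-based cleanup suggested above, assuming
cpu_util_irq() (added to sched.h by patch 07) were made visible outside
the schedutil #ifdef and kept a !CONFIG stub returning 0 (a sketch only,
not the fix that was eventually applied):

static unsigned long scale_rt_capacity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
	unsigned long used, free;
	unsigned long irq = cpu_util_irq(rq);	/* 0 when both configs are off */

	if (unlikely(irq >= max))
		return 1;

	used = READ_ONCE(rq->avg_rt.util_avg);
	used += READ_ONCE(rq->avg_dl.util_avg);
	if (unlikely(used >= max))
		return 1;

	free = max - used;
	/* with irq == 0 this is free * max / max; ideally the compiler folds it */
	free *= (max - irq);
	free /= max;

	return free;
}
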
^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()
  2018-07-16 11:24     ` Vincent Guittot
@ 2018-07-16 11:39       ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2018-07-16 11:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, linux-kernel, Rafael J. Wysocki, Juri Lelli,
	Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> > If the compiler refuses to optimize this away (due to the zero and overflow
> > cases), try to find something more clever?
> 
> Thanks for the fix.
> I'm off for now and will look at your proposal above once back

Sounds good, there's no rush, we've still got time until ~rc7.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/11] sched/irq: add irq utilization tracking
  2018-06-28 15:45 ` [PATCH 06/11] sched/irq: add irq " Vincent Guittot
  2018-07-15 23:29   ` [tip:sched/core] sched/irq: Add IRQ " tip-bot for Vincent Guittot
@ 2018-07-26  3:09   ` Wanpeng Li
  2018-07-30 16:43     ` Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Wanpeng Li @ 2018-07-26  3:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Rafael J. Wysocki, juri.lelli,
	Dietmar Eggemann, Morten Rasmussen, Viresh Kumar,
	valentin.schneider, Patrick Bellasi, joel, Daniel Lezcano,
	quentin.perret, Luca Abeni, claudio, Ingo Molnar, kvm

Hi Vincent,
On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> interrupt and steal time are the only remaining activities tracked by
> rt_avg. Like for sched classes, we can use PELT to track their average
> utilization of the CPU. But unlike sched class, we don't track when
> entering/leaving interrupt; Instead, we take into account the time spent
> under interrupt context when we update rqs' clock (rq_clock_task).
> This also means that we have to decay the normal context time and account
> for interrupt time during the update.
>
> That's also important to note that because
>   rq_clock == rq_clock_task + interrupt time
> and rq_clock_task is used by a sched class to compute its utilization, the
> util_avg of a sched class only reflects the utilization of the time spent
> in normal context and not of the whole time of the CPU. The utilization of
> interrupt gives an more accurate level of utilization of CPU.
> The CPU utilization is :
>   avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
>
> Most of the time, avg_irq is small and neglictible so the use of the
> approximation CPU utilization = /Sum avg_rq was enough
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/core.c  |  4 +++-
>  kernel/sched/fair.c  | 13 ++++++++++---
>  kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/pelt.h  | 16 ++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  5 files changed, 72 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d8fac..e5263a4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -18,6 +18,8 @@
>  #include "../workqueue_internal.h"
>  #include "../smpboot.h"
>
> +#include "pelt.h"
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/sched.h>
>
> @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
>  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
>         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> -               sched_rt_avg_update(rq, irq_delta + steal);
> +               update_irq_load_avg(rq, irq_delta + steal);

I think we should not add steal time into irq load tracking. Steal time
is always 0 on a native kernel, so it doesn't matter there, but what will
happen when a guest disables IRQ_TIME_ACCOUNTING and enables
PARAVIRT_TIME_ACCOUNTING? Steal time is not real irq util_avg. In
addition, we haven't exposed power management for performance, which
means that e.g. the schedutil governor cannot cooperate with the
passive-mode intel_pstate driver to tune the OPP. Decaying the old steal
time avg and adding the new one just wastes cpu cycles.

Regards,
Wanpeng Li

>  #endif
>  }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffce4b2..d2758e3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>         return false;
>  }
>
> -static inline bool others_rqs_have_blocked(struct rq *rq)
> +static inline bool others_have_blocked(struct rq *rq)
>  {
>         if (READ_ONCE(rq->avg_rt.util_avg))
>                 return true;
> @@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
>         if (READ_ONCE(rq->avg_dl.util_avg))
>                 return true;
>
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +       if (READ_ONCE(rq->avg_irq.util_avg))
> +               return true;
> +#endif
> +
>         return false;
>  }
>
> @@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
>         }
>         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> +       update_irq_load_avg(rq, 0);
>         /* Don't need periodic decay once load/util_avg are null */
> -       if (others_rqs_have_blocked(rq))
> +       if (others_have_blocked(rq))
>                 done = false;
>
>  #ifdef CONFIG_NO_HZ_COMMON
> @@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
>         update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> +       update_irq_load_avg(rq, 0);
>  #ifdef CONFIG_NO_HZ_COMMON
>         rq->last_blocked_load_update_tick = jiffies;
> -       if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
> +       if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
>                 rq->has_blocked_load = 0;
>  #endif
>         rq_unlock_irqrestore(rq, &rf);
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 8b78b63..ead6d8b 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>
>         return 0;
>  }
> +
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +/*
> + * irq:
> + *
> + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> + *   util_sum = cpu_scale * load_sum
> + *   runnable_load_sum = load_sum
> + *
> + */
> +
> +int update_irq_load_avg(struct rq *rq, u64 running)
> +{
> +       int ret = 0;
> +       /*
> +        * We know the time that has been used by interrupt since last update
> +        * but we don't when. Let be pessimistic and assume that interrupt has
> +        * happened just before the update. This is not so far from reality
> +        * because interrupt will most probably wake up task and trig an update
> +        * of rq clock during which the metric si updated.
> +        * We start to decay with normal context time and then we add the
> +        * interrupt context time.
> +        * We can safely remove running from rq->clock because
> +        * rq->clock += delta with delta >= running
> +        */
> +       ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> +                               0,
> +                               0,
> +                               0);
> +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> +                               1,
> +                               1,
> +                               1);
> +
> +       if (ret)
> +               ___update_load_avg(&rq->avg_irq, 1, 1);
> +
> +       return ret;
> +}
> +#endif
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 0e4f912..d2894db 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +int update_irq_load_avg(struct rq *rq, u64 running);
> +#else
> +static inline int
> +update_irq_load_avg(struct rq *rq, u64 running)
> +{
> +       return 0;
> +}
> +#endif
> +
>  /*
>   * When a task is dequeued, its estimated utilization should not be update if
>   * its util_avg has not been updated at least once.
> @@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  {
>         return 0;
>  }
> +
> +static inline int
> +update_irq_load_avg(struct rq *rq, u64 running)
> +{
> +       return 0;
> +}
>  #endif
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ef5d6aa..377be2b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -850,6 +850,9 @@ struct rq {
>         u64                     age_stamp;
>         struct sched_avg        avg_rt;
>         struct sched_avg        avg_dl;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +       struct sched_avg        avg_irq;
> +#endif
>         u64                     idle_stamp;
>         u64                     avg_idle;
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/11] sched/irq: add irq utilization tracking
  2018-07-26  3:09   ` [PATCH 06/11] sched/irq: add irq " Wanpeng Li
@ 2018-07-30 16:43     ` Vincent Guittot
  2018-07-31  3:32       ` Wanpeng Li
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2018-07-30 16:43 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar, kvm

Hi Wanpeng,

On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> Hi Vincent,
> On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > interrupt and steal time are the only remaining activities tracked by
> > rt_avg. Like for sched classes, we can use PELT to track their average
> > utilization of the CPU. But unlike sched class, we don't track when
> > entering/leaving interrupt; Instead, we take into account the time spent
> > under interrupt context when we update rqs' clock (rq_clock_task).
> > This also means that we have to decay the normal context time and account
> > for interrupt time during the update.
> >
> > That's also important to note that because
> >   rq_clock == rq_clock_task + interrupt time
> > and rq_clock_task is used by a sched class to compute its utilization, the
> > util_avg of a sched class only reflects the utilization of the time spent
> > in normal context and not of the whole time of the CPU. The utilization of
> > interrupt gives an more accurate level of utilization of CPU.
> > The CPU utilization is :
> >   avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
> >
> > Most of the time, avg_irq is small and neglictible so the use of the
> > approximation CPU utilization = /Sum avg_rq was enough
> >
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/core.c  |  4 +++-
> >  kernel/sched/fair.c  | 13 ++++++++++---
> >  kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/sched/pelt.h  | 16 ++++++++++++++++
> >  kernel/sched/sched.h |  3 +++
> >  5 files changed, 72 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 78d8fac..e5263a4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -18,6 +18,8 @@
> >  #include "../workqueue_internal.h"
> >  #include "../smpboot.h"
> >
> > +#include "pelt.h"
> > +
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/sched.h>
> >
> > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >
> >  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> >         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > -               sched_rt_avg_update(rq, irq_delta + steal);
> > +               update_irq_load_avg(rq, irq_delta + steal);
>
> I think we should not add steal time into irq load tracking, steal
> time is always 0 on native kernel which doesn't matter, what will
> happen when guest disables IRQ_TIME_ACCOUNTING and enables
> PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> addition, we haven't exposed power management for performance which
> means that e.g. schedutil governor can not cooperate with passive mode
> intel_pstate driver to tune the OPP. To decay the old steal time avg
> and add the new one just wastes cpu cycles.

In fact, I have kept the same behavior as with rt_avg, which already
added steal time when computing scale_rt_capacity, which is used to
reflect the remaining capacity for FAIR tasks and is used in load
balancing. I'm not sure that it's worth using different variables for
irq and steal time.
That being said, I see a possible optimization in schedutil when
PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is disabled.
With this kind of config, scale_irq_capacity could be a nop for
schedutil but still scale the utilization for scale_rt_capacity.
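
A rough sketch of that idea (hypothetical helper name, not part of this
series): when only PARAVIRT_TIME_ACCOUNTING is set, avg_irq holds pure
steal time, so schedutil could skip the compensation while load balance
keeps it:

#if defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) && !defined(CONFIG_IRQ_TIME_ACCOUNTING)
/* avg_irq is pure steal time: no compensation for OPP selection */
static inline unsigned long schedutil_scale_irq(unsigned long util,
						unsigned long irq,
						unsigned long max)
{
	return util;
}
#else
static inline unsigned long schedutil_scale_irq(unsigned long util,
						unsigned long irq,
						unsigned long max)
{
	return util * (max - irq) / max + irq;
}
#endif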

Regards,
Vincent

>
> Regards,
> Wanpeng Li
>
> >  #endif
> >  }
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ffce4b2..d2758e3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> >         return false;
> >  }
> >
> > -static inline bool others_rqs_have_blocked(struct rq *rq)
> > +static inline bool others_have_blocked(struct rq *rq)
> >  {
> >         if (READ_ONCE(rq->avg_rt.util_avg))
> >                 return true;
> > @@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
> >         if (READ_ONCE(rq->avg_dl.util_avg))
> >                 return true;
> >
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +       if (READ_ONCE(rq->avg_irq.util_avg))
> > +               return true;
> > +#endif
> > +
> >         return false;
> >  }
> >
> > @@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
> >         }
> >         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> >         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> > +       update_irq_load_avg(rq, 0);
> >         /* Don't need periodic decay once load/util_avg are null */
> > -       if (others_rqs_have_blocked(rq))
> > +       if (others_have_blocked(rq))
> >                 done = false;
> >
> >  #ifdef CONFIG_NO_HZ_COMMON
> > @@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
> >         update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
> >         update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> >         update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> > +       update_irq_load_avg(rq, 0);
> >  #ifdef CONFIG_NO_HZ_COMMON
> >         rq->last_blocked_load_update_tick = jiffies;
> > -       if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
> > +       if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
> >                 rq->has_blocked_load = 0;
> >  #endif
> >         rq_unlock_irqrestore(rq, &rf);
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index 8b78b63..ead6d8b 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> >
> >         return 0;
> >  }
> > +
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +/*
> > + * irq:
> > + *
> > + *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> > + *   util_sum = cpu_scale * load_sum
> > + *   runnable_load_sum = load_sum
> > + *
> > + */
> > +
> > +int update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > +       int ret = 0;
> > +       /*
> > +        * We know how much time has been spent in interrupt since the last
> > +        * update but we don't know when. Let's be pessimistic and assume the
> > +        * interrupt happened just before the update. This is not far from
> > +        * reality because the interrupt will most probably wake up a task and
> > +        * trigger an update of the rq clock during which the metric is updated.
> > +        * We first decay with the normal context time and then we add the
> > +        * interrupt context time.
> > +        * We can safely remove running from rq->clock because
> > +        * rq->clock += delta with delta >= running
> > +        */
> > +       ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> > +                               0,
> > +                               0,
> > +                               0);
> > +       ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> > +                               1,
> > +                               1,
> > +                               1);
> > +
> > +       if (ret)
> > +               ___update_load_avg(&rq->avg_irq, 1, 1);
> > +
> > +       return ret;
> > +}
> > +#endif
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index 0e4f912..d2894db 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
> >  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> >  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> >
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +int update_irq_load_avg(struct rq *rq, u64 running);
> > +#else
> > +static inline int
> > +update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > +       return 0;
> > +}
> > +#endif
> > +
> >  /*
> >   * When a task is dequeued, its estimated utilization should not be update if
> >   * its util_avg has not been updated at least once.
> > @@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> >  {
> >         return 0;
> >  }
> > +
> > +static inline int
> > +update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > +       return 0;
> > +}
> >  #endif
> >
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ef5d6aa..377be2b 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -850,6 +850,9 @@ struct rq {
> >         u64                     age_stamp;
> >         struct sched_avg        avg_rt;
> >         struct sched_avg        avg_dl;
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +       struct sched_avg        avg_irq;
> > +#endif
> >         u64                     idle_stamp;
> >         u64                     avg_idle;
> >
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/11] sched/irq: add irq utilization tracking
  2018-07-30 16:43     ` Vincent Guittot
@ 2018-07-31  3:32       ` Wanpeng Li
  2018-07-31  8:21         ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Wanpeng Li @ 2018-07-31  3:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Rafael J. Wysocki, juri.lelli,
	Dietmar Eggemann, Morten Rasmussen, Viresh Kumar,
	valentin.schneider, Patrick Bellasi, joel, Daniel Lezcano,
	quentin.perret, Luca Abeni, claudio, Ingo Molnar, kvm

On Tue, 31 Jul 2018 at 00:43, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Wanpeng,
>
> On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> > Hi Vincent,
> > On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > Interrupt and steal time are the only remaining activities tracked by
> > > rt_avg. Like for sched classes, we can use PELT to track their average
> > > utilization of the CPU. But unlike sched classes, we don't track when
> > > entering/leaving interrupt context; instead, we take into account the
> > > time spent in interrupt context when we update the rqs' clock
> > > (rq_clock_task). This also means that we have to decay the normal
> > > context time and account for the interrupt time during the update.
> > >
> > > It is also important to note that, because
> > >   rq_clock == rq_clock_task + interrupt time
> > > and rq_clock_task is used by a sched class to compute its utilization,
> > > the util_avg of a sched class only reflects the utilization of the time
> > > spent in normal context and not of the whole time of the CPU. The
> > > utilization of interrupt gives a more accurate level of utilization of
> > > the CPU.
> > > The CPU utilization is:
> > >   avg_irq + (1 - avg_irq / max capacity) * \Sum avg_rq
> > >
> > > Most of the time, avg_irq is small and negligible, so the approximation
> > > CPU utilization = \Sum avg_rq was enough.
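> > >
> > > (For example, with illustrative numbers: a max capacity of 1024,
> > > avg_irq = 100 and \Sum avg_rq = 512, the formula above gives
> > > 100 + (924/1024) * 512 = 562, whereas the old approximation reports
> > > only 512.)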
> > >
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/core.c  |  4 +++-
> > >  kernel/sched/fair.c  | 13 ++++++++++---
> > >  kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/sched/pelt.h  | 16 ++++++++++++++++
> > >  kernel/sched/sched.h |  3 +++
> > >  5 files changed, 72 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 78d8fac..e5263a4 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -18,6 +18,8 @@
> > >  #include "../workqueue_internal.h"
> > >  #include "../smpboot.h"
> > >
> > > +#include "pelt.h"
> > > +
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/sched.h>
> > >
> > > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >
> > >  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > >         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > -               sched_rt_avg_update(rq, irq_delta + steal);
> > > +               update_irq_load_avg(rq, irq_delta + steal);
> >
> > I think we should not add steal time into irq load tracking. Steal
> > time is always 0 on a native kernel, so it doesn't matter there, but
> > what will happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> > PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> > addition, we haven't exposed power management for performance, which
> > means that e.g. the schedutil governor cannot cooperate with the
> > passive mode intel_pstate driver to tune the OPP. Decaying the old
> > steal time avg and adding the new one just wastes cpu cycles.
>
> In fact, I have kept the same behavior as with rt_avg, which was
> already adding steal time when computing scale_rt_capacity, which is
> used to reflect the remaining capacity for FAIR tasks and is used in
> load balance. I'm not sure that it's worth using different variables
> for irq and steal.
> That being said, I see a possible optimization in schedutil when
> PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is
> disabled. With this kind of config, scale_irq_capacity can be a nop
> for schedutil but still scale the utilization for scale_rt_capacity.

Yeah, this is what was in my mind before; you can make a patch for that. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 06/11] sched/irq: add irq utilization tracking
  2018-07-31  3:32       ` Wanpeng Li
@ 2018-07-31  8:21         ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2018-07-31  8:21 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Dietmar Eggemann, Morten Rasmussen, viresh kumar,
	Valentin Schneider, Patrick Bellasi, Joel Fernandes,
	Daniel Lezcano, Quentin Perret, Luca Abeni, Claudio Scordino,
	Ingo Molnar, kvm

On Tue, 31 Jul 2018 at 05:32, Wanpeng Li <kernellwp@gmail.com> wrote:

> > > >
> > > >  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > > >         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > > -               sched_rt_avg_update(rq, irq_delta + steal);
> > > > +               update_irq_load_avg(rq, irq_delta + steal);
> > >
> > > I think we should not add steal time into irq load tracking. Steal
> > > time is always 0 on a native kernel, so it doesn't matter there, but
> > > what will happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> > > PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> > > addition, we haven't exposed power management for performance, which
> > > means that e.g. the schedutil governor cannot cooperate with the
> > > passive mode intel_pstate driver to tune the OPP. Decaying the old
> > > steal time avg and adding the new one just wastes cpu cycles.
> >
> > In fact, I have kept the same behavior as with rt_avg, which was
> > already adding steal time when computing scale_rt_capacity, which is
> > used to reflect the remaining capacity for FAIR tasks and is used in
> > load balance. I'm not sure that it's worth using different variables
> > for irq and steal.
> > That being said, I see a possible optimization in schedutil when
> > PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is
> > disabled. With this kind of config, scale_irq_capacity can be a nop
> > for schedutil but still scale the utilization for scale_rt_capacity.
>
> Yeah, this is what was in my mind before; you can make a patch for that. :)

ok, I'm going to prepare a patch

Thanks

>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-07-31  8:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-28 15:45 [PATCH v7 00/11] track CPU utilization Vincent Guittot
2018-06-28 15:45 ` [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file Vincent Guittot
2018-07-15 23:26   ` [tip:sched/core] sched/pelt: Move PELT " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 02/11] sched/rt: add rt_rq utilization tracking Vincent Guittot
2018-07-15 23:27   ` [tip:sched/core] sched/rt: Add " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 03/11] cpufreq/schedutil: use rt " Vincent Guittot
2018-07-06  5:56   ` Viresh Kumar
2018-07-15 23:27   ` [tip:sched/core] cpufreq/schedutil: Use RT " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 04/11] sched/dl: add dl_rq " Vincent Guittot
2018-07-15 23:28   ` [tip:sched/core] sched/dl: Add " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 05/11] cpufreq/schedutil: use dl " Vincent Guittot
2018-07-06  5:59   ` Viresh Kumar
2018-07-15 23:28   ` [tip:sched/core] cpufreq/schedutil: Use DL " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 06/11] sched/irq: add irq " Vincent Guittot
2018-07-15 23:29   ` [tip:sched/core] sched/irq: Add IRQ " tip-bot for Vincent Guittot
2018-07-26  3:09   ` [PATCH 06/11] sched/irq: add irq " Wanpeng Li
2018-07-30 16:43     ` Vincent Guittot
2018-07-31  3:32       ` Wanpeng Li
2018-07-31  8:21         ` Vincent Guittot
2018-06-28 15:45 ` [PATCH 07/11] cpufreq/schedutil: take into account interrupt Vincent Guittot
2018-07-06  6:00   ` Viresh Kumar
2018-07-06  9:14     ` Peter Zijlstra
2018-07-06  9:21       ` Vincent Guittot
2018-07-15 23:29   ` [tip:sched/core] cpufreq/schedutil: Take time spent in interrupts into account tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util() Vincent Guittot
2018-07-06  6:02   ` Viresh Kumar
2018-07-15 23:30   ` [tip:sched/core] sched/cpufreq: Remove sugov_aggregate_util() tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 09/11] sched: use pelt for scale_rt_capacity() Vincent Guittot
2018-07-15 22:15   ` Ingo Molnar
2018-07-15 22:46     ` Joe Perches
2018-07-16 11:24     ` Vincent Guittot
2018-07-16 11:39       ` Ingo Molnar
2018-07-15 23:32   ` [tip:sched/core] sched/core: Use PELT " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 10/11] sched: remove rt_avg code Vincent Guittot
2018-07-15 23:33   ` [tip:sched/core] sched/core: Remove the " tip-bot for Vincent Guittot
2018-06-28 15:45 ` [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms Vincent Guittot
2018-06-28 15:51   ` Luis R. Rodriguez
2018-06-29  5:49     ` Vincent Guittot
2018-07-15 23:33   ` [tip:sched/core] sched/sysctl: Remove unused sched_time_avg_ms sysctl tip-bot for Vincent Guittot
2018-07-05 12:36 ` [PATCH v7 00/11] track CPU utilization Peter Zijlstra
2018-07-05 13:32   ` Vincent Guittot
2018-07-06  6:05   ` Viresh Kumar
2018-07-06  9:18     ` Peter Zijlstra
2018-07-15 23:34   ` [tip:sched/core] sched/cpufreq: Clarify sugov_get_util() tip-bot for Peter Zijlstra
