From: Quentin Perret <quentin.perret@arm.com>
To: peterz@infradead.org, rjw@rjwysocki.net,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: gregkh@linuxfoundation.org, mingo@redhat.com,
	dietmar.eggemann@arm.com, morten.rasmussen@arm.com,
	chris.redpath@arm.com, patrick.bellasi@arm.com,
	valentin.schneider@arm.com, vincent.guittot@linaro.org,
	thara.gopinath@linaro.org, viresh.kumar@linaro.org,
	tkjos@google.com, joel@joelfernandes.org, smuckle@google.com,
	adharmap@quicinc.com, skannan@quicinc.com,
	pkondeti@codeaurora.org, juri.lelli@redhat.com,
	edubezval@gmail.com, srinivas.pandruvada@linux.intel.com,
	currojerez@riseup.net, javi.merino@kernel.org,
	quentin.perret@arm.com
Subject: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
Date: Tue, 24 Jul 2018 13:25:16 +0100
Message-ID: <20180724122521.22109-10-quentin.perret@arm.com>
In-Reply-To: <20180724122521.22109-1-quentin.perret@arm.com>

From: Morten Rasmussen <morten.rasmussen@arm.com>

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
the system is above the tipping point, task placement is done the
traditional way based on load_avg, spreading the tasks across as many
cpus as possible based on priority-scaled load to preserve smp_nice.
Below the tipping point we want to use util_avg instead. We need to
define a criterion for when we make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define over-utilized
as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we
try to spread the tasks out to avoid per-cpu over-utilization as much
as possible, and as long as all tasks have the _same_ priority. If the
latter isn't true, we have to consider priority to preserve smp_nice.
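
To make the difference concrete, here is a minimal user-space sketch of
the two candidate conditions (the struct and helper names are made up
for illustration; the kernel implementation below operates on rq and
root_domain state instead):

	#include <stdbool.h>
	#include <stddef.h>

	/* Per-cpu inputs: PELT utilization and compute capacity. */
	struct cpu_stat {
		unsigned long util;	/* rq.cfs.avg.util_avg */
		unsigned long capacity;	/* rq.capacity */
	};

	/* System-wide condition: may stay false while individual cpus
	 * are saturated, because spare capacity elsewhere hides them. */
	static bool system_overutilized(const struct cpu_stat *cs, size_t n,
					unsigned long margin)
	{
		unsigned long util = 0, cap = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			util += cs[i].util;
			cap += cs[i].capacity;
		}
		return util + margin > cap;
	}

	/* Per-cpu condition used by this patch: one saturated cpu is
	 * enough to tip the whole system. */
	static bool any_cpu_overutilized(const struct cpu_stat *cs, size_t n,
					 unsigned long margin)
	{
		size_t i;

		for (i = 0; i < n; i++)
			if (cs[i].util + margin > cs[i].capacity)
				return true;
		return false;
	}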

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with the nice=-10 tasks sharing cpus and the nice=0
tasks getting cpus of their own, as we have 1.5*n_cpus tasks in total
and 55%+55% is less over-utilized than 55%+60% for those cpus that have
to be shared. The system utilization is only 85% of the system capacity
(n_cpus*55% + (n_cpus/2)*60% = 85%*n_cpus), but we are breaking
smp_nice.

To be sure not to break smp_nice, we instead define over-utilization
conservatively as the point when any cpu in the system is fully
utilized at its highest frequency:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
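
Note that the patch encodes the margin multiplicatively rather than
additively: cpu_overutilized() below tests
capacity * 1024 < util * capacity_margin. Assuming capacity_margin has
its contemporaneous fair.c value of 1280, this places the tipping point
at util above 80% of capacity, as the following sketch (not the kernel
code itself) shows:

	/* Tipping point check. capacity_margin = 1280 is assumed from
	 * the existing fair.c default; 1280/1024 = 1.25, so the check
	 * is equivalent to util > capacity * 0.8. */
	static const unsigned long capacity_margin = 1280;

	static bool cpu_overutilized(unsigned long capacity,
				     unsigned long util)
	{
		return capacity * 1024 < util * capacity_margin;
	}

So a cpu with capacity 1024 is marked over-utilized once its util_avg
goes above ~819, and a 512-capacity cpu once it goes above ~410.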

With this definition, we can skip periodic load-balance as no cpu has
an always-running task when the system is not over-utilized. All tasks
will be periodic and we can balance them at wake-up. This conservative
condition does, however, mean that some scenarios that could benefit
from energy-aware decisions even when one cpu is fully utilized will
not get those benefits.

For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as
just a single cpu is fully utilized, as it might be one of those with
reduced capacity, and in that case we want to migrate its tasks.
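
The hunks below split responsibility for the flag asymmetrically: the
fast paths (wake-up enqueue and the scheduler tick) may only set it,
while the load-balance path recomputes it from all cpus at the root
domain and can therefore clear it again. A toy user-space model of that
lifecycle (hypothetical names; the kernel versions operate on
rq/root_domain state):

	#include <stdbool.h>
	#include <stddef.h>

	static bool overutilized;	/* models rd->overutilized */

	/* Fast path (enqueue/tick): cheap, may only raise the flag. */
	static void update_overutilized_status(bool cpu_tipped)
	{
		if (!overutilized && cpu_tipped)
			overutilized = true;
	}

	/* Slow path (load balance at the root domain): recomputes the
	 * flag from every cpu, so it is also the only place where the
	 * flag can drop back to false. */
	static void update_root_domain_status(const bool *cpu_tipped,
					      size_t n)
	{
		bool any = false;
		size_t i;

		for (i = 0; i < n; i++)
			any |= cpu_tipped[i];
		overutilized = any;
	}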

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 48 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  4 ++++
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc97fa3be94..4aaa9132e840 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5043,6 +5043,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		/*
+		 * The utilization of a new task is 'wrong' so wait for it
+		 * to build some utilization history before trying to detect
+		 * the overutilized flag.
+		 */
+		if (flags & ENQUEUE_WAKEUP)
+			update_overutilized_status(rq);
+
+	}
 
 	hrtick_update(rq);
 }
@@ -7837,6 +7864,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
+		if (cpu_overutilized(i))
+			*sg_status |= SG_OVERUTILIZED;
+
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8046,8 +8076,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
+		struct root_domain *rd = env->dst_rq->rd;
+
 		/* update overload indicator if we are at root domain */
-		env->dst_rq->rd->overload = sg_status & SG_OVERLOAD;
+		rd->overload = sg_status & SG_OVERLOAD;
+
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+	} else if (sg_status & SG_OVERUTILIZED) {
+		WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
 	}
 }
 
@@ -8267,6 +8304,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+
+	if (rd_freq_domain(env->dst_rq->rd) &&
+				!READ_ONCE(env->dst_rq->rd->overutilized))
+		goto out_balanced;
+
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
@@ -9624,6 +9666,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(task_rq(curr));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b5bf245eab..6d08ccd1e7a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -709,6 +709,7 @@ struct freq_domain {
 
 /* Scheduling group status flags */
 #define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
 
 /*
  * We add the notion of a root-domain which will be used to define per-domain
@@ -728,6 +729,9 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	int			overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	int			overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
-- 
2.18.0

