* [RFC][PATCH 0/2] sched: Power optimizations with SCHED_IDLE
@ 2018-11-26 11:20 Viresh Kumar
  2018-11-26 11:20 ` [RFC][PATCH 1/2] sched: Start tracking SCHED_IDLE tasks count in cfs_rq Viresh Kumar
  2018-11-26 11:20 ` [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks Viresh Kumar
  0 siblings, 2 replies; 6+ messages in thread
From: Viresh Kumar @ 2018-11-26 11:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Viresh Kumar, linux-kernel, Vincent Guittot, tkjos,
	Daniel Lezcano, quentin.perret, chris.redpath, Dietmar.Eggemann

Hello,

The intention behind sending this series is to get some initial feedback
on the idea of specially handling CPUs which only have SCHED_IDLE
activity enqueued on them, before investing too much effort in it. The
current implementation handles very few code-paths and there is a lot
more that we would need to do to make it work well for us.

This allows enqueuing more tasks on CPUs which currently have only
SCHED_IDLE tasks enqueued, as any new task should get to run on them
without significant delay. It also avoids waking up an otherwise idle
CPU, and hence saves power without impacting performance.

We don't have lots of SCHED_IDLE tasks in Android currently, but it
should be possible to convert a few of the background tasks to
SCHED_IDLE and take advantage of this feature.
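
For reference, opting a task into SCHED_IDLE from userspace is a single
sched_setscheduler() call; a minimal sketch (untested, error handling
trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* SCHED_IDLE mandates a static priority of 0 */
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_IDLE, &param)) {
		perror("sched_setscheduler");
		return 1;
	}

	/* ... run the background work here ... */
	return 0;
}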

Only basic testing has been done so far with the help of rt-app [1], to
make sure tasks are getting placed correctly.

--
viresh

Viresh Kumar (2):
  sched: Start tracking SCHED_IDLE tasks count in cfs_rq
  sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks

 kernel/sched/core.c  | 23 ++++++++++++++++
 kernel/sched/fair.c  | 64 +++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 76 insertions(+), 16 deletions(-)

-- 
2.19.1.568.g152ad8e3369a

[1] json: https://pastebin.ubuntu.com/p/Y4fr9xddV8/



* [RFC][PATCH 1/2] sched: Start tracking SCHED_IDLE tasks count in cfs_rq
  2018-11-26 11:20 [RFC][PATCH 0/2] sched: Power optimizations with SCHED_IDLE Viresh Kumar
@ 2018-11-26 11:20 ` Viresh Kumar
  2018-11-26 11:20 ` [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks Viresh Kumar
  1 sibling, 0 replies; 6+ messages in thread
From: Viresh Kumar @ 2018-11-26 11:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Viresh Kumar, linux-kernel, Vincent Guittot, tkjos,
	Daniel Lezcano, quentin.perret, chris.redpath, Dietmar.Eggemann

Start tracking how many tasks with SCHED_IDLE policy are present in each
cfs_rq. This will be used by later commits.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/fair.c  | 14 ++++++++++++--
 kernel/sched/sched.h |  2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e30dea59d215..ad0b09ddddc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4453,7 +4453,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, dequeue = 1;
+	long task_delta, idle_task_delta, dequeue = 1;
 	bool empty;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
@@ -4464,6 +4464,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
+	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -4473,6 +4474,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		if (dequeue)
 			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
 		qcfs_rq->h_nr_running -= task_delta;
+		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 
 		if (qcfs_rq->load.weight)
 			dequeue = 0;
@@ -4512,7 +4514,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	int enqueue = 1;
-	long task_delta;
+	long task_delta, idle_task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -4532,6 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		return;
 
 	task_delta = cfs_rq->h_nr_running;
+	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			enqueue = 0;
@@ -4540,6 +4543,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		if (enqueue)
 			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
 		cfs_rq->h_nr_running += task_delta;
+		cfs_rq->idle_h_nr_running += idle_task_delta;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
@@ -5092,6 +5096,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -5124,6 +5129,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
 		flags = ENQUEUE_WAKEUP;
 	}
@@ -5131,6 +5137,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
@@ -5157,6 +5164,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -5171,6 +5179,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -5190,6 +5199,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e052a50fcd..86a388c506ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -488,6 +488,8 @@ struct cfs_rq {
 	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;
+	/* h_nr_running for SCHED_IDLE tasks */
+	unsigned int		idle_h_nr_running;
 
 	u64			exec_clock;
 	u64			min_vruntime;
-- 
2.19.1.568.g152ad8e3369a



* [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks
  2018-11-26 11:20 [RFC][PATCH 0/2] sched: Power optimizations with SCHED_IDLE Viresh Kumar
  2018-11-26 11:20 ` [RFC][PATCH 1/2] sched: Start tracking SCHED_IDLE tasks count in cfs_rq Viresh Kumar
@ 2018-11-26 11:20 ` Viresh Kumar
  2018-11-26 12:37   ` Quentin Perret
  1 sibling, 1 reply; 6+ messages in thread
From: Viresh Kumar @ 2018-11-26 11:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Viresh Kumar, linux-kernel, Vincent Guittot, tkjos,
	Daniel Lezcano, quentin.perret, chris.redpath, Dietmar.Eggemann

The scheduler tries to schedule a newly woken task on an idle CPU to
make sure the new task gets a chance to run as soon as possible, for
performance reasons.

The SCHED_IDLE scheduling policy is used for tasks which have the lowest
priority and there is no hurry in running them. If all the tasks
currently enqueued on a CPU have their policy set to SCHED_IDLE, then
any new task (non SCHED_IDLE) enqueued on that CPU should normally get a
chance to run immediately. This patch takes advantage of this to save
power in some cases by avoiding waking up an idle CPU (which may be in
some deep idle state) and enqueuing the new task on a CPU which only has
SCHED_IDLE tasks.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/core.c  | 23 ++++++++++++++++++++
 kernel/sched/fair.c  | 50 +++++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h |  3 +++
 3 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d87a28da378..176eed77b18e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4020,6 +4020,29 @@ int available_idle_cpu(int cpu)
 	return 1;
 }
 
+/* CPU only has SCHED_IDLE tasks enqueued */
+int cpu_only_has_sched_idle_tasks(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->nr_running &&
+			rq->nr_running == rq->cfs.idle_h_nr_running);
+}
+
+int available_sched_idle_cpu(int cpu)
+{
+	if (vcpu_is_preempted(cpu))
+		return 0;
+
+	if (idle_cpu(cpu))
+		return 1;
+
+	if (cpu_only_has_sched_idle_tasks(cpu))
+		return 1;
+
+	return 0;
+}
+
 /**
  * idle_task - return the idle task for a given CPU.
  * @cpu: the processor in question.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad0b09ddddc0..3a029c740d51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5623,9 +5623,10 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * on one CPU.
 	 */
 	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+		return available_sched_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
-	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+	if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
+	    cpu_only_has_sched_idle_tasks(this_cpu))
 		return this_cpu;
 
 	return nr_cpumask_bits;
@@ -5888,6 +5889,9 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;
 			}
+		} else if (cpu_only_has_sched_idle_tasks(i) && !vcpu_is_preempted(i)) {
+			/* Prefer CPU with only SCHED_IDLE tasks */
+			return i;
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
 			if (load < min_load) {
@@ -6049,7 +6053,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, last_idle_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6057,11 +6061,18 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (available_idle_cpu(cpu))
-			return cpu;
+		if (!vcpu_is_preempted(cpu)) {
+			if (idle_cpu(cpu)) {
+				/* Prefer CPU with only SCHED_IDLE tasks */
+				last_idle_cpu = cpu;
+				continue;
+			}
+			if (cpu_only_has_sched_idle_tasks(cpu))
+				return cpu;
+		}
 	}
 
-	return -1;
+	return last_idle_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6089,7 +6100,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, last_idle_cpu = -1;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6116,12 +6127,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (!--nr)
-			return -1;
+		if (!--nr) {
+			if (last_idle_cpu == -1)
+				return -1;
+			cpu = last_idle_cpu;
+			break;
+		}
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (available_idle_cpu(cpu))
-			break;
+		if (!vcpu_is_preempted(cpu)) {
+			if (idle_cpu(cpu)) {
+				/* Prefer CPU with only SCHED_IDLE tasks */
+				last_idle_cpu = cpu;
+				continue;
+			}
+			if (cpu_only_has_sched_idle_tasks(cpu))
+				break;
+		}
 	}
 
 	time = local_clock() - time;
@@ -6140,13 +6162,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_sched_idle_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && available_sched_idle_cpu(prev))
 		return prev;
 
 	/* Check a recently used CPU as a potential idle candidate: */
@@ -6154,7 +6176,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    available_idle_cpu(recent_used_cpu) &&
+	    available_sched_idle_cpu(recent_used_cpu) &&
 	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86a388c506ac..ecd016c64ee2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1828,6 +1828,9 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern int cpu_only_has_sched_idle_tasks(int cpu);
+extern int available_sched_idle_cpu(int cpu);
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.19.1.568.g152ad8e3369a



* Re: [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks
  2018-11-26 11:20 ` [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks Viresh Kumar
@ 2018-11-26 12:37   ` Quentin Perret
  2018-11-27 10:24     ` Viresh Kumar
  0 siblings, 1 reply; 6+ messages in thread
From: Quentin Perret @ 2018-11-26 12:37 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Vincent Guittot,
	tkjos, Daniel Lezcano, quentin.perret, chris.redpath,
	Dietmar.Eggemann

Hi Viresh,

On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:
> The scheduler tries to schedule a newly woken task on an idle CPU to
> make sure the new task gets a chance to run as soon as possible, for
> performance reasons.
> 
> The SCHED_IDLE scheduling policy is used for tasks which have the lowest
> priority and there is no hurry in running them. If all the tasks
> currently enqueued on a CPU have their policy set to SCHED_IDLE, then
> any new task (non SCHED_IDLE) enqueued on that CPU should normally get a
> chance to run immediately. This patch takes advantage of this to save
> power in some cases by avoiding waking up an idle CPU (which may be in
> some deep idle state) and enqueuing the new task on a CPU which only has
> SCHED_IDLE tasks.

So, avoiding waking up a CPU isn't always good for energy. You may
prefer to spread tasks in order to keep the OPP low, for example. What
you're trying to achieve here can be actively harmful for both energy
and performance in some cases, I think.

Also, packing will reduce your chances to go cluster idle (yes you're
not guaranteed to go cluster idle either if you spread, depending on
how the tasks align in time, but at least there's a chance). So, even from
the idle perspective it's not obvious we actually want to do that.

And finally, the placement that this patch tries to achieve is
inherently unbalanced IIUC. So, unless you hide this behind the EAS
static key, you'll need to make sure the periodic/idle load balance code
doesn't kill all the work you do in the wake-up path. So I'm not sure
this patch really works in practice in its current state.

Now, I think you have a point by saying we could possibly be a bit
smarter with the way we deal with SCHED_IDLE tasks, especially if they
are going to be used more (is that a certainty BTW ?), I'm just not
entirely convinced with the 'power' argument yet.

Maybe there is something we could do if, say, we need to schedule a
SCHED_NORMAL task and all CPUs have roughly the same load and/or
utilization numbers, then if a CPU is busy running SCHED_IDLE tasks, we
should select it in priority since we know for a fact it's not running
anything important.
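
Something along those lines maybe (purely illustrative sketch;
roughly_equal_load() is a made-up placeholder for whatever margin-based
comparison we'd settle on):

/* Tie-break CPU selection in favour of CPUs running only SCHED_IDLE */
static int pick_cpu_tie_break(int best_cpu, int cpu)
{
	/* Only a tie-break: load/util must be roughly equivalent */
	if (!roughly_equal_load(best_cpu, cpu))	/* hypothetical helper */
		return best_cpu;

	/* Prefer the CPU we know runs nothing important */
	if (cpu_only_has_sched_idle_tasks(cpu))
		return cpu;

	return best_cpu;
}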

What do you think ?

Thanks,
Quentin


* Re: [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks
  2018-11-26 12:37   ` Quentin Perret
@ 2018-11-27 10:24     ` Viresh Kumar
  2018-11-27 11:00       ` Quentin Perret
  0 siblings, 1 reply; 6+ messages in thread
From: Viresh Kumar @ 2018-11-27 10:24 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Vincent Guittot,
	tkjos, Daniel Lezcano, quentin.perret, chris.redpath,
	Dietmar.Eggemann

Hi Quentin,

On 26-11-18, 12:37, Quentin Perret wrote:
> On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:
> > The scheduler tries to schedule a newly woken task on an idle CPU to
> > make sure the new task gets a chance to run as soon as possible, for
> > performance reasons.
> > 
> > The SCHED_IDLE scheduling policy is used for tasks which have the lowest
> > priority and there is no hurry in running them. If all the tasks
> > currently enqueued on a CPU have their policy set to SCHED_IDLE, then
> > any new task (non SCHED_IDLE) enqueued on that CPU should normally get a
> > chance to run immediately. This patch takes advantage of this to save
> > power in some cases by avoiding waking up an idle CPU (which may be in
> > some deep idle state) and enqueuing the new task on a CPU which only has
> > SCHED_IDLE tasks.
> 
> So, avoiding waking up a CPU isn't always good for energy. You may
> prefer to spread tasks in order to keep the OPP low, for example. What
> you're trying to achieve here can be actively harmful for both energy
> and performance in some cases, I think.

Yeah, we may end up packing SCHED_IDLE tasks onto a single CPU in this case.

We know that dynamic energy is significantly more than static energy and that is
what we should care more about. Yes, higher OPP should be avoided (apart from
performance reasons), but isn't it better (for power) to run a single CPU at
somewhat higher OPP (1GHz ?) instead of running four of them at lower OPPs (500
MHz) ?

> Also, packing will reduce your chances to go cluster idle (yes you're
> not guaranteed to go cluster idle either if you spread, depending on
> how the tasks align in time, but at least there's a chance). So, even from
> the idle perspective it's not obvious we actually want to do that.

But do we really want to fire up all the CPUs of a cluster to finish the
work earlier and go cluster idle ? We don't really believe in
race-to-idle; that's why we
have the whole DVFS thing here, right ?

> And finally, the placement that this patch tries to achieve is
> inherently unbalanced IIUC. So, unless you hide this behind the EAS
> static key, you'll need to make sure the periodic/idle load balance code
> doesn't kill all the work you do in the wake-up path. So I'm not sure
> this patch really works in practice in its current state.

True, I intentionally left the load-balancer code as is to avoid a larger
diff for now. The idea was to get more feedback on the whole thing before
investing too much in it.

> Now, I think you have a point by saying we could possibly be a bit
> smarter with the way we deal with SCHED_IDLE tasks, especially if they
> are going to be used more (is that a certainty BTW ?), I'm just not
> entirely convinced with the 'power' argument yet.

Todd confirmed earlier (privately) that most (?) of the Android background tasks
can actually be migrated to use SCHED_IDLE stuff as there is no urgency in
scheduling them normally.

@Todd, can you please provide some inputs here as well ?

> Maybe there is something we could do if, say, we need to schedule a
> SCHED_NORMAL task and all CPUs have roughly the same load and/or
> utilization numbers, then if a CPU is busy running SCHED_IDLE tasks, we
> should select it in priority since we know for a fact it's not running
> anything important.
> 
> What do you think ?

Sure, I am not saying that the approach taken by this patch is the best or the
worst. We need to come up with a better policy for how we can benefit from the
SCHED_IDLE policy and that's where I am looking for inputs from all of you.

Thanks for the feedback.

-- 
viresh


* Re: [RFC][PATCH 2/2] sched: Enqueue tasks on a cpu with only SCHED_IDLE tasks
  2018-11-27 10:24     ` Viresh Kumar
@ 2018-11-27 11:00       ` Quentin Perret
  0 siblings, 0 replies; 6+ messages in thread
From: Quentin Perret @ 2018-11-27 11:00 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Vincent Guittot,
	tkjos, Daniel Lezcano, quentin.perret, chris.redpath,
	Dietmar.Eggemann

Hi Viresh,

On Tuesday 27 Nov 2018 at 15:54:42 (+0530), Viresh Kumar wrote:
> Hi Quentin,
> 
> On 26-11-18, 12:37, Quentin Perret wrote:
> > On Monday 26 Nov 2018 at 16:50:24 (+0530), Viresh Kumar wrote:
> > > The scheduler tries to schedule a newly woken task on an idle CPU to
> > > make sure the new task gets a chance to run as soon as possible, for
> > > performance reasons.
> > > 
> > > The SCHED_IDLE scheduling policy is used for tasks which have the lowest
> > > priority and there is no hurry in running them. If all the tasks
> > > currently enqueued on a CPU have their policy set to SCHED_IDLE, then
> > > any new task (non SCHED_IDLE) enqueued on that CPU should normally get a
> > > chance to run immediately. This patch takes advantage of this to save
> > > power in some cases by avoiding waking up an idle CPU (which may be in
> > > some deep idle state) and enqueuing the new task on a CPU which only has
> > > SCHED_IDLE tasks.
> > 
> > So, avoiding waking up a CPU isn't always good for energy. You may
> > prefer to spread tasks in order to keep the OPP low, for example. What
> > you're trying to achieve here can be actively harmful for both energy
> > and performance in some cases, I think.
> 
> Yeah, we may end up packing SCHED_IDLE tasks onto a single CPU in this case.
> 
> We know that dynamic energy is significantly more than static energy and that is
> what we should care more about. Yes, higher OPP should be avoided (apart from
> performance reasons), but isn't it better (for power) to run a single CPU at
> somewhat higher OPP (1GHz ?) instead of running four of them at lower OPPs (500
> MHz) ?

I guess that really depends on your platform (which is why EAS is using
an Energy Model BTW, it's pretty hard to find one heuristic that works
well for all topologies out there). But your example is a bit unfair I
think. You should compare 1 CPU at 1GHz vs 4 CPUs at 250MHz. Otherwise
you're not resourcing the tasks adequately in one of the cases.
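
Back-of-the-envelope, assuming dynamic power scales as C * V^2 * f and V
scales roughly linearly with f (so P_dyn ~ f^3): four CPUs at 250MHz
would burn about 4 * (1/4)^3 = 1/16 of the dynamic power of one CPU at
1GHz for the same total throughput. The exact numbers depend entirely on
the platform's V/f curve of course, which is why the comparison has to
be made at equivalent capacity.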

> 
> > Also, packing will reduce your chances to go cluster idle (yes you're
> > not guaranteed to go cluster idle either if you spread, depending on
> > how the tasks align in time, but at least there's a chance). So, even from
> > the idle perspective it's not obvious we actually want to do that.
> 
> But do we really want to fire up all the CPUs of a cluster to finish the
> work earlier and go cluster idle ? We don't really believe in
> race-to-idle; that's why we
> have the whole DVFS thing here, right ?

Right, I'm certainly not advocating for a race-to-idle policy here. What
I'm saying is that, if you can avoid raising the OPP by spreading, it's
often a good thing from an energy standpoint, because as you said the
dynamic energy is generally the most expensive part. But even if you can
pack the tasks on a single CPU without having to raise the OPP, it's not
obvious this is the right thing to do either since that will prevent you
from going cluster idle (or reduce the time you _could_ spend cluster
idle at least).

That kind of packing vs spreading energy assessment is really hard to
do in general, especially without knowing the costs of the
different idle states.

> > And finally, the placement that this patch tries to achieve is
> > inherently unbalanced IIUC. So, unless you hide this behind the EAS
> > static key, you'll need to make sure the periodic/idle load balance code
> > doesn't kill all the work you do in the wake-up path. So I'm not sure
> > this patch really works in practice in its current state.
> 
> True, I intentionally left the load-balancer code as is to avoid a larger
> diff for now. The idea was to get more feedback on the whole thing before
> investing too much in it.

OK I see :-)

> 
> > Now, I think you have a point by saying we could possibly be a bit
> > smarter with the way we deal with SCHED_IDLE tasks, especially if they
> > are going to be used more (is that a certainty BTW ?), I'm just not
> > entirely convinced with the 'power' argument yet.
> 
> Todd confirmed earlier (privately) that most (?) of the Android background tasks
> can actually be migrated to use SCHED_IDLE stuff as there is no urgency in
> scheduling them normally.

Ah, that's interesting ... If there are tasks in Android that we really
don't care about (that is, it's actually fine to starve them), then maybe
we should put those in SCHED_IDLE indeed ... That'll leave the stage for
the other tasks that do have stronger requirements.

So yeah, I agree it's worth investigating.

> @Todd, can you please provide some inputs here as well ?
> 
> > Maybe there is something we could do if, say, we need to schedule a
> > SCHED_NORMAL task and all CPUs have roughly the same load and/or
> > utilization numbers, then if a CPU is busy running SCHED_IDLE tasks, we
> > should select it in priority since we know for a fact it's not running
> > anything important.
> > 
> > What do you think ?
> 
> Sure, I am not saying that the approach taken by this patch is the best or the
> worst. We need to come up with a better policy for how we can benefit from the
> SCHED_IDLE policy and that's where I am looking for inputs from all of you.

Right, so my overall advice would be to try and avoid hard-coding a pure
packing heuristic like that (unless you have loads of numbers to back up
the idea and show it works well), but perhaps to use the policy of the
tasks to try and break the tie between CPU candidates that cannot be
differentiated otherwise because the other metrics (load / utilization)
are roughly equivalent.

We really ought to make sure, however, that we have a strong use case
for SCHED_IDLE tasks in Android or elsewhere before adding any kind of
infrastructure for it.

Anyways, just my two cents ...

> Thanks for the feedback.

I hope that's useful :-)

Thanks,
Quentin

