* [PATCH 0/2] sched/fair: remote load updates for idle CPUs
@ 2017-10-24 12:25 Brendan Jackman
  2017-10-24 12:25 ` [PATCH 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
  2017-10-24 12:25 ` [PATCH 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
  0 siblings, 2 replies; 8+ messages in thread
From: Brendan Jackman @ 2017-10-24 12:25 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel

Until a NOHZ idle balance takes place on behalf of a CPU (which may
never happen), the blocked load and shares of its root cfs_rq are
updated only by that CPU. That means if a CPU goes suddenly from
being busy to totally idle, its load and shares may not be updated
for a long time.

Schedutil works around this problem by ignoring the util of CPUs
that were last updated more than a tick ago. However, the stale
load still impacts task placement: code paths that look at load
and util (in particular the slow path of select_task_rq_fair())
can leave the idle CPUs unused while other CPUs become
unnecessarily overloaded. Furthermore, the stale shares can
distort CPU time allotment.

Two complementary solutions are proposed here:
1. When a task wakes up, if necessary an idle CPU is woken as if to
   perform a NOHZ idle balance, which is then aborted once the load
   of NOHZ idle CPUs has been updated. This solves the problem but
   brings with it extra CPU wakeups, which have an energy cost.
2. During newly-idle load balancing, the load of remote nohz-idle
   CPUs in the sched_domain is updated. If all of the nohz-idle CPUs
   were covered by that update, the nohz.next_update field is pushed
   further into the future. This field determines when the
   newly-added NOHZ kick needs to be triggered, so if such
   newly-idle balances happen often enough, no additional CPU
   wakeups are required to keep all the CPUs' loads updated (a rough
   sketch of the combined flow follows below).
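
As a rough illustration of how the two mechanisms are meant to fit
together, here is a minimal user-space sketch (not kernel code; the
names deliberately mirror nohz.next_update and LOAD_AVG_PERIOD, and
everything else is simplified): a stats-only kick is needed at wakeup
only once nohz.next_update has passed, and a newly-idle balance that
refreshed all of the idle CPUs pushes that deadline out again.

#include <stdbool.h>
#include <stdio.h>

#define LOAD_AVG_PERIOD 32              /* jiffies here; the patch uses msecs_to_jiffies() */

static unsigned long jiffies;           /* stand-in for the kernel's jiffies */
static unsigned long nohz_next_update;  /* mirrors nohz.next_update */

/* Wrap-safe "a is before b", like the kernel's time_before() */
static bool time_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;
}

/* Wakeup path (solution 1): kick an idle CPU only if stats are stale */
static bool stats_kick_needed(void)
{
        return !time_before(jiffies, nohz_next_update);
}

/* Newly-idle balance that updated every nohz-idle CPU (solution 2) */
static void newly_idle_updated_all_idle_cpus(void)
{
        nohz_next_update = jiffies + LOAD_AVG_PERIOD;
}

int main(void)
{
        jiffies = 1000;
        newly_idle_updated_all_idle_cpus();

        jiffies += 10;
        printf("stats kick needed after 10 jiffies: %d\n", stats_kick_needed()); /* 0 */

        jiffies += 30;
        printf("stats kick needed after 40 jiffies: %d\n", stats_kick_needed()); /* 1 */
        return 0;
}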


Brendan Jackman (1):
  sched/fair: Update blocked load from newly idle balance

Vincent Guittot (1):
  sched: force update of blocked load of idle cpus

 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 109 ++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   2 +
 3 files changed, 98 insertions(+), 14 deletions(-)

-- 
2.14.1


* [PATCH 1/2] sched: force update of blocked load of idle cpus
  2017-10-24 12:25 [PATCH 0/2] sched/fair: remote load updates for idle CPUs Brendan Jackman
@ 2017-10-24 12:25 ` Brendan Jackman
  2017-11-09 19:56   ` Todd Kjos
  2017-11-20  9:04   ` Vincent Guittot
  2017-10-24 12:25 ` [PATCH 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
  1 sibling, 2 replies; 8+ messages in thread
From: Brendan Jackman @ 2017-10-24 12:25 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel
  Cc: Ingo Molnar, Morten Rasmussen

From: Vincent Guittot <vincent.guittot@linaro.org>

When idle, the blocked load of CPUs will be updated only when an idle
load balance is triggered, which may never happen. Because of this
uncertainty about when an idle load balance will run, the utilization,
the load and the shares of idle cfs_rqs can stay artificially high and
steal shares and running time from the busy cfs_rqs of the task group.
Add a new light idle load balance state which ensures that blocked
loads are periodically updated and decayed but does not perform any
task migration.

The remote load updates are rate-limited, so that they are not
performed with a shorter period than LOAD_AVG_PERIOD (i.e. the PELT
half-life). This is the period after which stale load carries a
known 50% error.
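
As a side note (not part of the patch; an assumed, user-space
calculation for illustration only): PELT decays load geometrically
with a half-life of LOAD_AVG_PERIOD (32) ms, so load left stale for
one LOAD_AVG_PERIOD overstates the true, decayed value by roughly a
factor of two:

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* Per-ms PELT decay factor y, chosen so that y^32 == 0.5 */
        const double y = pow(0.5, 1.0 / 32.0);
        int ms;

        for (ms = 0; ms <= 64; ms += 16)
                printf("true load remaining after %2d ms idle: %5.1f%% of the stale value\n",
                       ms, 100.0 * pow(y, ms));
        return 0;
}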

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[Switched remote update interval to use PELT half life]
[Moved update_blocked_averages call outside rebalance_domains
 to simplify code]
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/fair.c  | 71 +++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  1 +
 2 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85d1ec1c3b39..9085caf49c76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5976,6 +5976,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update);
+static void nohz_balancer_kick(bool only_update);
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6074,6 +6077,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 	rcu_read_unlock();
 
+#ifdef CONFIG_NO_HZ_COMMON
+	if (nohz_kick_needed(cpu_rq(new_cpu), true))
+		nohz_balancer_kick(true);
+#endif
+
 	return new_cpu;
 }
 
@@ -8653,6 +8661,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	atomic_t nr_cpus;
 	unsigned long next_balance;     /* in jiffy units */
+	unsigned long next_update;     /* in jiffy units */
 } nohz ____cacheline_aligned;
 
 static inline int find_new_ilb(void)
@@ -8670,7 +8679,7 @@ static inline int find_new_ilb(void)
  * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
  * CPU (if there is one).
  */
-static void nohz_balancer_kick(void)
+static void nohz_balancer_kick(bool only_update)
 {
 	int ilb_cpu;
 
@@ -8683,6 +8692,10 @@ static void nohz_balancer_kick(void)
 
 	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
 		return;
+
+	if (only_update)
+		set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
+
 	/*
 	 * Use smp_send_reschedule() instead of resched_cpu().
 	 * This way we generate a sched IPI on the target cpu which
@@ -8770,6 +8783,8 @@ void nohz_balance_enter_idle(int cpu)
 	atomic_inc(&nohz.nr_cpus);
 	set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
 }
+#else
+static inline void nohz_balancer_kick(bool only_update) {}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -8801,8 +8816,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	int need_serialize, need_decay = 0;
 	u64 max_cost = 0;
 
-	update_blocked_averages(cpu);
-
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		/*
@@ -8901,6 +8914,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
+	struct sched_domain *sd;
 	int balance_cpu;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
@@ -8910,6 +8924,23 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
+	/*
+	 * This cpu is going to update the blocked load of idle CPUs either
+	 * before doing a rebalancing or just to keep metrics up to date. we
+	 * can safely update the next update timestamp
+	 */
+	rcu_read_lock();
+	sd = rcu_dereference(this_rq->sd);
+	/*
+	 * Check whether there is a sched_domain available for this cpu.
+	 * The last other cpu can have been unplugged since the ILB has been
+	 * triggered and the sched_domain can now be null. The idle balance
+	 * sequence will quickly be aborted as there is no more idle CPUs
+	 */
+	if (sd)
+		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
+	rcu_read_unlock();
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
@@ -8936,7 +8967,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 			cpu_load_update_idle(rq);
 			rq_unlock_irq(rq, &rf);
 
-			rebalance_domains(rq, CPU_IDLE);
+			update_blocked_averages(balance_cpu);
+			/*
+			 * This idle load balance softirq may have been
+			 * triggered only to update the blocked load and shares
+			 * of idle CPUs (which we have just done for
+			 * balance_cpu). In that case skip the actual balance.
+			 */
+			if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)))
+				rebalance_domains(rq, idle);
 		}
 
 		if (time_after(next_balance, rq->next_balance)) {
@@ -8967,7 +9006,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline bool nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update)
 {
 	unsigned long now = jiffies;
 	struct sched_domain_shared *sds;
@@ -8975,7 +9014,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	int nr_busy, i, cpu = rq->cpu;
 	bool kick = false;
 
-	if (unlikely(rq->idle_balance))
+	if (unlikely(rq->idle_balance) && !only_update)
 		return false;
 
        /*
@@ -8992,6 +9031,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	if (likely(!atomic_read(&nohz.nr_cpus)))
 		return false;
 
+	if (only_update) {
+		if (time_before(now, nohz.next_update))
+			return false;
+		else
+			return true;
+	}
+
 	if (time_before(now, nohz.next_balance))
 		return false;
 
@@ -9041,6 +9087,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update) { return false; }
 #endif
 
 /*
@@ -9062,7 +9109,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
 	nohz_idle_balance(this_rq, idle);
-	rebalance_domains(this_rq, idle);
+	update_blocked_averages(this_rq->cpu);
+	if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu)))
+		rebalance_domains(this_rq, idle);
+#ifdef CONFIG_NO_HZ_COMMON
+	clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
+#endif
 }
 
 /*
@@ -9077,8 +9129,8 @@ void trigger_load_balance(struct rq *rq)
 	if (time_after_eq(jiffies, rq->next_balance))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ_COMMON
-	if (nohz_kick_needed(rq))
-		nohz_balancer_kick();
+	if (nohz_kick_needed(rq, false))
+		nohz_balancer_kick(false);
 #endif
 }
 
@@ -9657,6 +9709,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ_COMMON
 	nohz.next_balance = jiffies;
+	nohz.next_update = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 #endif
 #endif /* SMP */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14db76cd496f..6f95ef653f73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1978,6 +1978,7 @@ extern void cfs_bandwidth_usage_dec(void);
 enum rq_nohz_flag_bits {
 	NOHZ_TICK_STOPPED,
 	NOHZ_BALANCE_KICK,
+	NOHZ_STATS_KICK
 };
 
 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
-- 
2.14.1


* [PATCH 2/2] sched/fair: Update blocked load from newly idle balance
  2017-10-24 12:25 [PATCH 0/2] sched/fair: remote load updates for idle CPUs Brendan Jackman
  2017-10-24 12:25 ` [PATCH 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
@ 2017-10-24 12:25 ` Brendan Jackman
  2017-11-20  9:07   ` Vincent Guittot
  1 sibling, 1 reply; 8+ messages in thread
From: Brendan Jackman @ 2017-10-24 12:25 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel
  Cc: Ingo Molnar, Morten Rasmussen

We now have a NOHZ kick to avoid the load of idle CPUs becoming stale. This is
good, but it brings about CPU wakeups, which have an energy cost. As an
alternative to waking CPUs up to decay blocked load, we can sometimes do it
from newly idle balance. If the newly idle balance is on a domain that covers
all the currently nohz-idle CPUs, we push the value of nohz.next_update into
the future. That means that if such newly idle balances happen often enough,
we never need to wake up a CPU just to update load.

Since we're doing this new update inside a for_each_domain() loop, we need to
avoid doing multiple updates on the same CPU within one idle_balance(). A tick
stamp set on the rq in update_blocked_averages() is a simple way to do this.
Using a simple jiffies-based timestamp, as opposed to the last_update_time of
the root cfs_rq's sched_avg, means we can check it without taking the rq lock.
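
As an aside (a user-space sketch under assumed simplifications, not
taken from the patch: cpumasks are modelled as plain bitmasks and the
names are invented for the example), the "covers all the currently
nohz-idle CPUs" condition is just a subset test on the domain span:

#include <stdbool.h>
#include <stdio.h>

/* Like cpumask_subset(a, b): every CPU set in a is also set in b */
static bool mask_subset(unsigned long a, unsigned long b)
{
        return (a & ~b) == 0;
}

int main(void)
{
        unsigned long nohz_idle_cpus = 0x6; /* CPUs 1 and 2 are nohz-idle */
        unsigned long llc_span       = 0x3; /* lowest domain spans CPUs 0-1 */
        unsigned long top_span       = 0xf; /* top domain spans CPUs 0-3 */

        /*
         * Only a newly-idle balance at the top level has updated every
         * nohz-idle CPU, so only it may push nohz.next_update back.
         */
        printf("LLC-level balance defers the kick: %d\n",
               mask_subset(nohz_idle_cpus, llc_span)); /* 0 */
        printf("top-level balance defers the kick: %d\n",
               mask_subset(nohz_idle_cpus, top_span)); /* 1 */
        return 0;
}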

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 41 +++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  1 +
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..d8e71fd27806 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5923,6 +5923,7 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
 		rq->last_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = jiffies;
 		rq->nohz_flags = 0;
 #endif
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9085caf49c76..45e9c8056161 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7062,6 +7062,7 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
+	rq->last_blocked_load_update_tick = jiffies;
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7121,6 +7122,7 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	rq->last_blocked_load_update_tick = jiffies;
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7615,6 +7617,15 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_NO_HZ_COMMON
+static struct {
+	cpumask_var_t idle_cpus_mask;
+	atomic_t nr_cpus;
+	unsigned long next_balance;     /* in jiffy units */
+	unsigned long next_update;     /* in jiffy units */
+} nohz ____cacheline_aligned;
+#endif
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -7633,6 +7644,30 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+#ifdef CONFIG_NO_HZ_COMMON
+	if (env->idle == CPU_NEWLY_IDLE) {
+		int cpu;
+
+		/* Update the stats of NOHZ idle CPUs in the sd */
+		for_each_cpu_and(cpu, sched_domain_span(env->sd),
+				 nohz.idle_cpus_mask) {
+			struct rq *rq = cpu_rq(cpu);
+
+			/* ... Unless we've already done since the last tick */
+			if (time_after(jiffies,
+                                       rq->last_blocked_load_update_tick))
+				update_blocked_averages(cpu);
+		}
+	}
+	/*
+	 * If we've just updated all of the NOHZ idle CPUs, then we can push
+	 * back the next nohz.next_update, which will prevent an unnecessary
+	 * wakeup for the nohz stats kick
+	 */
+	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd)))
+		nohz.next_update = jiffies + LOAD_AVG_PERIOD;
+#endif
+
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -8657,12 +8692,6 @@ static inline int on_null_domain(struct rq *rq)
  *   needed, they will kick the idle load balancer, which then does idle
  *   load balancing for all the idle CPUs.
  */
-static struct {
-	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
-	unsigned long next_balance;     /* in jiffy units */
-	unsigned long next_update;     /* in jiffy units */
-} nohz ____cacheline_aligned;
 
 static inline int find_new_ilb(void)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6f95ef653f73..6be8938bb977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -681,6 +681,7 @@ struct rq {
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
 	unsigned long last_load_update_tick;
+	unsigned long last_blocked_load_update_tick;
 #endif /* CONFIG_SMP */
 	unsigned long nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
-- 
2.14.1


* Re: [PATCH 1/2] sched: force update of blocked load of idle cpus
  2017-10-24 12:25 ` [PATCH 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
@ 2017-11-09 19:56   ` Todd Kjos
  2017-11-10 14:53     ` Brendan Jackman
  2017-11-20  9:04   ` Vincent Guittot
  1 sibling, 1 reply; 8+ messages in thread
From: Todd Kjos @ 2017-11-09 19:56 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	LKML, Ingo Molnar, Morten Rasmussen

> @@ -8683,6 +8692,10 @@ static void nohz_balancer_kick(void)
>
>         if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
>                 return;
> +
> +       if (only_update)
> +               set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));

Should there be an "else clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));" ?

Seems like any time this is called as !only_update, we should stop
inhibiting rebalance. In fact, we should perhaps go a little further
so that an only_update never inhibits rebalance from a concurrent
!only_update.


* Re: [PATCH 1/2] sched: force update of blocked load of idle cpus
  2017-11-09 19:56   ` Todd Kjos
@ 2017-11-10 14:53     ` Brendan Jackman
  0 siblings, 0 replies; 8+ messages in thread
From: Brendan Jackman @ 2017-11-10 14:53 UTC (permalink / raw)
  To: Todd Kjos
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	LKML, Ingo Molnar, Morten Rasmussen


Hi Todd,

On Thu, Nov 09 2017 at 19:56, Todd Kjos wrote:
>> @@ -8683,6 +8692,10 @@ static void nohz_balancer_kick(void)
>>
>>         if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
>>                 return;
>> +
>> +       if (only_update)
>> +               set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
>
> Should there be an "else clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));" ?
>
> Seems like any time this is called as !only_update, we should stop
> inhibiting rebalance. In fact, we should perhaps go a little further
> so that an only_update never inhibits rebalance from a concurrent
> !only_update.

Yes, I think you are essentially right. To make sure I understand you: I
guess you are saying that where two CPUs are concurrently in
nohz_balancer_kick, one with only_update=1 and one with only_update=0,
the former should not prevent the latter from triggering a full load
balance. (I'm assuming they both get the same value for ilb_cpu).

The exact solution you described won't work: only one of those CPUs can
get to this line of code, because of the test_and_set_bit above. So I
think we need something like:

       if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
               if (!only_update) {
                      /*
                       * There's a pending stats kick or an ongoing
                       * nohz_idle balance that's just for stats.
                       * Convert it to a proper nohz balance.
                       */
                      clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
               }
               return;
       }

       if (only_update)
               set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));

There's still scope for some lost rebalance_domains calls, but as
Vincent pointed out in a recent chat, because the rq->next_balance
fields won't have been changed for the CPUs that get missed out,
those balances will only be delayed until the next scheduler_tick.

I'm on holiday ATM, I'll get to testing that when I get back.

Thanks for the review,
Brendan


* Re: [PATCH 1/2] sched: force update of blocked load of idle cpus
  2017-10-24 12:25 ` [PATCH 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
  2017-11-09 19:56   ` Todd Kjos
@ 2017-11-20  9:04   ` Vincent Guittot
  2017-11-30 15:59     ` Brendan Jackman
  1 sibling, 1 reply; 8+ messages in thread
From: Vincent Guittot @ 2017-11-20  9:04 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On 24 October 2017 at 14:25, Brendan Jackman <brendan.jackman@arm.com> wrote:
> [...]
> @@ -9062,7 +9109,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>          * and abort nohz_idle_balance altogether if we pull some load.
>          */
>         nohz_idle_balance(this_rq, idle);
> -       rebalance_domains(this_rq, idle);
> +       update_blocked_averages(this_rq->cpu);
> +       if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu)))

The NOHZ_STATS_KICK field is only defined with CONFIG_NO_HZ_COMMON.

> +               rebalance_domains(this_rq, idle);
> +#ifdef CONFIG_NO_HZ_COMMON
> +       clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
> +#endif
>  }
> [...]


* Re: [PATCH 2/2] sched/fair: Update blocked load from newly idle balance
  2017-10-24 12:25 ` [PATCH 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
@ 2017-11-20  9:07   ` Vincent Guittot
  0 siblings, 0 replies; 8+ messages in thread
From: Vincent Guittot @ 2017-11-20  9:07 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On 24 October 2017 at 14:25, Brendan Jackman <brendan.jackman@arm.com> wrote:
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9085caf49c76..45e9c8056161 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7062,6 +7062,7 @@ static void update_blocked_averages(int cpu)
>                 if (cfs_rq_is_decayed(cfs_rq))
>                         list_del_leaf_cfs_rq(cfs_rq);
>         }
> +       rq->last_blocked_load_update_tick = jiffies;

last_blocked_load_update_tick is defined under CONFIG_NO_HZ_COMMON and
CONFIG_SMP, whereas update_blocked_averages() is not. This generates a
compilation error.

>         rq_unlock_irqrestore(rq, &rf);
>  }
> [...]


* Re: [PATCH 1/2] sched: force update of blocked load of idle cpus
  2017-11-20  9:04   ` Vincent Guittot
@ 2017-11-30 15:59     ` Brendan Jackman
  0 siblings, 0 replies; 8+ messages in thread
From: Brendan Jackman @ 2017-11-30 15:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Ingo Molnar, Morten Rasmussen

Hi Vincent,

On Mon, Nov 20 2017 at 09:04, Vincent Guittot wrote:
> On 24 October 2017 at 14:25, Brendan Jackman <brendan.jackman@arm.com> wrote:
>> @@ -9062,7 +9109,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>>          * and abort nohz_idle_balance altogether if we pull some load.
>>          */
>>         nohz_idle_balance(this_rq, idle);
>> -       rebalance_domains(this_rq, idle);
>> +       update_blocked_averages(this_rq->cpu);
>> +       if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu)))
>
> The NOHZ_STATS_KICK field is only defined with CONFIG_NO_HZ_COMMON.

Damn, sorry. Will fix this and the similar issue you pointed out on
patch 2/2 and send a v2.

Thanks for reviewing,
Brendan


