* [PATCH v2 0/2] sched/fair: remote load updates for idle CPUs
@ 2017-12-01 18:01 Brendan Jackman
  2017-12-01 18:01 ` [PATCH v2 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
  2017-12-01 18:01 ` [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
  0 siblings, 2 replies; 13+ messages in thread
From: Brendan Jackman @ 2017-12-01 18:01 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel

Hi,

Here is v2, which was delayed by my holiday.

Until a NOHZ idle balance takes place on behalf of a CPU (which may
never happen), the blocked load of its root cfs_rq and its
contributions to task group shares are updated only by that CPU. That
means if a CPU suddenly goes from busy to completely idle, its load
and effect on shares may not be updated for a long time.

Schedutil works around this problem by ignoring the util of CPUs
that were last updated more than a tick ago. However, the stale
load still impacts task placement: code paths that look at load and
util (in particular the slow path of select_task_rq_fair) can
leave idle CPUs unused while other CPUs become unnecessarily
overloaded. Furthermore, the stale shares can distort CPU time
allotment.

Two complementary solutions are proposed here:
1. When a task wakes up, an idle CPU is kicked if necessary, as if to
   perform a NOHZ idle balance; that balance is then aborted once the
   load of the NOHZ-idle CPUs has been updated. This solves the
   problem but brings with it extra CPU wakeups, which have an energy
   cost.
2. During newly-idle load balancing, the load of remote nohz-idle
   CPUs in the sched_domain is updated. If all of the idle CPUs are
   covered by that step, the nohz.next_update field is pushed further
   into the future. This field determines when the newly-added NOHZ
   kick needs to be triggered, so if such newly-idle balances happen
   often enough, no additional CPU wakeups are required to keep every
   CPU's load and shares contribution updated (a toy model of the
   combined flow is sketched below).
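
To make that interplay concrete, here is a toy userspace model of the
combined flow. It is illustrative only: this is not the kernel code,
the names merely follow the patches, and the timer details are
simplified.

#include <stdbool.h>
#include <stdio.h>

#define LOAD_AVG_PERIOD	32	/* PELT half-life; treated as jiffies here */

static unsigned long jiffies;
static unsigned long nohz_next_update;
static bool nohz_stats_kick;

/* Patch 1: the wake-up path kicks an idle CPU for a stats-only update. */
static void select_task_rq_fair(void)
{
	if (jiffies >= nohz_next_update) {	/* nohz_kick_needed(rq, true) */
		nohz_stats_kick = true;		/* nohz_balancer_kick(true) */
		printf("kick: stats-only update of nohz-idle CPUs\n");
	}
}

/* The kicked CPU decays blocked loads, then skips the real balance. */
static void nohz_idle_balance(void)
{
	nohz_next_update = jiffies + LOAD_AVG_PERIOD;
	printf("update_blocked_averages() on each nohz-idle CPU\n");
	if (!nohz_stats_kick)
		printf("rebalance_domains()\n");
	nohz_stats_kick = false;
}

/* Patch 2: a newly-idle balance covering all idle CPUs defers the kick. */
static void newly_idle_balance(void)
{
	printf("remote update_blocked_averages() from newly-idle balance\n");
	nohz_next_update = jiffies + LOAD_AVG_PERIOD;	/* no wakeup needed */
}

int main(void)
{
	newly_idle_balance();		/* keeps loads fresh... */
	jiffies += 10;
	select_task_rq_fair();		/* ...so no kick is needed here */
	jiffies += LOAD_AVG_PERIOD;
	select_task_rq_fair();		/* stale again: stats kick */
	nohz_idle_balance();		/* update loads, skip the balance */
	return 0;
}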

Changes v1 -> v2:

 - Vincent pointed out I broke !CONFIG_NO_HZ_COMMON; fixed that.

 - Todd Kjos pointed out that a stats kick from CPU A can inhibit a
   full balance kick from CPU B. Reduced the scope for that by having
   CPU B convert the pending/ongoing stats kick to a proper balance,
   by clearing the NOHZ_STATS_KICK bit in nohz_balancer_kick().

Brendan Jackman (1):
  sched/fair: Update blocked load from newly idle balance

Vincent Guittot (1):
  sched: force update of blocked load of idle cpus

 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 128 +++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   2 +
 3 files changed, 116 insertions(+), 15 deletions(-)

--
2.14.1


* [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-01 18:01 [PATCH v2 0/2] sched/fair: remote load updates for idle CPUs Brendan Jackman
@ 2017-12-01 18:01 ` Brendan Jackman
  2017-12-20 14:03   ` Peter Zijlstra
  2017-12-20 14:09   ` Peter Zijlstra
  2017-12-01 18:01 ` [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
  1 sibling, 2 replies; 13+ messages in thread
From: Brendan Jackman @ 2017-12-01 18:01 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel
  Cc: Ingo Molnar, Morten Rasmussen

From: Vincent Guittot <vincent.guittot@linaro.org>

When idle, the blocked load of a CPU is updated only when an idle
load balance is triggered, which may never happen. Because of this
uncertainty about when the idle load balance executes, the
utilization, the load and the shares of an idle cfs_rq can stay
artificially high and steal shares and running time from the busy
cfs_rqs of the task group.
Add a new lightweight idle load balance state which ensures that
blocked loads are periodically updated and decayed, but which does
not perform any task migration.

The remote load updates are rate-limited, so that they are not
performed with a shorter period than LOAD_AVG_PERIOD (i.e. the PELT
half-life). This is the period after which a stale load signal has a
known 50% error.
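
For intuition, LOAD_AVG_PERIOD is 32 (in ms): PELT decays a signal by
a factor y per 1ms period, with y chosen so that y^32 = 0.5. The
following standalone check of that arithmetic is an illustration, not
part of the patch:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-period decay: y^32 == 0.5 */
	double load = 1024.0;			/* a fully-loaded PELT signal */
	int ms;

	for (ms = 0; ms < 32; ms++)
		load *= y;

	/* After one LOAD_AVG_PERIOD the stale signal has decayed by half: */
	printf("after 32ms: %.1f of 1024.0\n", load);	/* prints 512.0 */
	return 0;
}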

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[Switched remote update interval to use the PELT half-life]
[Moved update_blocked_averages call outside rebalance_domains
 to simplify code]
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/fair.c  | 86 ++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  1 +
 2 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..f83e8f0d4f06 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6258,6 +6258,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update);
+static void nohz_balancer_kick(bool only_update);
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6334,6 +6337,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 	rcu_read_unlock();
 
+#ifdef CONFIG_NO_HZ_COMMON
+	if (nohz_kick_needed(cpu_rq(new_cpu), true))
+		nohz_balancer_kick(true);
+#endif
+
 	return new_cpu;
 }
 
@@ -8927,6 +8935,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	atomic_t nr_cpus;
 	unsigned long next_balance;     /* in jiffy units */
+	unsigned long next_update;     /* in jiffy units */
 } nohz ____cacheline_aligned;
 
 static inline int find_new_ilb(void)
@@ -8944,7 +8953,7 @@ static inline int find_new_ilb(void)
  * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
  * CPU (if there is one).
  */
-static void nohz_balancer_kick(void)
+static void nohz_balancer_kick(bool only_update)
 {
 	int ilb_cpu;
 
@@ -8955,8 +8964,20 @@ static void nohz_balancer_kick(void)
 	if (ilb_cpu >= nr_cpu_ids)
 		return;
 
-	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
+	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
+		if (!only_update) {
+			/*
+			 * There's a pending/ongoing nohz kick/balance. If it's
+			 * just for stats, convert it to a proper load balance.
+			 */
+			clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
+		}
 		return;
+	}
+
+	if (only_update)
+		set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
+
 	/*
 	 * Use smp_send_reschedule() instead of resched_cpu().
 	 * This way we generate a sched IPI on the target cpu which
@@ -9044,6 +9065,8 @@ void nohz_balance_enter_idle(int cpu)
 	atomic_inc(&nohz.nr_cpus);
 	set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
 }
+#else
+static inline void nohz_balancer_kick(bool only_update) {}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
@@ -9075,8 +9098,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	int need_serialize, need_decay = 0;
 	u64 max_cost = 0;
 
-	update_blocked_averages(cpu);
-
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		/*
@@ -9167,6 +9188,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
+
+/* True iff @cpu was kicked for nohz balance, but just to update blocked load */
+static inline bool in_nohz_stats_kick(int cpu)
+{
+	return test_bit(NOHZ_STATS_KICK, nohz_flags(cpu));
+}
+
 /*
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
@@ -9175,6 +9203,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
+	struct sched_domain *sd;
 	int balance_cpu;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
@@ -9184,6 +9213,23 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
+	/*
+	 * This CPU is going to update the blocked load of idle CPUs either
+	 * before doing a rebalancing, or just to keep metrics up to date. We
+	 * can safely update the next-update timestamp.
+	 */
+	rcu_read_lock();
+	sd = rcu_dereference(this_rq->sd);
+	/*
+	 * Check whether there is a sched_domain available for this cpu.
+	 * The last other CPU may have been unplugged since the ILB was
+	 * triggered, and the sched_domain can now be NULL. The idle balance
+	 * sequence will quickly be aborted as there are no more idle CPUs.
+	 */
+	if (sd)
+		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
+	rcu_read_unlock();
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
@@ -9210,7 +9256,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 			cpu_load_update_idle(rq);
 			rq_unlock_irq(rq, &rf);
 
-			rebalance_domains(rq, CPU_IDLE);
+			update_blocked_averages(balance_cpu);
+			/*
+			 * This idle load balance softirq may have been
+			 * triggered only to update the blocked load and shares
+			 * of idle CPUs (which we have just done for
+			 * balance_cpu). In that case skip the actual balance.
+			 */
+			if (!in_nohz_stats_kick(this_cpu))
+				rebalance_domains(rq, idle);
 		}
 
 		if (time_after(next_balance, rq->next_balance)) {
@@ -9241,7 +9295,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline bool nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update)
 {
 	unsigned long now = jiffies;
 	struct sched_domain_shared *sds;
@@ -9249,7 +9303,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	int nr_busy, i, cpu = rq->cpu;
 	bool kick = false;
 
-	if (unlikely(rq->idle_balance))
+	if (unlikely(rq->idle_balance) && !only_update)
 		return false;
 
        /*
@@ -9266,6 +9320,9 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	if (likely(!atomic_read(&nohz.nr_cpus)))
 		return false;
 
+	if (only_update)
+		return time_after_eq(now, nohz.next_update);
+
 	if (time_before(now, nohz.next_balance))
 		return false;
 
@@ -9313,8 +9370,11 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	rcu_read_unlock();
 	return kick;
 }
+
 #else
+static inline bool in_nohz_stats_kick(int cpu) { return false; }
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update) { return false; }
 #endif
 
 /*
@@ -9336,7 +9396,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
 	nohz_idle_balance(this_rq, idle);
-	rebalance_domains(this_rq, idle);
+	update_blocked_averages(this_rq->cpu);
+	if (!in_nohz_stats_kick(this_rq->cpu))
+		rebalance_domains(this_rq, idle);
+#ifdef CONFIG_NO_HZ_COMMON
+	clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
+#endif
 }
 
 /*
@@ -9351,8 +9416,8 @@ void trigger_load_balance(struct rq *rq)
 	if (time_after_eq(jiffies, rq->next_balance))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ_COMMON
-	if (nohz_kick_needed(rq))
-		nohz_balancer_kick();
+	if (nohz_kick_needed(rq, false))
+		nohz_balancer_kick(false);
 #endif
 }
 
@@ -9927,6 +9992,7 @@ __init void init_sched_fair_class(void)
 
 #ifdef CONFIG_NO_HZ_COMMON
 	nohz.next_balance = jiffies;
+	nohz.next_update = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 #endif
 #endif /* SMP */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..a7e48b6ee68c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2014,6 +2014,7 @@ extern void cfs_bandwidth_usage_dec(void);
 enum rq_nohz_flag_bits {
 	NOHZ_TICK_STOPPED,
 	NOHZ_BALANCE_KICK,
+	NOHZ_STATS_KICK
 };
 
 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
-- 
2.14.1


* [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance
  2017-12-01 18:01 [PATCH v2 0/2] sched/fair: remote load updates for idle CPUs Brendan Jackman
  2017-12-01 18:01 ` [PATCH v2 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
@ 2017-12-01 18:01 ` Brendan Jackman
  2017-12-20 14:22   ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Brendan Jackman @ 2017-12-01 18:01 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	linux-kernel
  Cc: Ingo Molnar, Morten Rasmussen

We now have a NOHZ kick to avoid the load of idle CPUs becoming
stale. This is good, but it brings about CPU wakeups, which have an
energy cost. As an alternative to waking CPUs up to decay blocked
load, we can sometimes do it from newly-idle balance. If the
newly-idle balance is on a domain that covers all the currently
nohz-idle CPUs, we push the value of nohz.next_update into the
future. That means that if such newly-idle balances happen often
enough, we never need to wake up a CPU just to update load.

Since we're doing this new update inside a for_each_domain() loop, we
need to avoid doing multiple updates on the same CPU in the same
idle_balance(). A tick stamp is set on the rq in
update_blocked_averages() as a simple way to do this. Using a simple
jiffies-based timestamp, as opposed to the last_update_time of the
root cfs_rq's sched_avg, means we can do this without taking the rq
lock.
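
As an illustration of that stamp, here is a toy userspace model. It is
a sketch only, with time_after() reimplemented as the kernel defines
it and everything else simplified:

#include <stdbool.h>
#include <stdio.h>

static unsigned long jiffies;

struct rq {
	unsigned long last_blocked_load_update_tick;
};

/* time_after(a, b): true if a is after b, as defined by the kernel */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void update_blocked_averages(struct rq *rq)
{
	/* ...decay the blocked load here, then stamp the rq... */
	rq->last_blocked_load_update_tick = jiffies;
}

int main(void)
{
	struct rq rq = { .last_blocked_load_update_tick = 0 };

	jiffies = 1;
	if (time_after(jiffies, rq.last_blocked_load_update_tick))
		update_blocked_averages(&rq);	/* runs: the stamp was stale */

	if (time_after(jiffies, rq.last_blocked_load_update_tick))
		update_blocked_averages(&rq);
	else
		printf("second update skipped within the same jiffy\n");

	return 0;
}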

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 44 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  1 +
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75554f366fd3..715d2b9ea7e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5949,6 +5949,7 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
 		rq->last_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = jiffies;
 		rq->nohz_flags = 0;
 #endif
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f83e8f0d4f06..011dcfa1bb5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7343,6 +7343,9 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
+#ifdef CONFIG_NO_HZ_COMMON
+	rq->last_blocked_load_update_tick = jiffies;
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7402,6 +7405,9 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+#ifdef CONFIG_NO_HZ_COMMON
+	rq->last_blocked_load_update_tick = jiffies;
+#endif
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7896,6 +7902,15 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_NO_HZ_COMMON
+static struct {
+	cpumask_var_t idle_cpus_mask;
+	atomic_t nr_cpus;
+	unsigned long next_balance;     /* in jiffy units */
+	unsigned long next_update;     /* in jiffy units */
+} nohz ____cacheline_aligned;
+#endif
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -7913,6 +7928,29 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+#ifdef CONFIG_NO_HZ_COMMON
+	if (env->idle == CPU_NEWLY_IDLE) {
+		int cpu;
+
+		/* Update the stats of NOHZ idle CPUs in the sd */
+		for_each_cpu_and(cpu, sched_domain_span(env->sd),
+				 nohz.idle_cpus_mask) {
+			struct rq *rq = cpu_rq(cpu);
+
+			/* ... Unless we've already done so since the last tick */
+			if (time_after(jiffies, rq->last_blocked_load_update_tick))
+				update_blocked_averages(cpu);
+		}
+	}
+	/*
+	 * If we've just updated all of the NOHZ idle CPUs, then we can push
+	 * nohz.next_update further into the future, which prevents an
+	 * unnecessary wakeup for the nohz stats kick.
+	 */
+	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd)))
+		nohz.next_update = jiffies + LOAD_AVG_PERIOD;
+#endif
+
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -8931,12 +8969,6 @@ static inline int on_null_domain(struct rq *rq)
  *   needed, they will kick the idle load balancer, which then does idle
  *   load balancing for all the idle CPUs.
  */
-static struct {
-	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
-	unsigned long next_balance;     /* in jiffy units */
-	unsigned long next_update;     /* in jiffy units */
-} nohz ____cacheline_aligned;
 
 static inline int find_new_ilb(void)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7e48b6ee68c..40fa579a4649 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -696,6 +696,7 @@ struct rq {
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
 	unsigned long last_load_update_tick;
+	unsigned long last_blocked_load_update_tick;
 #endif /* CONFIG_SMP */
 	unsigned long nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
-- 
2.14.1


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-01 18:01 ` [PATCH v2 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
@ 2017-12-20 14:03   ` Peter Zijlstra
  2017-12-20 14:23     ` Vincent Guittot
  2017-12-20 14:09   ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-12-20 14:03 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
> @@ -8955,8 +8964,20 @@ static void nohz_balancer_kick(void)
>  	if (ilb_cpu >= nr_cpu_ids)
>  		return;
>  
> -	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
> +	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
> +		if (!only_update) {
> +			/*
> +			 * There's a pending/ongoing nohz kick/balance. If it's
> +			 * just for stats, convert it to a proper load balance.
> +			 */
> +			clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> +		}
>  		return;
> +	}
> +
> +	if (only_update)
> +		set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> +
>  	/*
>  	 * Use smp_send_reschedule() instead of resched_cpu().
>  	 * This way we generate a sched IPI on the target cpu which

This looks racy... if it's not we don't need atomic ops, if it is but is
still fine it needs a comment.


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-01 18:01 ` [PATCH v2 1/2] sched: force update of blocked load of idle cpus Brendan Jackman
  2017-12-20 14:03   ` Peter Zijlstra
@ 2017-12-20 14:09   ` Peter Zijlstra
  2017-12-20 14:27     ` Vincent Guittot
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-12-20 14:09 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:

> @@ -9210,7 +9256,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  			cpu_load_update_idle(rq);
>  			rq_unlock_irq(rq, &rf);
>  
> -			rebalance_domains(rq, CPU_IDLE);
> +			update_blocked_averages(balance_cpu);
> +			/*
> +			 * This idle load balance softirq may have been
> +			 * triggered only to update the blocked load and shares
> +			 * of idle CPUs (which we have just done for
> +			 * balance_cpu). In that case skip the actual balance.
> +			 */
> +			if (!in_nohz_stats_kick(this_cpu))
> +				rebalance_domains(rq, idle);
>  		}
>  
>  		if (time_after(next_balance, rq->next_balance)) {

> @@ -9336,7 +9396,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>  	 * and abort nohz_idle_balance altogether if we pull some load.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> -	rebalance_domains(this_rq, idle);
> +	update_blocked_averages(this_rq->cpu);
> +	if (!in_nohz_stats_kick(this_rq->cpu))
> +		rebalance_domains(this_rq, idle);
> +#ifdef CONFIG_NO_HZ_COMMON
> +	clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
> +#endif
>  }
>  
>  /*

You're doing the same thing to both (all) callsites of
rebalance_domains(), does that not suggest doing it inside and leaving
update_blocked_averages() where it is?


* Re: [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance
  2017-12-01 18:01 ` [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance Brendan Jackman
@ 2017-12-20 14:22   ` Peter Zijlstra
  2017-12-21  9:19     ` Dietmar Eggemann
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-12-20 14:22 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Fri, Dec 01, 2017 at 06:01:57PM +0000, Brendan Jackman wrote:
> @@ -7913,6 +7928,29 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	if (child && child->flags & SD_PREFER_SIBLING)
>  		prefer_sibling = 1;
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (env->idle == CPU_NEWLY_IDLE) {
> +		int cpu;
> +
> +		/* Update the stats of NOHZ idle CPUs in the sd */
> +		for_each_cpu_and(cpu, sched_domain_span(env->sd),
> +				 nohz.idle_cpus_mask) {
> +			struct rq *rq = cpu_rq(cpu);
> +
> +			/* ... Unless we've already done so since the last tick */
> +			if (time_after(jiffies, rq->last_blocked_load_update_tick))
> +				update_blocked_averages(cpu);
> +		}
> +	}
> +	/*
> +	 * If we've just updated all of the NOHZ idle CPUs, then we can push
> +	 * nohz.next_update further into the future, which prevents an
> +	 * unnecessary wakeup for the nohz stats kick.
> +	 */
> +	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd)))
> +		nohz.next_update = jiffies + LOAD_AVG_PERIOD;
> +#endif
> +
>  	load_idx = get_sd_load_idx(env->sd, env->idle);
>  
>  	do {

We're already going to be iterating all those CPUs through
update_sg_lb_stats(), why not push it all the way down there and avoid
the double iteration?


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 14:03   ` Peter Zijlstra
@ 2017-12-20 14:23     ` Vincent Guittot
  2017-12-20 15:01       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Guittot @ 2017-12-20 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On 20 December 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
>> @@ -8955,8 +8964,20 @@ static void nohz_balancer_kick(void)
>>       if (ilb_cpu >= nr_cpu_ids)
>>               return;
>>
>> -     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
>> +     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
>> +             if (!only_update) {
>> +                     /*
>> +                      * There's a pending/ongoing nohz kick/balance. If it's
>> +                      * just for stats, convert it to a proper load balance.
>> +                      */
>> +                     clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
>> +             }
>>               return;
>> +     }
>> +
>> +     if (only_update)
>> +             set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
>> +
>>       /*
>>        * Use smp_send_reschedule() instead of resched_cpu().
>>        * This way we generate a sched IPI on the target cpu which
>
> This looks racy... if it's not we don't need atomic ops, if it is but is
> still fine it needs a comment.

NOHZ_STATS_KICK modification is protected by test_and_set_bit(NOHZ_BALANCE_KICK).
You're right that we don't need atomic ops and __set_bit() is enough.


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 14:09   ` Peter Zijlstra
@ 2017-12-20 14:27     ` Vincent Guittot
  2017-12-21 10:01       ` Vincent Guittot
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Guittot @ 2017-12-20 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On 20 December 2017 at 15:09, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
>
>> @@ -9210,7 +9256,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>                       cpu_load_update_idle(rq);
>>                       rq_unlock_irq(rq, &rf);
>>
>> -                     rebalance_domains(rq, CPU_IDLE);
>> +                     update_blocked_averages(balance_cpu);
>> +                     /*
>> +                      * This idle load balance softirq may have been
>> +                      * triggered only to update the blocked load and shares
>> +                      * of idle CPUs (which we have just done for
>> +                      * balance_cpu). In that case skip the actual balance.
>> +                      */
>> +                     if (!in_nohz_stats_kick(this_cpu))
>> +                             rebalance_domains(rq, idle);
>>               }
>>
>>               if (time_after(next_balance, rq->next_balance)) {
>
>> @@ -9336,7 +9396,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>>        * and abort nohz_idle_balance altogether if we pull some load.
>>        */
>>       nohz_idle_balance(this_rq, idle);
>> -     rebalance_domains(this_rq, idle);
>> +     update_blocked_averages(this_rq->cpu);
>> +     if (!in_nohz_stats_kick(this_rq->cpu))
>> +             rebalance_domains(this_rq, idle);
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +     clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
>> +#endif
>>  }
>>
>>  /*
>
> You're doing the same thing to both (all) callsites of
> rebalance_domains(), does that not suggest doing it inside and leaving
> update_blocked_averages() where it is?

The goal of moving update_blocked_averages() outside rebalance_domains()
is to avoid adding a new parameter, or a special cpu_idle_type value, to
rebalance_domains() just to abort the rebalance sequence right after
updating the blocked load.

>
>


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 14:23     ` Vincent Guittot
@ 2017-12-20 15:01       ` Peter Zijlstra
  2017-12-20 15:05         ` Peter Zijlstra
  2017-12-21  7:59         ` Vincent Guittot
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-12-20 15:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Wed, Dec 20, 2017 at 03:23:01PM +0100, Vincent Guittot wrote:
> On 20 December 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
> >> @@ -8955,8 +8964,20 @@ static void nohz_balancer_kick(void)
> >>       if (ilb_cpu >= nr_cpu_ids)
> >>               return;
> >>
> >> -     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
> >> +     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
> >> +             if (!only_update) {
> >> +                     /*
> >> +                      * There's a pending/ongoing nohz kick/balance. If it's
> >> +                      * just for stats, convert it to a proper load balance.
> >> +                      */
> >> +                     clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> >> +             }
> >>               return;
> >> +     }
> >> +
> >> +     if (only_update)
> >> +             set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> >> +
> >>       /*
> >>        * Use smp_send_reschedule() instead of resched_cpu().
> >>        * This way we generate a sched IPI on the target cpu which
> >
> > This looks racy... if it's not we don't need atomic ops, if it is but is
> > still fine it needs a comment.
> 
> NOHZ_STATS_KICK modification is protected by test_and_set_bit(NOHZ_BALANCE_KICK).
> You're right that we don't need atomic ops and __set_bit() is enough.

Well, you shouldn't mix atomic and non-atomic ops to the same word,
that's asking for trouble.

But why don't you do something like:

nohz_kick()

	flags = NOHZ_STAT;
	if (!only_update)
		flags |= NOHZ_BALANCE;

	atomic_long_or(flags, &nohz_flags(cpu));


nohz_idle_balance()

	unsigned long do_flags = atomic_long_fetch_andnot(NOHZ_BALANCE|NOHZ_STAT, &nohz_flags(cpu));

	if (do_flags & NOHZ_STAT)
		update_blocked_stuff();

	if (do_flags & NOHZ_BALANCE)
		rebalance_domains();

That way it's far more readable.
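
For reference, a compilable userspace rendering of that sketch, with
C11 atomics standing in for the kernel's atomic_long_* helpers (the
flag values here are illustrative):

#include <stdatomic.h>
#include <stdio.h>

#define NOHZ_STAT	(1UL << 0)
#define NOHZ_BALANCE	(1UL << 1)

static atomic_ulong nohz_flags_word;

static void nohz_kick(int only_update)
{
	unsigned long flags = NOHZ_STAT;

	if (!only_update)
		flags |= NOHZ_BALANCE;

	/* One atomic OR: a balance kick can never be masked by a
	 * concurrent stats-only kick. */
	atomic_fetch_or(&nohz_flags_word, flags);
}

static void nohz_idle_balance(void)
{
	/* Atomically fetch and clear both flags, then act on them. */
	unsigned long do_flags =
		atomic_fetch_and(&nohz_flags_word, ~(NOHZ_STAT | NOHZ_BALANCE));

	if (do_flags & NOHZ_STAT)
		printf("update_blocked_averages()\n");
	if (do_flags & NOHZ_BALANCE)
		printf("rebalance_domains()\n");
}

int main(void)
{
	nohz_kick(1);		/* stats-only kick */
	nohz_kick(0);		/* upgraded to a full balance */
	nohz_idle_balance();	/* does both the update and the balance */
	return 0;
}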


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 15:01       ` Peter Zijlstra
@ 2017-12-20 15:05         ` Peter Zijlstra
  2017-12-21  7:59         ` Vincent Guittot
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-12-20 15:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Wed, Dec 20, 2017 at 04:01:10PM +0100, Peter Zijlstra wrote:
> Well, you shouldn't mix atomic and non-atomic ops to the same word,
> that's asking for trouble.
> 
> But why don't you do something like:
> 
> nohz_kick()
> 
> 	flags = NOHZ_STAT;
> 	if (!only_update)
> 		flags |= NOHZ_BALANCE;
> 
> 	atomic_long_or(flags, &nohz_flags(cpu));
> 
> 
> nohz_idle_balance()
> 
> 	unsigned long do_flags = atomic_long_fetch_andnot(NOHZ_BALANCE|NOHZ_STAT, &nohz_flags(cpu));
> 
> 	if (do_flags & NOHZ_STAT)
> 		update_blocked_stuff();
> 
> 	if (do_flags & NOHZ_BALANCE)
> 		rebalance_domains();
> 
> That way it's far more readable.

We could use atomic_t too; there's not that many flags in there, the
only reason it's long is because of that bitmap crud.


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 15:01       ` Peter Zijlstra
  2017-12-20 15:05         ` Peter Zijlstra
@ 2017-12-21  7:59         ` Vincent Guittot
  1 sibling, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2017-12-21  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On Wednesday 20 Dec 2017 at 16:01:10 (+0100), Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 03:23:01PM +0100, Vincent Guittot wrote:
> > On 20 December 2017 at 15:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
> > >> @@ -8955,8 +8964,20 @@ static void nohz_balancer_kick(void)
> > >>       if (ilb_cpu >= nr_cpu_ids)
> > >>               return;
> > >>
> > >> -     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
> > >> +     if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) {
> > >> +             if (!only_update) {
> > >> +                     /*
> > >> +                      * There's a pending/ongoing nohz kick/balance. If it's
> > >> +                      * just for stats, convert it to a proper load balance.
> > >> +                      */
> > >> +                     clear_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> > >> +             }
> > >>               return;
> > >> +     }
> > >> +
> > >> +     if (only_update)
> > >> +             set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
> > >> +
> > >>       /*
> > >>        * Use smp_send_reschedule() instead of resched_cpu().
> > >>        * This way we generate a sched IPI on the target cpu which
> > >
> > > This looks racy... if it's not we don't need atomic ops, if it is but is
> > > still fine it needs a comment.
> > 
> > NOHZ_STATS_KICK modification is protected by test_and_set_bit(NOHZ_BALANCE_KICK).
> > You're right that we don't need atomic ops and __set_bit() is enough.
> 
> Well, you shouldn't mix atomic and non-atomic ops to the same word,
> that's asking for trouble.

In fact, the atomic operation is needed; I forgot that set/clear of NOHZ_TICK_STOPPED
can happen simultaneously on the ilb_cpu, so we must keep
set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));

As a comment we can add:
 /*
  * Update of NOHZ_STATS_KICK itself is protected by test_and_set_bit(), but
  * clear/set of NOHZ_TICK_STOPPED can happen simultaneously, so we need an
  * atomic operation
  */
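
To make the hazard concrete, here is a toy userspace illustration of
the lost update that a plain (non-atomic) read-modify-write would
allow; the racing interleaving is simulated sequentially and the bit
positions are illustrative:

#include <stdio.h>

#define NOHZ_TICK_STOPPED	(1UL << 0)
#define NOHZ_STATS_KICK		(1UL << 2)

static unsigned long nohz_flags = NOHZ_TICK_STOPPED;

int main(void)
{
	/* CPU A starts a non-atomic __set_bit(NOHZ_STATS_KICK): */
	unsigned long snapshot = nohz_flags;		/* A reads the word */

	nohz_flags &= ~NOHZ_TICK_STOPPED;		/* B clears its bit */

	nohz_flags = snapshot | NOHZ_STATS_KICK;	/* A writes back */

	/* B's concurrent clear has been lost: */
	printf("NOHZ_TICK_STOPPED is %s\n",
	       (nohz_flags & NOHZ_TICK_STOPPED) ? "set again (lost update)"
						: "clear");
	return 0;
}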

> 
> But why don't you do something like:
> 
> nohz_kick()
> 
> 	flags = NOHZ_STAT;
> 	if (!only_update)
> 		flags |= NOHZ_BALANCE;
> 
> 	atomic_long_or(flags, &nohz_cpu(cpu));
> 
> 
> nohz_idle_balance()
> 
> 	unsigned long do_flags = atomic_long_fetch_andnot(NOHZ_BALANCE|NOHZ_STAT, &nohz_flags(cpu));
> 
> 	if (do_flags & NOHZ_STAT)
> 		update_blocked_stuff();
> 
> 	if (do_flags & NOHZ_BALANCE)
> 		rebalance_domains();
> 
> That way its far more readable.
> 


* Re: [PATCH v2 2/2] sched/fair: Update blocked load from newly idle balance
  2017-12-20 14:22   ` Peter Zijlstra
@ 2017-12-21  9:19     ` Dietmar Eggemann
  0 siblings, 0 replies; 13+ messages in thread
From: Dietmar Eggemann @ 2017-12-21  9:19 UTC (permalink / raw)
  To: Peter Zijlstra, Brendan Jackman
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Ingo Molnar,
	Morten Rasmussen

On 12/20/2017 03:22 PM, Peter Zijlstra wrote:
> On Fri, Dec 01, 2017 at 06:01:57PM +0000, Brendan Jackman wrote:
>> @@ -7913,6 +7928,29 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>   	if (child && child->flags & SD_PREFER_SIBLING)
>>   		prefer_sibling = 1;
>>   
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	if (env->idle == CPU_NEWLY_IDLE) {
>> +		int cpu;
>> +
>> +		/* Update the stats of NOHZ idle CPUs in the sd */
>> +		for_each_cpu_and(cpu, sched_domain_span(env->sd),
>> +				 nohz.idle_cpus_mask) {
>> +			struct rq *rq = cpu_rq(cpu);
>> +
>> +			/* ... Unless we've already done so since the last tick */
>> +			if (time_after(jiffies, rq->last_blocked_load_update_tick))
>> +				update_blocked_averages(cpu);
>> +		}
>> +	}
>> +	/*
>> +	 * If we've just updated all of the NOHZ idle CPUs, then we can push
>> +	 * nohz.next_update further into the future, which prevents an
>> +	 * unnecessary wakeup for the nohz stats kick.
>> +	 */
>> +	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd)))
>> +		nohz.next_update = jiffies + LOAD_AVG_PERIOD;
>> +#endif
>> +
>>   	load_idx = get_sd_load_idx(env->sd, env->idle);
>>   
>>   	do {
> 
> We're already going to be iterating all those CPUs through
> update_sg_lb_stats(), why not push it all the way down there and avoid
> the double iteration?

Brendan doesn't work for ARM anymore, although he might still follow this
discussion.

So you think that we can do the time_after(jiffies,
rq->last_blocked_load_update_tick) check in update_sg_lb_stats(), and
possibly the update_blocked_averages() call for all the CPUs in the sd,
and then only update nohz.next_update in update_sd_lb_stats()? Let me
try this ...


* Re: [PATCH v2 1/2] sched: force update of blocked load of idle cpus
  2017-12-20 14:27     ` Vincent Guittot
@ 2017-12-21 10:01       ` Vincent Guittot
  0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2017-12-21 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Jackman, Dietmar Eggemann, Ingo Molnar, linux-kernel,
	Ingo Molnar, Morten Rasmussen

On 20 December 2017 at 15:27, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> On 20 December 2017 at 15:09, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Dec 01, 2017 at 06:01:56PM +0000, Brendan Jackman wrote:
>>
>>> @@ -9210,7 +9256,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>                       cpu_load_update_idle(rq);
>>>                       rq_unlock_irq(rq, &rf);
>>>
>>> -                     rebalance_domains(rq, CPU_IDLE);
>>> +                     update_blocked_averages(balance_cpu);
>>> +                     /*
>>> +                      * This idle load balance softirq may have been
>>> +                      * triggered only to update the blocked load and shares
>>> +                      * of idle CPUs (which we have just done for
>>> +                      * balance_cpu). In that case skip the actual balance.
>>> +                      */
>>> +                     if (!in_nohz_stats_kick(this_cpu))
>>> +                             rebalance_domains(rq, idle);
>>>               }
>>>
>>>               if (time_after(next_balance, rq->next_balance)) {
>>
>>> @@ -9336,7 +9396,12 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>>>        * and abort nohz_idle_balance altogether if we pull some load.
>>>        */
>>>       nohz_idle_balance(this_rq, idle);
>>> -     rebalance_domains(this_rq, idle);
>>> +     update_blocked_averages(this_rq->cpu);
>>> +     if (!in_nohz_stats_kick(this_rq->cpu))
>>> +             rebalance_domains(this_rq, idle);
>>> +#ifdef CONFIG_NO_HZ_COMMON
>>> +     clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
>>> +#endif
>>>  }
>>>
>>>  /*
>>
>> You're doing the same thing to both (all) callsites of
>> rebalance_domains(), does that not suggest doing it inside and leaving
>> update_blocked_averages() where it is?
>
> The goal of moving update_blocked_averages() outside rebalance_domains()
> is to avoid adding a new parameter, or a special cpu_idle_type value, to
> rebalance_domains() just to abort the rebalance sequence right after
> updating the blocked load.

Peter,
Is the reason above acceptable, or do you prefer that
update_blocked_averages() stay in rebalance_domains()?

>
>>
>>

