* [PATCH 0/2] avoid spurious blocked load update
@ 2021-11-12  9:58 Vincent Guittot
  2021-11-12  9:58 ` [PATCH 1/2] sched/fair: skip newidle update stats Vincent Guittot
  2021-11-16 23:48 ` [PATCH 0/2] avoid spurious blocked load update Tim Chen
  0 siblings, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-11-12  9:58 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: joel, Vincent Guittot

This patchset is a follow-up of:
https://lore.kernel.org/lkml/20211019123537.17146-1-vincent.guittot@linaro.org/

It ensures that a newly idle load balance will not kick the update of
blocked load if it skips the load balance because avg_idle is too short.
It also makes sure that rq->next_balance doesn't go into the past when
updated.
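
Patch 2/2 itself is not quoted in this thread; as an illustration only,
a minimal sketch of the idea, assuming the clamp lands in
update_next_balance() (the actual patch may place it differently):

	static inline void
	update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
	{
		unsigned long interval, next;

		/* used by newly idle balance, so cpu_busy = 0 */
		interval = get_sd_balance_interval(sd, 0);
		next = sd->last_balance + interval;

		/*
		 * Sketch: a stale sd->last_balance can put "next" before
		 * jiffies, which would move rq->next_balance into the
		 * past; clamp it to now.
		 */
		if (time_before(next, jiffies))
			next = jiffies;

		if (time_after(*next_balance, next))
			*next_balance = next;
	}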

Tim Chen (1):
  sched: Fix rq->next_balance time updated to earlier than current time

Vincent Guittot (1):
  sched/fair: skip newidle update stats

 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

-- 
2.17.1



* [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12  9:58 [PATCH 0/2] avoid spurious blocked load update Vincent Guittot
@ 2021-11-12  9:58 ` Vincent Guittot
  2021-11-12 14:29   ` Peter Zijlstra
  2021-11-12 14:55   ` Peter Zijlstra
  2021-11-16 23:48 ` [PATCH 0/2] avoid spurious blocked load update Tim Chen
  1 sibling, 2 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-11-12  9:58 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: joel, Vincent Guittot

In case we skip the newly idle LB entirely, or abort it because we are
going to exceed avg_idle, we have to make sure not to start an update
of the blocked load when entering idle.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13950beb01a2..a162b0ec8963 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	int pulled_task = 0, early_stop = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		if (sd) {
 			update_next_balance(sd, &next_balance);
+
+			/*
+			 * We skip new idle LB because there is not enough
+			 * time before next wake up. Make sure that we will
+			 * not kick NOHZ_NEWILB_KICK
+			 */
+			early_stop = 1;
+		}
 		rcu_read_unlock();
 
 		goto out;
@@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		update_next_balance(sd, &next_balance);
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+			early_stop = 1;
 			break;
+		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
@@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
-	else
+	else if (!early_stop)
 		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
-- 
2.17.1



* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12  9:58 ` [PATCH 1/2] sched/fair: skip newidle update stats Vincent Guittot
@ 2021-11-12 14:29   ` Peter Zijlstra
  2021-11-12 14:47     ` Vincent Guittot
  2021-11-12 14:55   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-11-12 14:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel

On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> In case we skip the newly idle LB entirely, or abort it because we are
> going to exceed avg_idle, we have to make sure not to start an update
> of the blocked load when entering idle.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 13950beb01a2..a162b0ec8963 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	int this_cpu = this_rq->cpu;
>  	u64 t0, t1, curr_cost = 0;
>  	struct sched_domain *sd;
> -	int pulled_task = 0;
> +	int pulled_task = 0, early_stop = 0;
>  
>  	update_misfit_status(NULL, this_rq);
>  
> @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		if (sd) {
>  			update_next_balance(sd, &next_balance);
> +
> +			/*
> +			 * We skip new idle LB because there is not enough
> +			 * time before next wake up. Make sure that we will
> +			 * not kick NOHZ_NEWILB_KICK
> +			 */
> +			early_stop = 1;
> +		}
>  		rcu_read_unlock();
>  
>  		goto out;
> @@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  		update_next_balance(sd, &next_balance);
>  
> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> +			early_stop = 1;
>  			break;
> +		}
>  
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  
> @@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (pulled_task)
>  		this_rq->idle_stamp = 0;
> -	else
> +	else if (!early_stop)
>  		nohz_newidle_balance(this_rq);
>  
>  	rq_repin_lock(this_rq, rf);

Urgh code flow is a mess... Let me see if I can fix some of that.

Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
with this on?
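
For reference, the test in question sits at the top of
nohz_newidle_balance(); roughly the v5.15-era code (a reconstruction,
so details may differ slightly):

	static void nohz_newidle_balance(struct rq *this_rq)
	{
		int this_cpu = this_rq->cpu;

		/* This CPU doesn't want to be disturbed by housekeeping */
		if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
			return;

		/* Will wake up very soon. No time for doing anything else */
		if (this_rq->avg_idle < sysctl_sched_migration_cost)
			return;

		/* Don't need to update blocked load of idle CPUs */
		if (!READ_ONCE(nohz.has_blocked) ||
		    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
			return;

		/*
		 * Set the need to trigger ILB in order to update blocked
		 * load before entering idle state.
		 */
		atomic_or(NOHZ_NEWILB_KICK, nohz_flags(this_cpu));
	}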


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12 14:29   ` Peter Zijlstra
@ 2021-11-12 14:47     ` Vincent Guittot
  2021-11-12 15:29       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2021-11-12 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel

On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > In case we skip the newly idle LB entirely, or abort it because we are
> > going to exceed avg_idle, we have to make sure not to start an update
> > of the blocked load when entering idle.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 13950beb01a2..a162b0ec8963 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >       int this_cpu = this_rq->cpu;
> >       u64 t0, t1, curr_cost = 0;
> >       struct sched_domain *sd;
> > -     int pulled_task = 0;
> > +     int pulled_task = 0, early_stop = 0;
> >
> >       update_misfit_status(NULL, this_rq);
> >
> > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >       if (!READ_ONCE(this_rq->rd->overload) ||
> >           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> >
> > -             if (sd)
> > +             if (sd) {
> >                       update_next_balance(sd, &next_balance);
> > +
> > +                     /*
> > +                      * We skip new idle LB because there is not enough
> > +                      * time before next wake up. Make sure that we will
> > +                      * not kick NOHZ_NEWILB_KICK
> > +                      */
> > +                     early_stop = 1;
> > +             }
> >               rcu_read_unlock();
> >
> >               goto out;
> > @@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >               update_next_balance(sd, &next_balance);
> >
> > -             if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> > +             if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > +                     early_stop = 1;
> >                       break;
> > +             }
> >
> >               if (sd->flags & SD_BALANCE_NEWIDLE) {
> >
> > @@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >       if (pulled_task)
> >               this_rq->idle_stamp = 0;
> > -     else
> > +     else if (!early_stop)
> >               nohz_newidle_balance(this_rq);
> >
> >       rq_repin_lock(this_rq, rf);
>
> Urgh code flow is a mess... Let me see if I can fix some of that.

yeah, I haven't found a better way

>
> Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> with this on?

This test still covers cases with a short newly idle balance. It errs on
the conservative side: people have never complained that the update of
the blocked load average of idle CPUs doesn't happen often enough; it's
most often the opposite.


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12  9:58 ` [PATCH 1/2] sched/fair: skip newidle update stats Vincent Guittot
  2021-11-12 14:29   ` Peter Zijlstra
@ 2021-11-12 14:55   ` Peter Zijlstra
  2021-11-12 14:55     ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-11-12 14:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel



Subject: sched/fair: Simplify newidle_balance()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 12 15:46:26 CET 2021

Move rq_{un,re}pin_lock() next to raw_spin_rq_{un,}lock().

Remove all rcu_read_{,un}lock(), since we have preempt/irqs disabled
over the whole function and those hold off RCU (again).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10884,15 +10884,6 @@ static int newidle_balance(struct rq *th
 	if (!cpu_active(this_cpu))
 		return 0;
 
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
-
-	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
@@ -10908,18 +10899,22 @@ static int newidle_balance(struct rq *th
 			 */
 			early_stop = 1;
 		}
-		rcu_read_unlock();
 
 		goto out;
 	}
-	rcu_read_unlock();
 
+	/*
+	 * This is OK, because current is on_cpu, which avoids it being picked
+	 * for load-balance and preemption/IRQs are still disabled avoiding
+	 * further scheduler activity on it and we're being very careful to
+	 * re-start the picking loop.
+	 */
+	rq_unpin_lock(this_rq, rf);
 	raw_spin_rq_unlock(this_rq);
 
 	t0 = sched_clock_cpu(this_cpu);
 	update_blocked_averages(this_cpu);
 
-	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 domain_cost;
@@ -10953,9 +10948,9 @@ static int newidle_balance(struct rq *th
 		    this_rq->ttwu_pending)
 			break;
 	}
-	rcu_read_unlock();
 
 	raw_spin_rq_lock(this_rq);
+	rq_repin_lock(this_rq, rf);
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -10982,8 +10977,6 @@ static int newidle_balance(struct rq *th
 	else if (!early_stop)
 		nohz_newidle_balance(this_rq);
 
-	rq_repin_lock(this_rq, rf);
-
 	return pulled_task;
 }
 


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12 14:55   ` Peter Zijlstra
@ 2021-11-12 14:55     ` Peter Zijlstra
  2021-11-12 16:05       ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-11-12 14:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel


Subject: sched/fair: Reflow newidle_balance()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 12 15:46:08 CET 2021

The control flow in newidle_balance() is a little convoluted, attempt
simplification.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10858,10 +10858,10 @@ static inline void nohz_newidle_balance(
 static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
+	int pulled_task = 0, timeout = 0;
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0, early_stop = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10889,17 +10889,9 @@ static int newidle_balance(struct rq *th
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd) {
+		if (sd)
 			update_next_balance(sd, &next_balance);
 
-			/*
-			 * We skip new idle LB because there is not enough
-			 * time before next wake up. Make sure that we will
-			 * not kick NOHZ_NEWILB_KICK
-			 */
-			early_stop = 1;
-		}
-
 		goto out;
 	}
 
@@ -10922,7 +10914,7 @@ static int newidle_balance(struct rq *th
 		update_next_balance(sd, &next_balance);
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
-			early_stop = 1;
+			timeout = 1;
 			break;
 		}
 
@@ -10967,6 +10959,11 @@ static int newidle_balance(struct rq *th
 	if (this_rq->nr_running != this_rq->cfs.h_nr_running)
 		pulled_task = -1;
 
+	if (pulled_task || timeout)
+		goto out;
+
+	nohz_newidle_balance(this_rq);
+
 out:
 	/* Move the next balance forward */
 	if (time_after(this_rq->next_balance, next_balance))
@@ -10974,8 +10971,6 @@ static int newidle_balance(struct rq *th
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
-	else if (!early_stop)
-		nohz_newidle_balance(this_rq);
 
 	return pulled_task;
 }


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12 14:47     ` Vincent Guittot
@ 2021-11-12 15:29       ` Peter Zijlstra
  2021-11-12 16:00         ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-11-12 15:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel

On Fri, Nov 12, 2021 at 03:47:21PM +0100, Vincent Guittot wrote:
> On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > > In case we skip the newly idle LB entirely, or abort it because we are
> > > going to exceed avg_idle, we have to make sure not to start an update
> > > of the blocked load when entering idle.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 18 ++++++++++++++----
> > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 13950beb01a2..a162b0ec8963 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >       int this_cpu = this_rq->cpu;
> > >       u64 t0, t1, curr_cost = 0;
> > >       struct sched_domain *sd;
> > > -     int pulled_task = 0;
> > > +     int pulled_task = 0, early_stop = 0;
> > >
> > >       update_misfit_status(NULL, this_rq);
> > >
> > > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >       if (!READ_ONCE(this_rq->rd->overload) ||
> > >           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > >
> > > -             if (sd)
> > > +             if (sd) {
> > >                       update_next_balance(sd, &next_balance);
> > > +
> > > +                     /*
> > > +                      * We skip new idle LB because there is not enough
> > > +                      * time before next wake up. Make sure that we will
> > > +                      * not kick NOHZ_NEWILB_KICK
> > > +                      */
> > > +                     early_stop = 1;
> > > +             }
> > >               rcu_read_unlock();
> > >
> > >               goto out;

> > Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> > with this on?
> 
> This test still covers cases with a short newly idle balance. It errs on
> the conservative side: people have never complained that the update of
> the blocked load average of idle CPUs doesn't happen often enough; it's
> most often the opposite.

Well, per commit c5b0a7eefc70 ("sched/fair: Remove
sysctl_sched_migration_cost condition") combined with the above change,
we no longer call nohz_newidle_balance() in exactly that condition,
right?

Or are we worried about that !overload case?
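
For context, a reconstructed sketch of what commit c5b0a7eefc70 changed
in the early bail-out of newidle_balance(); an approximation, not a
quote of the commit:

	/* Before c5b0a7eefc70: bail out early on a short avg_idle */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !READ_ONCE(this_rq->rd->overload)) {
		...
		goto out;
	}

	/* After c5b0a7eefc70, as visible in the hunk quoted above */
	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
		...
		goto out;
	}

With the sysctl_sched_migration_cost check gone from newidle_balance(),
the remaining copy of it inside nohz_newidle_balance() is the test under
discussion.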


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12 15:29       ` Peter Zijlstra
@ 2021-11-12 16:00         ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-11-12 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel

On Fri, 12 Nov 2021 at 16:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 12, 2021 at 03:47:21PM +0100, Vincent Guittot wrote:
> > On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > > > In case we skip the newly idle LB entirely, or abort it because we are
> > > > going to exceed avg_idle, we have to make sure not to start an update
> > > > of the blocked load when entering idle.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > >  kernel/sched/fair.c | 18 ++++++++++++++----
> > > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 13950beb01a2..a162b0ec8963 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > >       int this_cpu = this_rq->cpu;
> > > >       u64 t0, t1, curr_cost = 0;
> > > >       struct sched_domain *sd;
> > > > -     int pulled_task = 0;
> > > > +     int pulled_task = 0, early_stop = 0;
> > > >
> > > >       update_misfit_status(NULL, this_rq);
> > > >
> > > > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > >       if (!READ_ONCE(this_rq->rd->overload) ||
> > > >           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > > >
> > > > -             if (sd)
> > > > +             if (sd) {
> > > >                       update_next_balance(sd, &next_balance);
> > > > +
> > > > +                     /*
> > > > +                      * We skip new idle LB because there is not enough
> > > > +                      * time before next wake up. Make sure that we will
> > > > +                      * not kick NOHZ_NEWILB_KICK
> > > > +                      */
> > > > +                     early_stop = 1;
> > > > +             }
> > > >               rcu_read_unlock();
> > > >
> > > >               goto out;
>
> > > Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> > > with this on?
> >
> > This test still covers cases with a short newly idle balance. It errs on
> > the conservative side: people have never complained that the update of
> > the blocked load average of idle CPUs doesn't happen often enough; it's
> > most often the opposite.
>
> Well, per commit c5b0a7eefc70 ("sched/fair: Remove
> sysctl_sched_migration_cost condition") combined with the above change,
> we no longer call nohz_newidle_balance() in exactly that condition,
> right?
>
> Or are we worried about that !overload case?

We can do a complete newly idle LB and still have this_rq->avg_idle <
sysctl_sched_migration_cost. In this case, the condition will continue
to skip the update of the other idle CPUs.


* Re: [PATCH 1/2] sched/fair: skip newidle update stats
  2021-11-12 14:55     ` Peter Zijlstra
@ 2021-11-12 16:05       ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-11-12 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen, joel

On Fri, 12 Nov 2021 at 15:55, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Subject: sched/fair: Reflow newidle_balance()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 12 15:46:08 CET 2021
>
> The control flow in newidle_balance() is a little convoluted, attempt
> simplification.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c |   21 ++++++++-------------
>  1 file changed, 8 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10858,10 +10858,10 @@ static inline void nohz_newidle_balance(
>  static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  {
>         unsigned long next_balance = jiffies + HZ;
> +       int pulled_task = 0, timeout = 0;
>         int this_cpu = this_rq->cpu;
>         u64 t0, t1, curr_cost = 0;
>         struct sched_domain *sd;
> -       int pulled_task = 0, early_stop = 0;
>
>         update_misfit_status(NULL, this_rq);
>
> @@ -10889,17 +10889,9 @@ static int newidle_balance(struct rq *th
>         if (!READ_ONCE(this_rq->rd->overload) ||
>             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> -               if (sd) {
> +               if (sd)
>                         update_next_balance(sd, &next_balance);
>
> -                       /*
> -                        * We skip new idle LB because there is not enough
> -                        * time before next wake up. Make sure that we will
> -                        * not kick NOHZ_NEWILB_KICK
> -                        */
> -                       early_stop = 1;
> -               }
> -
>                 goto out;
>         }
>
> @@ -10922,7 +10914,7 @@ static int newidle_balance(struct rq *th
>                 update_next_balance(sd, &next_balance);
>
>                 if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> -                       early_stop = 1;
> +                       timeout = 1;
>                         break;
>                 }
>
> @@ -10967,6 +10959,11 @@ static int newidle_balance(struct rq *th
>         if (this_rq->nr_running != this_rq->cfs.h_nr_running)
>                 pulled_task = -1;
>
> +       if (pulled_task || timeout)
> +               goto out;
> +
> +       nohz_newidle_balance(this_rq);

maybe:

       if (!pulled_task && !timeout)
               nohz_newidle_balance(this_rq);

> +
>  out:
>         /* Move the next balance forward */
>         if (time_after(this_rq->next_balance, next_balance))
> @@ -10974,8 +10971,6 @@ static int newidle_balance(struct rq *th
>
>         if (pulled_task)
>                 this_rq->idle_stamp = 0;
> -       else if (!early_stop)
> -               nohz_newidle_balance(this_rq);
>
>         return pulled_task;
>  }


* Re: [PATCH 0/2] avoid spurious blocked load update
  2021-11-12  9:58 [PATCH 0/2] avoid spurious blocked load update Vincent Guittot
  2021-11-12  9:58 ` [PATCH 1/2] sched/fair: skip newidle update stats Vincent Guittot
@ 2021-11-16 23:48 ` Tim Chen
  2021-11-18 14:40   ` Vincent Guittot
  1 sibling, 1 reply; 11+ messages in thread
From: Tim Chen @ 2021-11-16 23:48 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, linux-kernel
  Cc: joel

On Fri, 2021-11-12 at 10:58 +0100, Vincent Guittot wrote:
> This patchset is a follow-up of:
> https://lore.kernel.org/lkml/20211019123537.17146-1-vincent.guittot@linaro.org/
> 
> It ensures that a newly idle load balance will not kick the update of
> blocked load if it skips the load balance because avg_idle is too
> short.
> It also makes sure that rq->next_balance doesn't go into the past when
> updated.
> 
> Tim Chen (1):
>   sched: Fix rq->next_balance time updated to earlier than current time
> 
> Vincent Guittot (1):
>   sched/fair: skip newidle update stats
> 
>  kernel/sched/fair.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 

Vincent,

Got some data back from the benchmark team.
To my surprise, the skip_new_idle_update_stats patch
actually makes things a little worse.

                                        Relative Performance
                                        (higher better)
5.15 rc4 vanilla (cgroup disabled)      100%
5.15 rc4 vanilla (cgroup enabled)       96%
patch v2                                96%
patch v3                                96%
patch v3
        +skip_new_idle_update_stats     93.7%
patch v3
        +skip_new_idle_update_stats
        +Fix rq->next_balance_time      93.7%

The CPU utilization is actually similar to that with just the v3 patch.
In both cases it is:
81% user
12% kernel
2%  idle
5%  waiting for IO


A profile of key functions in load balancing shows a little more CPU
utilization, which is unexpected as we are cutting newidle_balance()
short.

patch v3
     0.56%          [k] __update_load_avg_cfs_rq
     0.51%          [k] update_load_avg
     0.39%          [k] update_blocked_averages
     0.36%          [k] __update_load_avg_se
     0.05%          [k] newidle_balance

patch v3 + skip_new_idle_update_stats
     0.58%          [k] __update_load_avg_cfs_rq
     0.53%          [k] update_load_avg
     0.40%          [k] update_blocked_averages
     0.37%          [k] __update_load_avg_se
     0.06%          [k] newidle_balance

Context switch frequency is lower by 4% with the skip_new_idle_update_stats
patch.  

Thanks.

Tim



* Re: [PATCH 0/2] avoid spurious blocked load update
  2021-11-16 23:48 ` [PATCH 0/2] avoid spurious blocked load update Tim Chen
@ 2021-11-18 14:40   ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2021-11-18 14:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, joel

On Wed, 17 Nov 2021 at 00:48, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2021-11-12 at 10:58 +0100, Vincent Guittot wrote:
> > This patchset is a follow-up of:
> > https://lore.kernel.org/lkml/20211019123537.17146-1-vincent.guittot@linaro.org/
> >
> > It ensures that a newly idle load balance will not kick the update of
> > blocked load if it skips the load balance because avg_idle is too
> > short.
> > It also makes sure that rq->next_balance doesn't go into the past when
> > updated.
> >
> > Tim Chen (1):
> >   sched: Fix rq->next_balance time updated to earlier than current time
> >
> > Vincent Guittot (1):
> >   sched/fair: skip newidle update stats
> >
> >  kernel/sched/fair.c | 25 ++++++++++++++++++++-----
> >  1 file changed, 20 insertions(+), 5 deletions(-)
> >
>
> Vincent,
>
> Got some data back from the benchmark team.
> To my surprise, the skip_new_idle_update_stats patch
> actually makes things a little worse.
>
>                                         Relative Performance
>                                         (higher better)
> 5.15 rc4 vanilla (cgroup disabled)      100%
> 5.15 rc4 vanilla (cgroup enabled)       96%
> patch v2                                96%
> patch v3                                96%
> patch v3
>         +skip_new_idle_update_stats     93.7%
> patch v3
>         +skip_new_idle_update_stats
>         +Fix rq->next_balance_time      93.7%
>

Yeah, that looks surprising.
The skip_new_idle_update_stats patch only ensures that the CPU will not
run an update of the blocked load average of idle CPUs when entering
idle (outside newidle_balance()) if it thinks that a task is about to
wake up soon.
The end result is that we run
_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE) less often
before entering idle.
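
For reference, that call is issued from the idle path; roughly the
v5.15-era nohz_run_idle_balance() (a reconstruction, details may vary):

	/*
	 * Check if we need to run the ILB for updating blocked load
	 * before entering idle state. Called from do_idle().
	 */
	void nohz_run_idle_balance(int cpu)
	{
		unsigned int flags;

		flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));

		/*
		 * Update the blocked load only if no SCHED_SOFTIRQ is
		 * about to happen, i.e. NOHZ_NEWILB_KICK was the only
		 * flag set, and no task is about to run.
		 */
		if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
			_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
	}

So skipping the NOHZ_NEWILB_KICK in newidle_balance() directly reduces
how often this update runs before entering idle.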

> The CPU utilization is actually similar to that with just the v3 patch.
> In both cases it is:
> 81% user
> 12% kernel
> 2%  idle
> 5%  waiting for IO
>
>
> A profile of key functions in load balancing shows a little more CPU
> utilization, which is unexpected as we are cutting newidle_balance()
> short.
>
> patch v3
>      0.56%          [k] __update_load_avg_cfs_rq
>      0.51%          [k] update_load_avg
>      0.39%          [k] update_blocked_averages
>      0.36%          [k] __update_load_avg_se
>      0.05%          [k] newidle_balance
>
> patch v3 + skip_new_idle_update_stats
>      0.58%          [k] __update_load_avg_cfs_rq
>      0.53%          [k] update_load_avg
>      0.40%          [k] update_blocked_averages
>      0.37%          [k] __update_load_avg_se
>      0.06%          [k] newidle_balance
>
> Context switch frequency is lower by 4% with the skip_new_idle_update_stats
> patch.
>
> Thanks.
>
> Tim
>
