linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched/fair: Fix nohz.next_balance update
@ 2020-05-03  8:34 Peng Liu
  2020-05-04  0:10 ` Valentin Schneider
  2020-05-04 15:17 ` Vincent Guittot
  0 siblings, 2 replies; 21+ messages in thread
From: Peng Liu @ 2020-05-03  8:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, valentin.schneider, iwtbavbm

commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
During idle load balance, this_cpu(ilb) do load balance for the other
idle CPUs, also gather the earliest (nohz.)next_balance.

Since commit:
  'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'

We update nohz.next_balance like this:

  _nohz_idle_balance() {
      for_each_cpu(nohz.idle_cpus_mask) {
    	  rebalance_domains() {
      	      update nohz.next_balance <-- compare and update
    	  }
      }
      rebalance_domains(this_cpu) {
	  update nohz.next_balance <-- compare and update
      }
      update nohz.next_balance <-- unconditionally update
  }

For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
cpu5. After the above loop we could gather the earliest *next_balance*
among {cpu2,3,8}, then rebalance_domains(this_cpu) update
nohz.next_balance with this_rq->next_balance, but finally overwrite
nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
we may end up with not getting the earliest next_balance.

Since we can gather all the updated rq->next_balance, including this_cpu,
in _nohz_idle_balance(), it's safe to remove the extra lines in
rebalance_domains() which are originally intended for this_cpu. And
finally the updating only happen in _nohz_idle_balance().

Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..1d0cf33fefad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	 * When the cpu is attached to null domain for ex, it will not be
 	 * updated.
 	 */
-	if (likely(update_next_balance)) {
+	if (likely(update_next_balance))
 		rq->next_balance = next_balance;
-
-#ifdef CONFIG_NO_HZ_COMMON
-		/*
-		 * If this CPU has been elected to perform the nohz idle
-		 * balance. Other idle CPUs have already rebalanced with
-		 * nohz_idle_balance() and nohz.next_balance has been
-		 * updated accordingly. This CPU is now running the idle load
-		 * balance for itself and we need to update the
-		 * nohz.next_balance accordingly.
-		 */
-		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
-			nohz.next_balance = rq->next_balance;
-#endif
-	}
 }
 
 static inline int on_null_domain(struct rq *rq)
@@ -10321,9 +10307,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 		has_blocked_load |= this_rq->has_blocked_load;
 	}
 
-	if (flags & NOHZ_BALANCE_KICK)
+	if (flags & NOHZ_BALANCE_KICK) {
 		rebalance_domains(this_rq, CPU_IDLE);
 
+		if (time_after(next_balance, this_rq->next_balance)) {
+			next_balance = this_rq->next_balance;
+			update_next_balance = 1;
+		}
+	}
+
 	WRITE_ONCE(nohz.next_blocked,
 		now + msecs_to_jiffies(LOAD_AVG_PERIOD));
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-03  8:34 [PATCH] sched/fair: Fix nohz.next_balance update Peng Liu
@ 2020-05-04  0:10 ` Valentin Schneider
  2020-05-05 12:36   ` Peng Liu
  2020-05-04 15:17 ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2020-05-04  0:10 UTC (permalink / raw)
  To: Peng Liu
  Cc: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman


Hi,

On 03/05/20 09:34, Peng Liu wrote:
> commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")

I got confused because this has the same topic as your patch, but that's a
genuine commit from 2015. Is this meant to be a "Fixes:" reference?

> During idle load balance, this_cpu(ilb) do load balance for the other
> idle CPUs, also gather the earliest (nohz.)next_balance.
>
> Since commit:
>   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
>
> We update nohz.next_balance like this:
>
>   _nohz_idle_balance() {
>       for_each_cpu(nohz.idle_cpus_mask) {
>         rebalance_domains() {
>                     update nohz.next_balance <-- compare and update
>         }
>       }
>       rebalance_domains(this_cpu) {
>         update nohz.next_balance <-- compare and update
>       }
>       update nohz.next_balance <-- unconditionally update
>   }
>
> For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> cpu5. After the above loop we could gather the earliest *next_balance*
> among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> nohz.next_balance with this_rq->next_balance, but finally overwrite
> nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> we may end up with not getting the earliest next_balance.
>

That does look like it, nice catch!

> Since we can gather all the updated rq->next_balance, including this_cpu,
> in _nohz_idle_balance(), it's safe to remove the extra lines in
> rebalance_domains() which are originally intended for this_cpu. And
> finally the updating only happen in _nohz_idle_balance().
>

One added benefit of this is that we get rid of extra writes to
nohz.next_balance, since that special case in rebalance_domains() could be
hit by all NOHZ CPUs, not just the ILB.

With the below comment taken into account:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..1d0cf33fefad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>        * When the cpu is attached to null domain for ex, it will not be
>        * updated.
>        */
> -	if (likely(update_next_balance)) {
> +	if (likely(update_next_balance))
>               rq->next_balance = next_balance;
> -
> -#ifdef CONFIG_NO_HZ_COMMON
> -		/*
> -		 * If this CPU has been elected to perform the nohz idle
> -		 * balance. Other idle CPUs have already rebalanced with
> -		 * nohz_idle_balance() and nohz.next_balance has been
> -		 * updated accordingly. This CPU is now running the idle load
> -		 * balance for itself and we need to update the
> -		 * nohz.next_balance accordingly.
> -		 */
> -		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> -			nohz.next_balance = rq->next_balance;
> -#endif
> -	}
>  }
>
>  static inline int on_null_domain(struct rq *rq)
> @@ -10321,9 +10307,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>               has_blocked_load |= this_rq->has_blocked_load;
>       }
>
> -	if (flags & NOHZ_BALANCE_KICK)
> +	if (flags & NOHZ_BALANCE_KICK) {
>               rebalance_domains(this_rq, CPU_IDLE);
>
> +		if (time_after(next_balance, this_rq->next_balance)) {
> +			next_balance = this_rq->next_balance;
> +			update_next_balance = 1;
> +		}
> +	}

To align with what we do for the other NOHZ CPUs, shouldn't this update
be outside of the NOHZ_BALANCE_KICK condition? That way we can update
nohz.next_balance with just NOHZ_STATS_KICK, which IMO is the expected
course of action.

> +
>       WRITE_ONCE(nohz.next_blocked,
>               now + msecs_to_jiffies(LOAD_AVG_PERIOD));

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-03  8:34 [PATCH] sched/fair: Fix nohz.next_balance update Peng Liu
  2020-05-04  0:10 ` Valentin Schneider
@ 2020-05-04 15:17 ` Vincent Guittot
  2020-05-04 15:48   ` Valentin Schneider
                     ` (3 more replies)
  1 sibling, 4 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-05-04 15:17 UTC (permalink / raw)
  To: Peng Liu
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
>
> commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> During idle load balance, this_cpu(ilb) do load balance for the other
> idle CPUs, also gather the earliest (nohz.)next_balance.
>
> Since commit:
>   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
>
> We update nohz.next_balance like this:
>
>   _nohz_idle_balance() {
>       for_each_cpu(nohz.idle_cpus_mask) {
>           rebalance_domains() {
>               update nohz.next_balance <-- compare and update
>           }
>       }
>       rebalance_domains(this_cpu) {
>           update nohz.next_balance <-- compare and update
>       }
>       update nohz.next_balance <-- unconditionally update
>   }
>
> For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> cpu5. After the above loop we could gather the earliest *next_balance*
> among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> nohz.next_balance with this_rq->next_balance, but finally overwrite
> nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> we may end up with not getting the earliest next_balance.
>
> Since we can gather all the updated rq->next_balance, including this_cpu,
> in _nohz_idle_balance(), it's safe to remove the extra lines in
> rebalance_domains() which are originally intended for this_cpu. And
> finally the updating only happen in _nohz_idle_balance().

I'm not sure that's always true. Nothing prevents nohz_idle_balance()
to return false . Then run_rebalance_domains() calls
rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
In this case we must keep the code in rebalance_domains().

For example when the tick is not stopped when entering idle. Or when
need_resched() returns true.

So instead of removing the code from rebalance_domains, you should
move the one in _nohz_idle_balance() to make sure that the "if
(likely(update_next_balance)) ..." is called before calling
rebalance_domains for the local cpu



>
> Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..1d0cf33fefad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>          * When the cpu is attached to null domain for ex, it will not be
>          * updated.
>          */
> -       if (likely(update_next_balance)) {
> +       if (likely(update_next_balance))
>                 rq->next_balance = next_balance;
> -
> -#ifdef CONFIG_NO_HZ_COMMON
> -               /*
> -                * If this CPU has been elected to perform the nohz idle
> -                * balance. Other idle CPUs have already rebalanced with
> -                * nohz_idle_balance() and nohz.next_balance has been
> -                * updated accordingly. This CPU is now running the idle load
> -                * balance for itself and we need to update the
> -                * nohz.next_balance accordingly.
> -                */
> -               if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> -                       nohz.next_balance = rq->next_balance;
> -#endif
> -       }
>  }
>
>  static inline int on_null_domain(struct rq *rq)
> @@ -10321,9 +10307,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>                 has_blocked_load |= this_rq->has_blocked_load;
>         }
>
> -       if (flags & NOHZ_BALANCE_KICK)
> +       if (flags & NOHZ_BALANCE_KICK) {
>                 rebalance_domains(this_rq, CPU_IDLE);
>
> +               if (time_after(next_balance, this_rq->next_balance)) {
> +                       next_balance = this_rq->next_balance;
> +                       update_next_balance = 1;
> +               }
> +       }
> +
>         WRITE_ONCE(nohz.next_blocked,
>                 now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-04 15:17 ` Vincent Guittot
@ 2020-05-04 15:48   ` Valentin Schneider
  2020-05-04 16:05   ` Dietmar Eggemann
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2020-05-04 15:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peng Liu, linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman


On 04/05/20 16:17, Vincent Guittot wrote:
> On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
>>
>> commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
>> During idle load balance, this_cpu(ilb) do load balance for the other
>> idle CPUs, also gather the earliest (nohz.)next_balance.
>>
>> Since commit:
>>   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
>>
>> We update nohz.next_balance like this:
>>
>>   _nohz_idle_balance() {
>>       for_each_cpu(nohz.idle_cpus_mask) {
>>           rebalance_domains() {
>>               update nohz.next_balance <-- compare and update
>>           }
>>       }
>>       rebalance_domains(this_cpu) {
>>           update nohz.next_balance <-- compare and update
>>       }
>>       update nohz.next_balance <-- unconditionally update
>>   }
>>
>> For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
>> cpu5. After the above loop we could gather the earliest *next_balance*
>> among {cpu2,3,8}, then rebalance_domains(this_cpu) update
>> nohz.next_balance with this_rq->next_balance, but finally overwrite
>> nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
>> we may end up with not getting the earliest next_balance.
>>
>> Since we can gather all the updated rq->next_balance, including this_cpu,
>> in _nohz_idle_balance(), it's safe to remove the extra lines in
>> rebalance_domains() which are originally intended for this_cpu. And
>> finally the updating only happen in _nohz_idle_balance().
>
> I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> to return false . Then run_rebalance_domains() calls
> rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> In this case we must keep the code in rebalance_domains().
>
> For example when the tick is not stopped when entering idle. Or when
> need_resched() returns true.
>

I had missed that, good points.

> So instead of removing the code from rebalance_domains, you should
> move the one in _nohz_idle_balance() to make sure that the "if
> (likely(update_next_balance)) ..." is called before calling
> rebalance_domains for the local cpu
>

Why not just get rid of the update in _nohz_idle_balance() entirely then?
The nohz.next_balance update in rebalance_domains() will always happen if
it is required (and we have idle == CPU_IDLE), so the extra update in
_nohz_idle_balance() doesn't seem to be any useful.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-04 15:17 ` Vincent Guittot
  2020-05-04 15:48   ` Valentin Schneider
@ 2020-05-04 16:05   ` Dietmar Eggemann
  2020-05-05 13:40   ` Peng Liu
  2020-05-06 10:28   ` Valentin Schneider
  3 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2020-05-04 16:05 UTC (permalink / raw)
  To: Vincent Guittot, Peng Liu
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 04/05/2020 17:17, Vincent Guittot wrote:
> On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
>>
>> commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
>> During idle load balance, this_cpu(ilb) do load balance for the other
>> idle CPUs, also gather the earliest (nohz.)next_balance.
>>
>> Since commit:
>>   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
>>
>> We update nohz.next_balance like this:
>>
>>   _nohz_idle_balance() {
>>       for_each_cpu(nohz.idle_cpus_mask) {
>>           rebalance_domains() {
>>               update nohz.next_balance <-- compare and update
>>           }
>>       }
>>       rebalance_domains(this_cpu) {
>>           update nohz.next_balance <-- compare and update
>>       }
>>       update nohz.next_balance <-- unconditionally update
>>   }
>>
>> For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
>> cpu5. After the above loop we could gather the earliest *next_balance*
>> among {cpu2,3,8}, then rebalance_domains(this_cpu) update
>> nohz.next_balance with this_rq->next_balance, but finally overwrite
>> nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
>> we may end up with not getting the earliest next_balance.
>>
>> Since we can gather all the updated rq->next_balance, including this_cpu,
>> in _nohz_idle_balance(), it's safe to remove the extra lines in
>> rebalance_domains() which are originally intended for this_cpu. And
>> finally the updating only happen in _nohz_idle_balance().
> 
> I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> to return false . Then run_rebalance_domains() calls
> rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> In this case we must keep the code in rebalance_domains().

I came to the same conclusion. It was done like this till v4.0 and IMHO
c5afb6a87f23 fixed it in v4.4.

> For example when the tick is not stopped when entering idle. Or when
> need_resched() returns true.
> 
> So instead of removing the code from rebalance_domains, you should
> move the one in _nohz_idle_balance() to make sure that the "if
> (likely(update_next_balance)) ..." is called before calling
> rebalance_domains for the local cpu

Makes sense, to avoid that we possibly override nohz.next_balance
wrongly (in case time_after(next_balance, this_rq->next_balance);

[...]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-04  0:10 ` Valentin Schneider
@ 2020-05-05 12:36   ` Peng Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Peng Liu @ 2020-05-05 12:36 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, iwtbavbm, linux-kernel

On Mon, May 04, 2020 at 01:10:46AM +0100, Valentin Schneider wrote:
> 
> Hi,
> 
> On 03/05/20 09:34, Peng Liu wrote:
> > commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> 
> I got confused because this has the same topic as your patch, but that's a
> genuine commit from 2015. Is this meant to be a "Fixes:" reference?
> 

Ah, it was careless of me, that's a sane patch from Vincent, which I
referred when tracking the issue, but finally forgot to remove it from
changelog, not related to this patch.

> > During idle load balance, this_cpu(ilb) do load balance for the other
> > idle CPUs, also gather the earliest (nohz.)next_balance.
> >
> > Since commit:
> >   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
> >
> > We update nohz.next_balance like this:
> >
> >   _nohz_idle_balance() {
> >       for_each_cpu(nohz.idle_cpus_mask) {
> >         rebalance_domains() {
> >                     update nohz.next_balance <-- compare and update
> >         }
> >       }
> >       rebalance_domains(this_cpu) {
> >         update nohz.next_balance <-- compare and update
> >       }
> >       update nohz.next_balance <-- unconditionally update
> >   }
> >
> > For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> > cpu5. After the above loop we could gather the earliest *next_balance*
> > among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> > nohz.next_balance with this_rq->next_balance, but finally overwrite
> > nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> > we may end up with not getting the earliest next_balance.
> >
> 
> That does look like it, nice catch!
> 
> > Since we can gather all the updated rq->next_balance, including this_cpu,
> > in _nohz_idle_balance(), it's safe to remove the extra lines in
> > rebalance_domains() which are originally intended for this_cpu. And
> > finally the updating only happen in _nohz_idle_balance().
> >
> 
> One added benefit of this is that we get rid of extra writes to
> nohz.next_balance, since that special case in rebalance_domains() could be
> hit by all NOHZ CPUs, not just the ILB.
> 
> With the below comment taken into account:
> 
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> > Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > ---
> >  kernel/sched/fair.c | 24 ++++++++----------------
> >  1 file changed, 8 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..1d0cf33fefad 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> >        * When the cpu is attached to null domain for ex, it will not be
> >        * updated.
> >        */
> > -	if (likely(update_next_balance)) {
> > +	if (likely(update_next_balance))
> >               rq->next_balance = next_balance;
> > -
> > -#ifdef CONFIG_NO_HZ_COMMON
> > -		/*
> > -		 * If this CPU has been elected to perform the nohz idle
> > -		 * balance. Other idle CPUs have already rebalanced with
> > -		 * nohz_idle_balance() and nohz.next_balance has been
> > -		 * updated accordingly. This CPU is now running the idle load
> > -		 * balance for itself and we need to update the
> > -		 * nohz.next_balance accordingly.
> > -		 */
> > -		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> > -			nohz.next_balance = rq->next_balance;
> > -#endif
> > -	}
> >  }
> >
> >  static inline int on_null_domain(struct rq *rq)
> > @@ -10321,9 +10307,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >               has_blocked_load |= this_rq->has_blocked_load;
> >       }
> >
> > -	if (flags & NOHZ_BALANCE_KICK)
> > +	if (flags & NOHZ_BALANCE_KICK) {
> >               rebalance_domains(this_rq, CPU_IDLE);
> >
> > +		if (time_after(next_balance, this_rq->next_balance)) {
> > +			next_balance = this_rq->next_balance;
> > +			update_next_balance = 1;
> > +		}
> > +	}
> 
> To align with what we do for the other NOHZ CPUs, shouldn't this update
> be outside of the NOHZ_BALANCE_KICK condition? That way we can update
> nohz.next_balance with just NOHZ_STATS_KICK, which IMO is the expected
> course of action.
> 

I think I got your meaning, in _nohz_idle_balance():

  for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
      if (time_after_eq(jiffies, rq->next_balance)) {
	  if (flags & NOHZ_BALANCE_KICK)
	      rebalance_domains(rq, CPU_IDLE);
      }

      if (time_after(next_balance, rq->next_balance)) {
	  next_balance = rq->next_balance;
	  update_next_balance = 1;
      }
  }

nohz.next_balance is the earliest next_balance, so some rqs'
next_balance may be not due, some are due and updated, so we need
take all of them into consideration.

In NOHZ_STAT_KICK case, all rq don't go through rebalance_domains(),
their next_balance are supposed to be unchanged, including this_rq,
so we can safely left nohz.next_balance unchanged.

Thanks for your time!

> > +
> >       WRITE_ONCE(nohz.next_blocked,
> >               now + msecs_to_jiffies(LOAD_AVG_PERIOD));

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-04 15:17 ` Vincent Guittot
  2020-05-04 15:48   ` Valentin Schneider
  2020-05-04 16:05   ` Dietmar Eggemann
@ 2020-05-05 13:40   ` Peng Liu
  2020-05-05 14:27     ` Vincent Guittot
  2020-05-06 10:28   ` Valentin Schneider
  3 siblings, 1 reply; 21+ messages in thread
From: Peng Liu @ 2020-05-05 13:40 UTC (permalink / raw)
  To: vincent.guittot, dietmar.eggemann, valentin.schneider
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, iwtbavbm,
	linux-kernel

On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
> >
> > commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> > During idle load balance, this_cpu(ilb) do load balance for the other
> > idle CPUs, also gather the earliest (nohz.)next_balance.
> >
> > Since commit:
> >   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
> >
> > We update nohz.next_balance like this:
> >
> >   _nohz_idle_balance() {
> >       for_each_cpu(nohz.idle_cpus_mask) {
> >           rebalance_domains() {
> >               update nohz.next_balance <-- compare and update
> >           }
> >       }
> >       rebalance_domains(this_cpu) {
> >           update nohz.next_balance <-- compare and update
> >       }
> >       update nohz.next_balance <-- unconditionally update
> >   }
> >
> > For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> > cpu5. After the above loop we could gather the earliest *next_balance*
> > among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> > nohz.next_balance with this_rq->next_balance, but finally overwrite
> > nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> > we may end up with not getting the earliest next_balance.
> >
> > Since we can gather all the updated rq->next_balance, including this_cpu,
> > in _nohz_idle_balance(), it's safe to remove the extra lines in
> > rebalance_domains() which are originally intended for this_cpu. And
> > finally the updating only happen in _nohz_idle_balance().
> 
> I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> to return false . Then run_rebalance_domains() calls
> rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> In this case we must keep the code in rebalance_domains().
> 
> For example when the tick is not stopped when entering idle. Or when
> need_resched() returns true.
> 
> So instead of removing the code from rebalance_domains, you should
> move the one in _nohz_idle_balance() to make sure that the "if
> (likely(update_next_balance)) ..." is called before calling
> rebalance_domains for the local cpu
> 

Yes, you're right. When need_resched() returns true, things become out
of expectation. We haven't really got the earliest next_balance, abort
the update immediately and let the successor to help. Doubtless this
will incur some overhead due to the repeating work.

About the "tick is not stopped when entering idle" case, defer the
update to nohz_balance_enter_idle() would be a choice too.

Of course, only update nohz.next_balance in rebalance_domains() is the
simpliest way, but as @Valentin put, too many write to it may incur
unnecessary overhead. If we can gather the earliest next_balance in
advance, then a single write is considered to be better.

By the way, remove the redundant check in nohz_idle_balance().

FWIW, how about the below?
***********************************************
* Below code is !!!ENTIRELY UNTESTED!!!, just *
* a draft to see whehter it's sensible!       *
***********************************************
-------------------<-----------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..a7d63ea706ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	 * When the cpu is attached to null domain for ex, it will not be
 	 * updated.
 	 */
-	if (likely(update_next_balance)) {
+	if (likely(update_next_balance))
 		rq->next_balance = next_balance;
-
-#ifdef CONFIG_NO_HZ_COMMON
-		/*
-		 * If this CPU has been elected to perform the nohz idle
-		 * balance. Other idle CPUs have already rebalanced with
-		 * nohz_idle_balance() and nohz.next_balance has been
-		 * updated accordingly. This CPU is now running the idle load
-		 * balance for itself and we need to update the
-		 * nohz.next_balance accordingly.
-		 */
-		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
-			nohz.next_balance = rq->next_balance;
-#endif
-	}
 }

 static inline int on_null_domain(struct rq *rq)
@@ -10218,6 +10204,9 @@ void nohz_balance_enter_idle(int cpu)

 	rq->nohz_tick_stopped = 1;

+	if (time_after(nohz.next_balance, rq->next_balance))
+		nohz.next_balance = rq->next_balance;
+
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 	atomic_inc(&nohz.nr_cpus);

@@ -10287,6 +10276,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 		 */
 		if (need_resched()) {
 			has_blocked_load = true;
+			update_next_balance = 0;
 			goto abort;
 		}

@@ -10321,9 +10311,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 		has_blocked_load |= this_rq->has_blocked_load;
 	}

-	if (flags & NOHZ_BALANCE_KICK)
+	if (flags & NOHZ_BALANCE_KICK) {
 		rebalance_domains(this_rq, CPU_IDLE);

+		if (time_after(next_balance, this_rq->next_balance)) {
+			next_balance = this_rq->next_balance;
+			update_next_balance = 1;
+		}
+	}
+
 	WRITE_ONCE(nohz.next_blocked,
 		now + msecs_to_jiffies(LOAD_AVG_PERIOD));

@@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
 	int this_cpu = this_rq->cpu;
 	unsigned int flags;
-
-	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
-		return false;
+	bool done;

 	if (idle != CPU_IDLE) {
 		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
@@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	if (!(flags & NOHZ_KICK_MASK))
 		return false;

-	_nohz_idle_balance(this_rq, flags, idle);
+	/*
+	 * If idle load balance terinated due to this CPU become busy,
+	 * pretend it has successfully pulled some loads, and abort
+	 * the following load balance.
+	 */
+	done = _nohz_idle_balance(this_rq, flags, idle);
+	if (done == false && need_resched())
+		return true;

-	return true;
+	return done;
 }

 static void nohz_newidle_balance(struct rq *this_rq)

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 13:40   ` Peng Liu
@ 2020-05-05 14:27     ` Vincent Guittot
  2020-05-05 15:16       ` Peng Liu
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-05-05 14:27 UTC (permalink / raw)
  To: Peng Liu
  Cc: dietmar.eggemann, valentin.schneider, mingo, peterz, juri.lelli,
	rostedt, bsegall, mgorman, linux-kernel

Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> > On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
> > >
> > > commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> > > During idle load balance, this_cpu(ilb) do load balance for the other
> > > idle CPUs, also gather the earliest (nohz.)next_balance.
> > >
> > > Since commit:
> > >   'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
> > >
> > > We update nohz.next_balance like this:
> > >
> > >   _nohz_idle_balance() {
> > >       for_each_cpu(nohz.idle_cpus_mask) {
> > >           rebalance_domains() {
> > >               update nohz.next_balance <-- compare and update
> > >           }
> > >       }
> > >       rebalance_domains(this_cpu) {
> > >           update nohz.next_balance <-- compare and update
> > >       }
> > >       update nohz.next_balance <-- unconditionally update
> > >   }
> > >
> > > For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> > > cpu5. After the above loop we could gather the earliest *next_balance*
> > > among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> > > nohz.next_balance with this_rq->next_balance, but finally overwrite
> > > nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> > > we may end up with not getting the earliest next_balance.
> > >
> > > Since we can gather all the updated rq->next_balance, including this_cpu,
> > > in _nohz_idle_balance(), it's safe to remove the extra lines in
> > > rebalance_domains() which are originally intended for this_cpu. And
> > > finally the updating only happen in _nohz_idle_balance().
> > 
> > I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> > to return false . Then run_rebalance_domains() calls
> > rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> > In this case we must keep the code in rebalance_domains().
> > 
> > For example when the tick is not stopped when entering idle. Or when
> > need_resched() returns true.
> > 
> > So instead of removing the code from rebalance_domains, you should
> > move the one in _nohz_idle_balance() to make sure that the "if
> > (likely(update_next_balance)) ..." is called before calling
> > rebalance_domains for the local cpu
> > 
> 
> Yes, you're right. When need_resched() returns true, things become out
> of expectation. We haven't really got the earliest next_balance, abort
> the update immediately and let the successor to help. Doubtless this
> will incur some overhead due to the repeating work.

There should not be some repeating works because CPUs and sched_domain, which
have already been balanced, will not be rebalanced until the next load balance
interval.

Futhermore, there is in fact still work to do bcause not all the idle CPUs got
a chance to pull work

>
> 
> About the "tick is not stopped when entering idle" case, defer the
> update to nohz_balance_enter_idle() would be a choice too.
>
> 
> Of course, only update nohz.next_balance in rebalance_domains() is the
> simpliest way, but as @Valentin put, too many write to it may incur
> unnecessary overhead. If we can gather the earliest next_balance in

This is not really possible because we have to move it to the next interval.

> advance, then a single write is considered to be better.
> 
> By the way, remove the redundant check in nohz_idle_balance().
> 
> FWIW, how about the below?

Your proposal below looks quite complex. IMO, one solution would be to move the
update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
so you are back to the previous behavior.

The only difference is that in case of an break because of need_resched, it
doesn't update nohz.next_balance. But on the other hand, we haven't yet
finished run rebalance_domains for all CPUs and some load_balance are still
pending. In fact, this will be done during next tick by an idle CPU.

So I would be in favor of something as simple as :

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04098d678f3b..e028bc1c4744 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
                }
        }

+       /*
+        * next_balance will be updated only when there is a need.
+        * When the CPU is attached to null domain for ex, it will not be
+        * updated.
+        */
+       if (likely(update_next_balance))
+               nohz.next_balance = next_balance;
+
        /* Newly idle CPU doesn't need an update */
        if (idle != CPU_NEWLY_IDLE) {
                update_blocked_averages(this_cpu);
@@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
        if (has_blocked_load)
                WRITE_ONCE(nohz.has_blocked, 1);

-       /*
-        * next_balance will be updated only when there is a need.
-        * When the CPU is attached to null domain for ex, it will not be
-        * updated.
-        */
-       if (likely(update_next_balance))
-               nohz.next_balance = next_balance;
-
        return ret;
 }

> ***********************************************
> * Below code is !!!ENTIRELY UNTESTED!!!, just *
> * a draft to see whehter it's sensible!       *
> ***********************************************
> -------------------<-----------------------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..a7d63ea706ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  	 * When the cpu is attached to null domain for ex, it will not be
>  	 * updated.
>  	 */
> -	if (likely(update_next_balance)) {
> +	if (likely(update_next_balance))
>  		rq->next_balance = next_balance;
> -
> -#ifdef CONFIG_NO_HZ_COMMON
> -		/*
> -		 * If this CPU has been elected to perform the nohz idle
> -		 * balance. Other idle CPUs have already rebalanced with
> -		 * nohz_idle_balance() and nohz.next_balance has been
> -		 * updated accordingly. This CPU is now running the idle load
> -		 * balance for itself and we need to update the
> -		 * nohz.next_balance accordingly.
> -		 */
> -		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> -			nohz.next_balance = rq->next_balance;
> -#endif
> -	}
>  }
> 
>  static inline int on_null_domain(struct rq *rq)
> @@ -10218,6 +10204,9 @@ void nohz_balance_enter_idle(int cpu)
> 
>  	rq->nohz_tick_stopped = 1;
> 
> +	if (time_after(nohz.next_balance, rq->next_balance))
> +		nohz.next_balance = rq->next_balance;
> +
>  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>  	atomic_inc(&nohz.nr_cpus);
> 
> @@ -10287,6 +10276,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>  		 */
>  		if (need_resched()) {
>  			has_blocked_load = true;
> +			update_next_balance = 0;
>  			goto abort;
>  		}
> 
> @@ -10321,9 +10311,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>  		has_blocked_load |= this_rq->has_blocked_load;
>  	}
> 
> -	if (flags & NOHZ_BALANCE_KICK)
> +	if (flags & NOHZ_BALANCE_KICK) {
>  		rebalance_domains(this_rq, CPU_IDLE);
> 
> +		if (time_after(next_balance, this_rq->next_balance)) {
> +			next_balance = this_rq->next_balance;
> +			update_next_balance = 1;
> +		}
> +	}
> +
>  	WRITE_ONCE(nohz.next_blocked,
>  		now + msecs_to_jiffies(LOAD_AVG_PERIOD));
> 
> @@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  {
>  	int this_cpu = this_rq->cpu;
>  	unsigned int flags;
> -
> -	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> -		return false;

why did you remove this ?

> +	bool done;
> 
>  	if (idle != CPU_IDLE) {
>  		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> @@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	if (!(flags & NOHZ_KICK_MASK))
>  		return false;
> 
> -	_nohz_idle_balance(this_rq, flags, idle);
> +	/*
> +	 * If idle load balance terinated due to this CPU become busy,
> +	 * pretend it has successfully pulled some loads, and abort
> +	 * the following load balance.
> +	 */
> +	done = _nohz_idle_balance(this_rq, flags, idle);
> +	if (done == false && need_resched())
> +		return true;
> 
> -	return true;
> +	return done;
>  }
> 
>  static void nohz_newidle_balance(struct rq *this_rq)

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 14:27     ` Vincent Guittot
@ 2020-05-05 15:16       ` Peng Liu
  2020-05-05 15:43         ` Vincent Guittot
  2020-05-06 10:29       ` Valentin Schneider
  2020-05-08 13:01       ` Peng Liu
  2 siblings, 1 reply; 21+ messages in thread
From: Peng Liu @ 2020-05-05 15:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, iwtbavbm,
	linux-kernel, dietmar.eggemann, valentin.schneider

On Tue, May 05, 2020 at 04:27:11PM +0200, Vincent Guittot wrote:
> Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> > On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> > > On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
> > > >

[...]

> > Yes, you're right. When need_resched() returns true, things become out
> > of expectation. We haven't really got the earliest next_balance, abort
> > the update immediately and let the successor to help. Doubtless this
> > will incur some overhead due to the repeating work.
> 
> There should not be some repeating works because CPUs and sched_domain, which
> have already been balanced, will not be rebalanced until the next load balance
> interval.
> 
> Futhermore, there is in fact still work to do bcause not all the idle CPUs got
> a chance to pull work
> 
> >
> > 
> > About the "tick is not stopped when entering idle" case, defer the
> > update to nohz_balance_enter_idle() would be a choice too.
> >
> > 
> > Of course, only update nohz.next_balance in rebalance_domains() is the
> > simpliest way, but as @Valentin put, too many write to it may incur
> > unnecessary overhead. If we can gather the earliest next_balance in
> 
> This is not really possible because we have to move it to the next interval.
> 
> > advance, then a single write is considered to be better.
> > 
> > By the way, remove the redundant check in nohz_idle_balance().
> > 
> > FWIW, how about the below?
> 
> Your proposal below looks quite complex. IMO, one solution would be to move the
> update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
> so you are back to the previous behavior.
> 
> The only difference is that in case of an break because of need_resched, it
> doesn't update nohz.next_balance. But on the other hand, we haven't yet
> finished run rebalance_domains for all CPUs and some load_balance are still
> pending. In fact, this will be done during next tick by an idle CPU.
> 
> So I would be in favor of something as simple as :
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04098d678f3b..e028bc1c4744 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>                 }
>         }
> 
> +       /*
> +        * next_balance will be updated only when there is a need.
> +        * When the CPU is attached to null domain for ex, it will not be
> +        * updated.
> +        */
> +       if (likely(update_next_balance))
> +               nohz.next_balance = next_balance;
> +
>         /* Newly idle CPU doesn't need an update */
>         if (idle != CPU_NEWLY_IDLE) {
>                 update_blocked_averages(this_cpu);
> @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>         if (has_blocked_load)
>                 WRITE_ONCE(nohz.has_blocked, 1);
> 
> -       /*
> -        * next_balance will be updated only when there is a need.
> -        * When the CPU is attached to null domain for ex, it will not be
> -        * updated.
> -        */
> -       if (likely(update_next_balance))
> -               nohz.next_balance = next_balance;
> -
>         return ret;
>  }
> 

Indeed, simple and straightforward, it's better.

> > ***********************************************
> > * Below code is !!!ENTIRELY UNTESTED!!!, just *

[...]

> > @@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >  {
> >  	int this_cpu = this_rq->cpu;
> >  	unsigned int flags;
> > -
> > -	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> > -		return false;
> 
> why did you remove this ?
> 

It seems that below 'if' do the same thing, isn't?

/* could be _relaxed() */
flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
if (!(flags & NOHZ_KICK_MASK))
        return false;

> > +	bool done;
> > 
> >  	if (idle != CPU_IDLE) {
> >  		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> > @@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >  	if (!(flags & NOHZ_KICK_MASK))
> >  		return false;
> > 

[...]

> >  static void nohz_newidle_balance(struct rq *this_rq)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 15:16       ` Peng Liu
@ 2020-05-05 15:43         ` Vincent Guittot
  2020-05-05 16:08           ` Peng Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-05-05 15:43 UTC (permalink / raw)
  To: Peng Liu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Dietmar Eggemann,
	Valentin Schneider

On Tue, 5 May 2020 at 17:16, Peng Liu <iwtbavbm@gmail.com> wrote:
>
> On Tue, May 05, 2020 at 04:27:11PM +0200, Vincent Guittot wrote:
> > Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> > > On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> > > > On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
> > > > >
>
> [...]
>
> > > Yes, you're right. When need_resched() returns true, things become out
> > > of expectation. We haven't really got the earliest next_balance, abort
> > > the update immediately and let the successor to help. Doubtless this
> > > will incur some overhead due to the repeating work.
> >
> > There should not be some repeating works because CPUs and sched_domain, which
> > have already been balanced, will not be rebalanced until the next load balance
> > interval.
> >
> > Futhermore, there is in fact still work to do bcause not all the idle CPUs got
> > a chance to pull work
> >
> > >
> > >
> > > About the "tick is not stopped when entering idle" case, defer the
> > > update to nohz_balance_enter_idle() would be a choice too.
> > >
> > >
> > > Of course, only update nohz.next_balance in rebalance_domains() is the
> > > simpliest way, but as @Valentin put, too many write to it may incur
> > > unnecessary overhead. If we can gather the earliest next_balance in
> >
> > This is not really possible because we have to move it to the next interval.
> >
> > > advance, then a single write is considered to be better.
> > >
> > > By the way, remove the redundant check in nohz_idle_balance().
> > >
> > > FWIW, how about the below?
> >
> > Your proposal below looks quite complex. IMO, one solution would be to move the
> > update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
> > so you are back to the previous behavior.
> >
> > The only difference is that in case of an break because of need_resched, it
> > doesn't update nohz.next_balance. But on the other hand, we haven't yet
> > finished run rebalance_domains for all CPUs and some load_balance are still
> > pending. In fact, this will be done during next tick by an idle CPU.
> >
> > So I would be in favor of something as simple as :
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04098d678f3b..e028bc1c4744 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >                 }
> >         }
> >
> > +       /*
> > +        * next_balance will be updated only when there is a need.
> > +        * When the CPU is attached to null domain for ex, it will not be
> > +        * updated.
> > +        */
> > +       if (likely(update_next_balance))
> > +               nohz.next_balance = next_balance;
> > +
> >         /* Newly idle CPU doesn't need an update */
> >         if (idle != CPU_NEWLY_IDLE) {
> >                 update_blocked_averages(this_cpu);
> > @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >         if (has_blocked_load)
> >                 WRITE_ONCE(nohz.has_blocked, 1);
> >
> > -       /*
> > -        * next_balance will be updated only when there is a need.
> > -        * When the CPU is attached to null domain for ex, it will not be
> > -        * updated.
> > -        */
> > -       if (likely(update_next_balance))
> > -               nohz.next_balance = next_balance;
> > -
> >         return ret;
> >  }
> >
>
> Indeed, simple and straightforward, it's better.
>
> > > ***********************************************
> > > * Below code is !!!ENTIRELY UNTESTED!!!, just *
>
> [...]
>
> > > @@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > >  {
> > >     int this_cpu = this_rq->cpu;
> > >     unsigned int flags;
> > > -
> > > -   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> > > -           return false;
> >
> > why did you remove this ?
> >
>
> It seems that below 'if' do the same thing, isn't?

The test above is an optimization for the most common case

>
> /* could be _relaxed() */
> flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> if (!(flags & NOHZ_KICK_MASK))
>         return false;
>
> > > +   bool done;
> > >
> > >     if (idle != CPU_IDLE) {
> > >             atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> > > @@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > >     if (!(flags & NOHZ_KICK_MASK))
> > >             return false;
> > >
>
> [...]
>
> > >  static void nohz_newidle_balance(struct rq *this_rq)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 15:43         ` Vincent Guittot
@ 2020-05-05 16:08           ` Peng Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Peng Liu @ 2020-05-05 16:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, iwtbavbm,
	linux-kernel, dietmar.eggemann, valentin.schneider

On Tue, May 05, 2020 at 05:43:04PM +0200, Vincent Guittot wrote:
> On Tue, 5 May 2020 at 17:16, Peng Liu <iwtbavbm@gmail.com> wrote:
> >
> > On Tue, May 05, 2020 at 04:27:11PM +0200, Vincent Guittot wrote:
> > > Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> > > > On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> > > > > On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@gmail.com> wrote:
> > > > > >
> >
> > [...]
> >
> > > > Yes, you're right. When need_resched() returns true, things become out
> > > > of expectation. We haven't really got the earliest next_balance, abort
> > > > the update immediately and let the successor to help. Doubtless this
> > > > will incur some overhead due to the repeating work.
> > >
> > > There should not be some repeating works because CPUs and sched_domain, which
> > > have already been balanced, will not be rebalanced until the next load balance
> > > interval.
> > >
> > > Futhermore, there is in fact still work to do bcause not all the idle CPUs got
> > > a chance to pull work
> > >
> > > >
> > > >
> > > > About the "tick is not stopped when entering idle" case, defer the
> > > > update to nohz_balance_enter_idle() would be a choice too.
> > > >
> > > >
> > > > Of course, only update nohz.next_balance in rebalance_domains() is the
> > > > simpliest way, but as @Valentin put, too many write to it may incur
> > > > unnecessary overhead. If we can gather the earliest next_balance in
> > >
> > > This is not really possible because we have to move it to the next interval.
> > >
> > > > advance, then a single write is considered to be better.
> > > >
> > > > By the way, remove the redundant check in nohz_idle_balance().
> > > >
> > > > FWIW, how about the below?
> > >
> > > Your proposal below looks quite complex. IMO, one solution would be to move the
> > > update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
> > > so you are back to the previous behavior.
> > >
> > > The only difference is that in case of an break because of need_resched, it
> > > doesn't update nohz.next_balance. But on the other hand, we haven't yet
> > > finished run rebalance_domains for all CPUs and some load_balance are still
> > > pending. In fact, this will be done during next tick by an idle CPU.
> > >
> > > So I would be in favor of something as simple as :
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04098d678f3b..e028bc1c4744 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> > >                 }
> > >         }
> > >
> > > +       /*
> > > +        * next_balance will be updated only when there is a need.
> > > +        * When the CPU is attached to null domain for ex, it will not be
> > > +        * updated.
> > > +        */
> > > +       if (likely(update_next_balance))
> > > +               nohz.next_balance = next_balance;
> > > +
> > >         /* Newly idle CPU doesn't need an update */
> > >         if (idle != CPU_NEWLY_IDLE) {
> > >                 update_blocked_averages(this_cpu);
> > > @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> > >         if (has_blocked_load)
> > >                 WRITE_ONCE(nohz.has_blocked, 1);
> > >
> > > -       /*
> > > -        * next_balance will be updated only when there is a need.
> > > -        * When the CPU is attached to null domain for ex, it will not be
> > > -        * updated.
> > > -        */
> > > -       if (likely(update_next_balance))
> > > -               nohz.next_balance = next_balance;
> > > -
> > >         return ret;
> > >  }
> > >
> >
> > Indeed, simple and straightforward, it's better.
> >
> > > > ***********************************************
> > > > * Below code is !!!ENTIRELY UNTESTED!!!, just *
> >
> > [...]
> >
> > > > @@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > > >  {
> > > >     int this_cpu = this_rq->cpu;
> > > >     unsigned int flags;
> > > > -
> > > > -   if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> > > > -           return false;
> > >
> > > why did you remove this ?
> > >
> >
> > It seems that below 'if' do the same thing, isn't?
> 
> The test above is an optimization for the most common case
> 

If the above is for optimization, then we can safely remove the below
test, just atomic_fetch_andnot() is enough, right?

If not, frankly speaking, I really got confused.

> >
> > /* could be _relaxed() */
> > flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> > if (!(flags & NOHZ_KICK_MASK))
> >         return false;
> >
> > > > +   bool done;
> > > >
> > > >     if (idle != CPU_IDLE) {
> > > >             atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> > > > @@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > > >     if (!(flags & NOHZ_KICK_MASK))
> > > >             return false;
> > > >
> >
> > [...]
> >
> > > >  static void nohz_newidle_balance(struct rq *this_rq)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-04 15:17 ` Vincent Guittot
                     ` (2 preceding siblings ...)
  2020-05-05 13:40   ` Peng Liu
@ 2020-05-06 10:28   ` Valentin Schneider
  3 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2020-05-06 10:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peng Liu, linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman


On 04/05/20 16:17, Vincent Guittot wrote:
>> Since we can gather all the updated rq->next_balance, including this_cpu,
>> in _nohz_idle_balance(), it's safe to remove the extra lines in
>> rebalance_domains() which are originally intended for this_cpu. And
>> finally the updating only happen in _nohz_idle_balance().
>
> I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> to return false . Then run_rebalance_domains() calls
> rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> In this case we must keep the code in rebalance_domains().
>
> For example when the tick is not stopped when entering idle. Or when
> need_resched() returns true.
>

Going back to this; nohz_idle_balance() will return true regardless of the
return value of _nohz_idle_balance(), so AFAICT we won't fall through to
the rebalance_domains() in run_rebalance_domains() in case we had
need_resched() in _nohz_idle_balance().

This was changed in b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK");
before then we would always have the local rebalance_domains(). Now, since
the bail out is caused by need_resched(), I think it's not such a crazy
thing *not* to do the local rebalance_domains(), but I wasn't super clear
on all of this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 14:27     ` Vincent Guittot
  2020-05-05 15:16       ` Peng Liu
@ 2020-05-06 10:29       ` Valentin Schneider
  2020-05-06 13:45         ` Vincent Guittot
  2020-05-08 13:01       ` Peng Liu
  2 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2020-05-06 10:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peng Liu, dietmar.eggemann, mingo, peterz, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel


On 05/05/20 15:27, Vincent Guittot wrote:
> So I would be in favor of something as simple as :
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04098d678f3b..e028bc1c4744 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>                 }
>         }
>
> +       /*
> +        * next_balance will be updated only when there is a need.
> +        * When the CPU is attached to null domain for ex, it will not be
> +        * updated.
> +        */
> +       if (likely(update_next_balance))
> +               nohz.next_balance = next_balance;
> +
>         /* Newly idle CPU doesn't need an update */
>         if (idle != CPU_NEWLY_IDLE) {
>                 update_blocked_averages(this_cpu);
> @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>         if (has_blocked_load)
>                 WRITE_ONCE(nohz.has_blocked, 1);
>
> -       /*
> -        * next_balance will be updated only when there is a need.
> -        * When the CPU is attached to null domain for ex, it will not be
> -        * updated.
> -        */
> -       if (likely(update_next_balance))
> -               nohz.next_balance = next_balance;
> -
>         return ret;
>  }
>

But then we may skip an update if we goto abort, no? Imagine we have just
NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
go through the last NOHZ CPU in the loop we hit need_resched(). We would
end in the abort part without any update to nohz.next_balance, despite
having accumulated relevant data in the local next_balance variable.

Also note that in this case, nohz_idle_balance() will still return true.

If we rip out just the one update we need from rebalance_domains(), then
perhaps we could go with what Peng was initially suggesting? i.e. something
like the below.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46b7bd41573f..0a292e0a0731 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9934,22 +9934,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
         * When the cpu is attached to null domain for ex, it will not be
         * updated.
         */
-	if (likely(update_next_balance)) {
+	if (likely(update_next_balance))
                rq->next_balance = next_balance;
-
-#ifdef CONFIG_NO_HZ_COMMON
-		/*
-		 * If this CPU has been elected to perform the nohz idle
-		 * balance. Other idle CPUs have already rebalanced with
-		 * nohz_idle_balance() and nohz.next_balance has been
-		 * updated accordingly. This CPU is now running the idle load
-		 * balance for itself and we need to update the
-		 * nohz.next_balance accordingly.
-		 */
-		if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
-			nohz.next_balance = rq->next_balance;
-#endif
-	}
 }

 static inline int on_null_domain(struct rq *rq)
@@ -10315,6 +10301,11 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
        if (flags & NOHZ_BALANCE_KICK)
                rebalance_domains(this_rq, CPU_IDLE);

+	if (time_after(next_balance, this_rq->next_balance)) {
+		next_balance = this_rq->next_balance;
+		update_next_balance = 1;
+	}
+
        WRITE_ONCE(nohz.next_blocked,
                now + msecs_to_jiffies(LOAD_AVG_PERIOD));

@@ -10551,6 +10542,17 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
        /* normal load balance */
        update_blocked_averages(this_rq->cpu);
        rebalance_domains(this_rq, idle);
+
+#ifdef CONFIG_NO_HZ_COMMON
+	/*
+	 * NOHZ idle CPUs will be rebalanced with nohz_idle_balance() and thus
+	 * nohz.next_balance will be updated accordingly. If there was no NOHZ
+	 * kick, then we just need to update nohz.next_balance wrt *this* CPU.
+	 */
+	if ((idle == CPU_IDLE) &&
+	    time_after(nohz.next_balance, this_rq->next_balance))
+		nohz.next_balance = this_rq->next_balance;
+#endif
 }

 /*
---

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-06 10:29       ` Valentin Schneider
@ 2020-05-06 13:45         ` Vincent Guittot
  2020-05-06 16:02           ` Valentin Schneider
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-05-06 13:45 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peng Liu, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Wed, 6 May 2020 at 12:29, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 05/05/20 15:27, Vincent Guittot wrote:
> > So I would be in favor of something as simple as :
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04098d678f3b..e028bc1c4744 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >                 }
> >         }
> >
> > +       /*
> > +        * next_balance will be updated only when there is a need.
> > +        * When the CPU is attached to null domain for ex, it will not be
> > +        * updated.
> > +        */
> > +       if (likely(update_next_balance))
> > +               nohz.next_balance = next_balance;
> > +
> >         /* Newly idle CPU doesn't need an update */
> >         if (idle != CPU_NEWLY_IDLE) {
> >                 update_blocked_averages(this_cpu);
> > @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >         if (has_blocked_load)
> >                 WRITE_ONCE(nohz.has_blocked, 1);
> >
> > -       /*
> > -        * next_balance will be updated only when there is a need.
> > -        * When the CPU is attached to null domain for ex, it will not be
> > -        * updated.
> > -        */
> > -       if (likely(update_next_balance))
> > -               nohz.next_balance = next_balance;
> > -
> >         return ret;
> >  }
> >
>
> But then we may skip an update if we goto abort, no? Imagine we have just
> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
> go through the last NOHZ CPU in the loop we hit need_resched(). We would
> end in the abort part without any update to nohz.next_balance, despite
> having accumulated relevant data in the local next_balance variable.

Yes but on the other end, the last CPU has not been able to run the
rebalance_domain so we must not move  nohz.next_balance otherwise it
will have to wait for at least another full period
In fact, I think that we have a problem with current implementation
because if we abort because  local cpu because busy we might end up
skipping idle load balance for a lot of idle CPUs

As an example, imagine that we have 10 idle CPUs with the same
rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
starts on CPU0, it processes idle lb for CPU1 but then has to abort
because of need_resched. If we update nohz.next_balance like
currently, the next idle load balance  will happen after a full
balance interval whereas we still have 8 CPUs waiting for running an
idle load balance.

My proposal also fixes this problem

>
> Also note that in this case, nohz_idle_balance() will still return true.
>
> If we rip out just the one update we need from rebalance_domains(), then
> perhaps we could go with what Peng was initially suggesting? i.e. something
> like the below.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 46b7bd41573f..0a292e0a0731 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9934,22 +9934,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>          * When the cpu is attached to null domain for ex, it will not be
>          * updated.
>          */
> -       if (likely(update_next_balance)) {
> +       if (likely(update_next_balance))
>                 rq->next_balance = next_balance;
> -
> -#ifdef CONFIG_NO_HZ_COMMON
> -               /*
> -                * If this CPU has been elected to perform the nohz idle
> -                * balance. Other idle CPUs have already rebalanced with
> -                * nohz_idle_balance() and nohz.next_balance has been
> -                * updated accordingly. This CPU is now running the idle load
> -                * balance for itself and we need to update the
> -                * nohz.next_balance accordingly.
> -                */
> -               if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> -                       nohz.next_balance = rq->next_balance;
> -#endif
> -       }
>  }
>
>  static inline int on_null_domain(struct rq *rq)
> @@ -10315,6 +10301,11 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>         if (flags & NOHZ_BALANCE_KICK)
>                 rebalance_domains(this_rq, CPU_IDLE);
>
> +       if (time_after(next_balance, this_rq->next_balance)) {
> +               next_balance = this_rq->next_balance;
> +               update_next_balance = 1;
> +       }
> +
>         WRITE_ONCE(nohz.next_blocked,
>                 now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>
> @@ -10551,6 +10542,17 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
>         /* normal load balance */
>         update_blocked_averages(this_rq->cpu);
>         rebalance_domains(this_rq, idle);
> +
> +#ifdef CONFIG_NO_HZ_COMMON
> +       /*
> +        * NOHZ idle CPUs will be rebalanced with nohz_idle_balance() and thus
> +        * nohz.next_balance will be updated accordingly. If there was no NOHZ
> +        * kick, then we just need to update nohz.next_balance wrt *this* CPU.
> +        */
> +       if ((idle == CPU_IDLE) &&
> +           time_after(nohz.next_balance, this_rq->next_balance))
> +               nohz.next_balance = this_rq->next_balance;
> +#endif
>  }
>
>  /*
> ---

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-06 13:45         ` Vincent Guittot
@ 2020-05-06 16:02           ` Valentin Schneider
  2020-05-06 16:56             ` Vincent Guittot
  2020-05-07 12:41             ` Peng Liu
  0 siblings, 2 replies; 21+ messages in thread
From: Valentin Schneider @ 2020-05-06 16:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peng Liu, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel


On 06/05/20 14:45, Vincent Guittot wrote:
>> But then we may skip an update if we goto abort, no? Imagine we have just
>> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
>> go through the last NOHZ CPU in the loop we hit need_resched(). We would
>> end in the abort part without any update to nohz.next_balance, despite
>> having accumulated relevant data in the local next_balance variable.
>
> Yes but on the other end, the last CPU has not been able to run the
> rebalance_domain so we must not move  nohz.next_balance otherwise it
> will have to wait for at least another full period
> In fact, I think that we have a problem with current implementation
> because if we abort because  local cpu because busy we might end up
> skipping idle load balance for a lot of idle CPUs
>
> As an example, imagine that we have 10 idle CPUs with the same
> rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
> starts on CPU0, it processes idle lb for CPU1 but then has to abort
> because of need_resched. If we update nohz.next_balance like
> currently, the next idle load balance  will happen after a full
> balance interval whereas we still have 8 CPUs waiting for running an
> idle load balance.
>
> My proposal also fixes this problem
>

That's a very good point; so with NOHZ_BALANCE_KICK we can reduce
nohz.next_balance via rebalance_domains(), and otherwise we would only
increase it if we go through a complete for_each_cpu() loop in
_nohz_idle_balance().

That said, if for some reason we keep bailing out of the loop, we won't
push nohz.next_balance forward and thus may repeatedly nohz-balance only
the first few CPUs in the NOHZ mask. I think that can happen if we have
say 2 tasks pinned to a single rq, in that case nohz_balancer_kick() will
kick a NOHZ balance whenever now >= nohz.next_balance.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-06 16:02           ` Valentin Schneider
@ 2020-05-06 16:56             ` Vincent Guittot
  2020-05-06 20:21               ` Valentin Schneider
  2020-05-07 12:41             ` Peng Liu
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-05-06 16:56 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peng Liu, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Wed, 6 May 2020 at 18:03, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 06/05/20 14:45, Vincent Guittot wrote:
> >> But then we may skip an update if we goto abort, no? Imagine we have just
> >> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
> >> go through the last NOHZ CPU in the loop we hit need_resched(). We would
> >> end in the abort part without any update to nohz.next_balance, despite
> >> having accumulated relevant data in the local next_balance variable.
> >
> > Yes but on the other end, the last CPU has not been able to run the
> > rebalance_domain so we must not move  nohz.next_balance otherwise it
> > will have to wait for at least another full period
> > In fact, I think that we have a problem with current implementation
> > because if we abort because  local cpu because busy we might end up
> > skipping idle load balance for a lot of idle CPUs
> >
> > As an example, imagine that we have 10 idle CPUs with the same
> > rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
> > starts on CPU0, it processes idle lb for CPU1 but then has to abort
> > because of need_resched. If we update nohz.next_balance like
> > currently, the next idle load balance  will happen after a full
> > balance interval whereas we still have 8 CPUs waiting for running an
> > idle load balance.
> >
> > My proposal also fixes this problem
> >
>
> That's a very good point; so with NOHZ_BALANCE_KICK we can reduce
> nohz.next_balance via rebalance_domains(), and otherwise we would only
> increase it if we go through a complete for_each_cpu() loop in
> _nohz_idle_balance().
>
> That said, if for some reason we keep bailing out of the loop, we won't
> push nohz.next_balance forward and thus may repeatedly nohz-balance only
> the first few CPUs in the NOHZ mask. I think that can happen if we have
> say 2 tasks pinned to a single rq, in that case nohz_balancer_kick() will
> kick a NOHZ balance whenever now >= nohz.next_balance.

If we take my example above and we have CPU0 which is idle at every
tick and selected as ilb_cpu but unluckily CPU0 has to abort before
running ilb for CPU1 everytime, I agree that we can end up trying to
run ilb on CPU0 at every tick without any success. We might consider
to kick_ilb in _nohz_idle_balance if we have to abort to let another
CPU handle the ilb

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-06 16:56             ` Vincent Guittot
@ 2020-05-06 20:21               ` Valentin Schneider
  0 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2020-05-06 20:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peng Liu, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel


On 06/05/20 17:56, Vincent Guittot wrote:
> On Wed, 6 May 2020 at 18:03, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>>
>> On 06/05/20 14:45, Vincent Guittot wrote:
>> >> But then we may skip an update if we goto abort, no? Imagine we have just
>> >> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
>> >> go through the last NOHZ CPU in the loop we hit need_resched(). We would
>> >> end in the abort part without any update to nohz.next_balance, despite
>> >> having accumulated relevant data in the local next_balance variable.
>> >
>> > Yes but on the other end, the last CPU has not been able to run the
>> > rebalance_domain so we must not move  nohz.next_balance otherwise it
>> > will have to wait for at least another full period
>> > In fact, I think that we have a problem with current implementation
>> > because if we abort because  local cpu because busy we might end up
>> > skipping idle load balance for a lot of idle CPUs
>> >
>> > As an example, imagine that we have 10 idle CPUs with the same
>> > rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
>> > starts on CPU0, it processes idle lb for CPU1 but then has to abort
>> > because of need_resched. If we update nohz.next_balance like
>> > currently, the next idle load balance  will happen after a full
>> > balance interval whereas we still have 8 CPUs waiting for running an
>> > idle load balance.
>> >
>> > My proposal also fixes this problem
>> >
>>
>> That's a very good point; so with NOHZ_BALANCE_KICK we can reduce
>> nohz.next_balance via rebalance_domains(), and otherwise we would only
>> increase it if we go through a complete for_each_cpu() loop in
>> _nohz_idle_balance().
>>
>> That said, if for some reason we keep bailing out of the loop, we won't
>> push nohz.next_balance forward and thus may repeatedly nohz-balance only
>> the first few CPUs in the NOHZ mask. I think that can happen if we have
>> say 2 tasks pinned to a single rq, in that case nohz_balancer_kick() will
>> kick a NOHZ balance whenever now >= nohz.next_balance.
>
> If we take my example above and we have CPU0 which is idle at every
> tick and selected as ilb_cpu but unluckily CPU0 has to abort before
> running ilb for CPU1 everytime, I agree that we can end up trying to
> run ilb on CPU0 at every tick without any success. We might consider
> to kick_ilb in _nohz_idle_balance if we have to abort to let another
> CPU handle the ilb

That's an idea; maybe something like the next CPU that was due to be
rebalanced (i.e. the one for which we hit the goto abort).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-06 16:02           ` Valentin Schneider
  2020-05-06 16:56             ` Vincent Guittot
@ 2020-05-07 12:41             ` Peng Liu
  2020-05-07 12:53               ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Peng Liu @ 2020-05-07 12:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, iwtbavbm,
	linux-kernel, dietmar.eggemann, vincent.guittot

On Wed, May 06, 2020 at 05:02:56PM +0100, Valentin Schneider wrote:
> 
> On 06/05/20 14:45, Vincent Guittot wrote:
> >> But then we may skip an update if we goto abort, no? Imagine we have just
> >> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
> >> go through the last NOHZ CPU in the loop we hit need_resched(). We would
> >> end in the abort part without any update to nohz.next_balance, despite
> >> having accumulated relevant data in the local next_balance variable.
> >
> > Yes but on the other end, the last CPU has not been able to run the
> > rebalance_domain so we must not move  nohz.next_balance otherwise it
> > will have to wait for at least another full period
> > In fact, I think that we have a problem with current implementation
> > because if we abort because  local cpu because busy we might end up
> > skipping idle load balance for a lot of idle CPUs
> >
> > As an example, imagine that we have 10 idle CPUs with the same
> > rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
> > starts on CPU0, it processes idle lb for CPU1 but then has to abort
> > because of need_resched. If we update nohz.next_balance like
> > currently, the next idle load balance  will happen after a full
> > balance interval whereas we still have 8 CPUs waiting for running an
> > idle load balance.
> >
> > My proposal also fixes this problem
> >
> 
> That's a very good point; so with NOHZ_BALANCE_KICK we can reduce
> nohz.next_balance via rebalance_domains(), and otherwise we would only
> increase it if we go through a complete for_each_cpu() loop in
> _nohz_idle_balance().
> 
> That said, if for some reason we keep bailing out of the loop, we won't
> push nohz.next_balance forward and thus may repeatedly nohz-balance only
> the first few CPUs in the NOHZ mask. I think that can happen if we have
> say 2 tasks pinned to a single rq, in that case nohz_balancer_kick() will
> kick a NOHZ balance whenever now >= nohz.next_balance.

If we face the risk of "repeatly nohz-balance only the first few CPUs",
Maybe we could remember the interrupted CPU and start nohz-balance from
it next time. Just replace the loop in _nohz_idle_balance() like:

for_each_cpu_wrap(cpu, nohz.idle_cpus_mask, nohz.anchor) {
	...
	if (need_resched()) {
		...
		nohz.anchor = cpu;
		...
	}
	...
}

This can mitigate the problem, but this can't help the extreme situation
as @Vincent put, it always failed in the same CPU.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-07 12:41             ` Peng Liu
@ 2020-05-07 12:53               ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-05-07 12:53 UTC (permalink / raw)
  To: Peng Liu
  Cc: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Dietmar Eggemann

On Thu, 7 May 2020 at 14:41, Peng Liu <iwtbavbm@gmail.com> wrote:
>
> On Wed, May 06, 2020 at 05:02:56PM +0100, Valentin Schneider wrote:
> >
> > On 06/05/20 14:45, Vincent Guittot wrote:
> > >> But then we may skip an update if we goto abort, no? Imagine we have just
> > >> NOHZ_STATS_KICK, so we don't call any rebalance_domains(), and then as we
> > >> go through the last NOHZ CPU in the loop we hit need_resched(). We would
> > >> end in the abort part without any update to nohz.next_balance, despite
> > >> having accumulated relevant data in the local next_balance variable.
> > >
> > > Yes but on the other end, the last CPU has not been able to run the
> > > rebalance_domain so we must not move  nohz.next_balance otherwise it
> > > will have to wait for at least another full period
> > > In fact, I think that we have a problem with current implementation
> > > because if we abort because  local cpu because busy we might end up
> > > skipping idle load balance for a lot of idle CPUs
> > >
> > > As an example, imagine that we have 10 idle CPUs with the same
> > > rq->next_balance which equal nohz.next_balance.  _nohz_idle_balance
> > > starts on CPU0, it processes idle lb for CPU1 but then has to abort
> > > because of need_resched. If we update nohz.next_balance like
> > > currently, the next idle load balance  will happen after a full
> > > balance interval whereas we still have 8 CPUs waiting for running an
> > > idle load balance.
> > >
> > > My proposal also fixes this problem
> > >
> >
> > That's a very good point; so with NOHZ_BALANCE_KICK we can reduce
> > nohz.next_balance via rebalance_domains(), and otherwise we would only
> > increase it if we go through a complete for_each_cpu() loop in
> > _nohz_idle_balance().
> >
> > That said, if for some reason we keep bailing out of the loop, we won't
> > push nohz.next_balance forward and thus may repeatedly nohz-balance only
> > the first few CPUs in the NOHZ mask. I think that can happen if we have
> > say 2 tasks pinned to a single rq, in that case nohz_balancer_kick() will
> > kick a NOHZ balance whenever now >= nohz.next_balance.
>
> If we face the risk of "repeatly nohz-balance only the first few CPUs",
> Maybe we could remember the interrupted CPU and start nohz-balance from
> it next time. Just replace the loop in _nohz_idle_balance() like:
>
> for_each_cpu_wrap(cpu, nohz.idle_cpus_mask, nohz.anchor) {
>         ...
>         if (need_resched()) {
>                 ...
>                 nohz.anchor = cpu;
>                 ...
>         }
>         ...
> }
>
> This can mitigate the problem, but this can't help the extreme situation

If we rerun _nohz_idle_balance before the balance interval, the 1st
idle CPUs that has already been balanced will be skipped because there
rq->next_balance will be after jiffies and we will start calling
rebalance_domains for idle CPUs which have not yet been balance.
So I'm not sure that this will help a lot because we have to go
through all idle CPU to set the nohz.next_balance at the end

> as @Vincent put, it always failed in the same CPU.

In the case that i described above, the problem comes from the cpu
that is selected to run the ilb but if we kick ilb again, it will not
be selected again.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-05 14:27     ` Vincent Guittot
  2020-05-05 15:16       ` Peng Liu
  2020-05-06 10:29       ` Valentin Schneider
@ 2020-05-08 13:01       ` Peng Liu
  2020-05-08 15:31         ` Vincent Guittot
  2 siblings, 1 reply; 21+ messages in thread
From: Peng Liu @ 2020-05-08 13:01 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: linux-kernel

On Tue, May 05, 2020 at 04:27:11PM +0200, Vincent Guittot wrote:
> Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> 

[...]

> 
> Your proposal below looks quite complex. IMO, one solution would be to move the
> update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
> so you are back to the previous behavior.
> 
> The only difference is that in case of an break because of need_resched, it
> doesn't update nohz.next_balance. But on the other hand, we haven't yet
> finished run rebalance_domains for all CPUs and some load_balance are still
> pending. In fact, this will be done during next tick by an idle CPU.
> 
> So I would be in favor of something as simple as :
> 

Vincent, could you refine this patch with some changelog?
And have my reported-by if possible.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04098d678f3b..e028bc1c4744 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>                 }
>         }
> 
> +       /*
> +        * next_balance will be updated only when there is a need.
> +        * When the CPU is attached to null domain for ex, it will not be
> +        * updated.
> +        */
> +       if (likely(update_next_balance))
> +               nohz.next_balance = next_balance;
> +
>         /* Newly idle CPU doesn't need an update */
>         if (idle != CPU_NEWLY_IDLE) {
>                 update_blocked_averages(this_cpu);
> @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>         if (has_blocked_load)
>                 WRITE_ONCE(nohz.has_blocked, 1);
> 
> -       /*
> -        * next_balance will be updated only when there is a need.
> -        * When the CPU is attached to null domain for ex, it will not be
> -        * updated.
> -        */
> -       if (likely(update_next_balance))
> -               nohz.next_balance = next_balance;
> -
>         return ret;
>  }
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/fair: Fix nohz.next_balance update
  2020-05-08 13:01       ` Peng Liu
@ 2020-05-08 15:31         ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-05-08 15:31 UTC (permalink / raw)
  To: Peng Liu; +Cc: linux-kernel

On Fri, 8 May 2020 at 15:01, Peng Liu <iwtbavbm@gmail.com> wrote:
>
> On Tue, May 05, 2020 at 04:27:11PM +0200, Vincent Guittot wrote:
> > Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> >
>
> [...]
>
> >
> > Your proposal below looks quite complex. IMO, one solution would be to move the
> > update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
> > so you are back to the previous behavior.
> >
> > The only difference is that in case of an break because of need_resched, it
> > doesn't update nohz.next_balance. But on the other hand, we haven't yet
> > finished run rebalance_domains for all CPUs and some load_balance are still
> > pending. In fact, this will be done during next tick by an idle CPU.
> >
> > So I would be in favor of something as simple as :
> >
>
> Vincent, could you refine this patch with some changelog?

Hi Peng , I'm going to prepare it and another one for the case that we
discussed about kicking a new ilb in case of abort

> And have my reported-by if possible.

Yes I will

>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04098d678f3b..e028bc1c4744 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >                 }
> >         }
> >
> > +       /*
> > +        * next_balance will be updated only when there is a need.
> > +        * When the CPU is attached to null domain for ex, it will not be
> > +        * updated.
> > +        */
> > +       if (likely(update_next_balance))
> > +               nohz.next_balance = next_balance;
> > +
> >         /* Newly idle CPU doesn't need an update */
> >         if (idle != CPU_NEWLY_IDLE) {
> >                 update_blocked_averages(this_cpu);
> > @@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >         if (has_blocked_load)
> >                 WRITE_ONCE(nohz.has_blocked, 1);
> >
> > -       /*
> > -        * next_balance will be updated only when there is a need.
> > -        * When the CPU is attached to null domain for ex, it will not be
> > -        * updated.
> > -        */
> > -       if (likely(update_next_balance))
> > -               nohz.next_balance = next_balance;
> > -
> >         return ret;
> >  }
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-05-08 15:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-03  8:34 [PATCH] sched/fair: Fix nohz.next_balance update Peng Liu
2020-05-04  0:10 ` Valentin Schneider
2020-05-05 12:36   ` Peng Liu
2020-05-04 15:17 ` Vincent Guittot
2020-05-04 15:48   ` Valentin Schneider
2020-05-04 16:05   ` Dietmar Eggemann
2020-05-05 13:40   ` Peng Liu
2020-05-05 14:27     ` Vincent Guittot
2020-05-05 15:16       ` Peng Liu
2020-05-05 15:43         ` Vincent Guittot
2020-05-05 16:08           ` Peng Liu
2020-05-06 10:29       ` Valentin Schneider
2020-05-06 13:45         ` Vincent Guittot
2020-05-06 16:02           ` Valentin Schneider
2020-05-06 16:56             ` Vincent Guittot
2020-05-06 20:21               ` Valentin Schneider
2020-05-07 12:41             ` Peng Liu
2020-05-07 12:53               ` Vincent Guittot
2020-05-08 13:01       ` Peng Liu
2020-05-08 15:31         ` Vincent Guittot
2020-05-06 10:28   ` Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).