linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] Fix stuck overutilized
@ 2021-12-20 11:43 Vincent Donnefort
  2021-12-20 11:43 ` [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent Vincent Donnefort
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Vincent Donnefort @ 2021-12-20 11:43 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, qperret, Vincent Donnefort

The overutilized signal indicates that one of the CPUs of the root domain has a
utilization above a certain threshold (0.8 * cpu_capacity). Its sole purpose is
to disable EAS on asymmetric CPU capacity systems in favor of CAS. Under certain
circumstances, described in the following patches, this signal can stay stuck.

The first two patches of this series intend to fix this blockage, while the
last one is an improvement.

Vincent Donnefort (3):
  sched/fair: Make cpu_overutilized() EAS dependent
  sched/fair: Fix newidle_balance() for overutilized systems
  sched/fair: Do not raise overutilized for idle CPUs

 kernel/sched/fair.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent
  2021-12-20 11:43 [PATCH 0/3] Fix stuck overutilized Vincent Donnefort
@ 2021-12-20 11:43 ` Vincent Donnefort
  2021-12-20 17:17   ` Valentin Schneider
  2021-12-20 11:43 ` [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems Vincent Donnefort
  2021-12-20 11:43 ` [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs Vincent Donnefort
  2 siblings, 1 reply; 14+ messages in thread
From: Vincent Donnefort @ 2021-12-20 11:43 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, qperret, Vincent Donnefort

On a system with Energy Aware Scheduling (EAS), tasks are placed according
to their estimated energy consumption, and load balancing is disabled so as
not to break that energy-biased placement. If the system becomes
overutilized, i.e. one of the CPUs has too much utilization, energy-aware
placement is then disabled in favor of Capacity-Aware Scheduling
(CAS), including load balancing. This is the sole use of
rd->overutilized. Hence, there is no need to raise it for !EAS systems.

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..e2f6fa14e5e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5511,7 +5511,8 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static inline bool cpu_overutilized(int cpu)
 {
-	return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
+	return sched_energy_enabled() &&
+	       !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
 }
 
 static inline void update_overutilized_status(struct rq *rq)
-- 
2.25.1



* [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems
  2021-12-20 11:43 [PATCH 0/3] Fix stuck overutilized Vincent Donnefort
  2021-12-20 11:43 ` [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent Vincent Donnefort
@ 2021-12-20 11:43 ` Vincent Donnefort
  2021-12-20 17:17   ` Valentin Schneider
  2021-12-22  8:14   ` Vincent Guittot
  2021-12-20 11:43 ` [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs Vincent Donnefort
  2 siblings, 2 replies; 14+ messages in thread
From: Vincent Donnefort @ 2021-12-20 11:43 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, qperret, Vincent Donnefort

On Energy-Aware Scheduling systems, load balancing is disabled in favor of
energy-based placement, until one of the CPUs is identified as being
overutilized. Once the overutilization is resolved, two paths can lead to
marking the system as non overutilized again:

  * load_balance() triggered from newidle_balance().
  * load_balance() triggered from the scheduler tick.

However, there is a small caveat for each of those paths. newidle_balance()
needs rd->overload set to run load_balance(), while the load_balance()
triggered by the scheduler tick needs to run from the first idle CPU of the
root domain (see should_we_balance()).

Overutilized can be triggered without setting overload (this can happen
for a CPU which had a misfit task but didn't have its util_avg updated
yet). Then, only the scheduler tick could help to reset overutilized...
but if most of the CPUs are idle, it is very unlikely load_balance() would
run on the only CPU which can reset the flag. This means the root domain
can spuriously remain overutilized for a long period of time.

We then need newidle_balance() to proceed with balancing if the system is
overutilized.

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2f6fa14e5e7..51f6f55abb37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10849,7 +10849,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (!READ_ONCE(this_rq->rd->overload) ||
+	if ((!READ_ONCE(this_rq->rd->overload) &&
+	    !READ_ONCE(this_rq->rd->overutilized)) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-- 
2.25.1



* [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2021-12-20 11:43 [PATCH 0/3] Fix stuck overutilized Vincent Donnefort
  2021-12-20 11:43 ` [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent Vincent Donnefort
  2021-12-20 11:43 ` [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems Vincent Donnefort
@ 2021-12-20 11:43 ` Vincent Donnefort
  2021-12-20 17:17   ` Valentin Schneider
  2021-12-22  8:20   ` Vincent Guittot
  2 siblings, 2 replies; 14+ messages in thread
From: Vincent Donnefort @ 2021-12-20 11:43 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, valentin.schneider,
	morten.rasmussen, qperret, Vincent Donnefort

During a migration, the lock for the previous runqueue is not taken and
hence, the task's contribution isn't directly removed from that runqueue's
utilization but instead temporarily saved, until the next PELT signal
update, where it is accounted. There is then a window in which a
CPU can be idle but nonetheless overutilized.

The load balancer wouldn't be able to do anything to help a sleeping CPU;
raising overutilized there brings no gain, only the risk of doing it
spuriously.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51f6f55abb37..37f737c5f0b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
-
-		if (nr_running > 1)
-			*sg_status |= SG_OVERLOAD;
-
-		if (cpu_overutilized(i))
-			*sg_status |= SG_OVERUTILIZED;
-
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
+		if (nr_running > 1)
+			*sg_status |= SG_OVERLOAD;
+
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
 		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
-			/* Idle cpu can't have misfit task */
+			/*
+			 * Idle cpu can neither be overutilized nor have a
+			 * misfit task.
+			 */
 			continue;
 		}
 
+		if (cpu_overutilized(i))
+			*sg_status |= SG_OVERUTILIZED;
+
 		if (local_group)
 			continue;
 
-- 
2.25.1



* Re: [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent
  2021-12-20 11:43 ` [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent Vincent Donnefort
@ 2021-12-20 17:17   ` Valentin Schneider
  2021-12-21  9:09     ` Vincent Donnefort
  0 siblings, 1 reply; 14+ messages in thread
From: Valentin Schneider @ 2021-12-20 17:17 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, qperret,
	Vincent Donnefort

On 20/12/21 12:43, Vincent Donnefort wrote:
> On a system with Energy Aware Scheduling (EAS), tasks are placed according
> to their estimated energy consumption, and load balancing is disabled so as
> not to break that energy-biased placement. If the system becomes
> overutilized, i.e. one of the CPUs has too much utilization, energy-aware
> placement is then disabled in favor of Capacity-Aware Scheduling
> (CAS), including load balancing. This is the sole use of
> rd->overutilized. Hence, there is no need to raise it for !EAS systems.
>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")

I'm not sure a Fixes: tag is warranted; this does not fix any misbehaviour or
performance regression (even if this might gain us a few extra IPS by not
writing 1's to rd->overutilized on SMP systems; note that this still gives
us writes of 0's).

Regardless:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..e2f6fa14e5e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5511,7 +5511,8 @@ static inline void hrtick_update(struct rq *rq)
>  #ifdef CONFIG_SMP
>  static inline bool cpu_overutilized(int cpu)
>  {
> -	return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
> +	return sched_energy_enabled() &&
> +	       !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
>  }
>
>  static inline void update_overutilized_status(struct rq *rq)
> --
> 2.25.1


* Re: [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems
  2021-12-20 11:43 ` [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems Vincent Donnefort
@ 2021-12-20 17:17   ` Valentin Schneider
  2021-12-22  8:14   ` Vincent Guittot
  1 sibling, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2021-12-20 17:17 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, qperret,
	Vincent Donnefort

On 20/12/21 12:43, Vincent Donnefort wrote:
> On Energy-Aware Scheduling systems, load balancing is disabled in favor of
> energy-based placement, until one of the CPUs is identified as being
> overutilized. Once the overutilization is resolved, two paths can lead to
> marking the system as non overutilized again:
>
>   * load_balance() triggered from newidle_balance().
>   * load_balance() triggered from the scheduler tick.
>
> However, there is a small caveat for each of those paths. newidle_balance()
> needs rd->overload set to run load_balance(), while the load_balance()
> triggered by the scheduler tick needs to run from the first idle CPU of the
> root domain (see should_we_balance()).
>
> Overutilized can be triggered without setting overload (this can happen
> for a CPU which had a misfit task but didn't have its util_avg updated
> yet). Then, only the scheduler tick could help to reset overutilized...
> but if most of the CPUs are idle, it is very unlikely load_balance() would
> run on the only CPU which can reset the flag. This means the root domain
> can spuriously remain overutilized for a long period of time.
>
> We then need newidle_balance() to proceed with balancing if the system is
> overutilized.
>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2f6fa14e5e7..51f6f55abb37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10849,7 +10849,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>       rcu_read_lock();
>       sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> -	if (!READ_ONCE(this_rq->rd->overload) ||
> +	if ((!READ_ONCE(this_rq->rd->overload) &&
> +	    !READ_ONCE(this_rq->rd->overutilized)) ||
>           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
>               if (sd)
> --
> 2.25.1


* Re: [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2021-12-20 11:43 ` [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs Vincent Donnefort
@ 2021-12-20 17:17   ` Valentin Schneider
  2021-12-22  8:20   ` Vincent Guittot
  1 sibling, 0 replies; 14+ messages in thread
From: Valentin Schneider @ 2021-12-20 17:17 UTC (permalink / raw)
  To: Vincent Donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, dietmar.eggemann, morten.rasmussen, qperret,
	Vincent Donnefort

On 20/12/21 12:43, Vincent Donnefort wrote:
> During a migration, the lock for the previous runqueue is not taken and
> hence, the task's contribution isn't directly removed from that runqueue's
> utilization but instead temporarily saved, until the next PELT signal
> update, where it is accounted. There is then a window in which a
> CPU can be idle but nonetheless overutilized.
>
> The load balancer wouldn't be able to do anything to help a sleeping CPU;
> raising overutilized there brings no gain, only the risk of doing it
> spuriously.
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 51f6f55abb37..37f737c5f0b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
>               nr_running = rq->nr_running;
>               sgs->sum_nr_running += nr_running;
> -
> -		if (nr_running > 1)
> -			*sg_status |= SG_OVERLOAD;
> -
> -		if (cpu_overutilized(i))
> -			*sg_status |= SG_OVERUTILIZED;
> -
>  #ifdef CONFIG_NUMA_BALANCING
>               sgs->nr_numa_running += rq->nr_numa_running;
>               sgs->nr_preferred_running += rq->nr_preferred_running;
>  #endif
> +		if (nr_running > 1)
> +			*sg_status |= SG_OVERLOAD;
> +
>               /*
>                * No need to call idle_cpu() if nr_running is not 0
>                */
>               if (!nr_running && idle_cpu(i)) {
>                       sgs->idle_cpus++;
> -			/* Idle cpu can't have misfit task */
> +			/*
> +			 * Idle cpu can neither be overutilized nor have a
> +			 * misfit task.
> +			 */
>                       continue;
>               }
>
> +		if (cpu_overutilized(i))
> +			*sg_status |= SG_OVERUTILIZED;
> +
>               if (local_group)
>                       continue;
>
> --
> 2.25.1


* Re: [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent
  2021-12-20 17:17   ` Valentin Schneider
@ 2021-12-21  9:09     ` Vincent Donnefort
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent Donnefort @ 2021-12-21  9:09 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: peterz, mingo, vincent.guittot, linux-kernel, dietmar.eggemann,
	morten.rasmussen, qperret

On Mon, Dec 20, 2021 at 05:17:09PM +0000, Valentin Schneider wrote:
> On 20/12/21 12:43, Vincent Donnefort wrote:
> > On a system with Energy Aware Scheduling (EAS), tasks are placed according
> > to their estimated energy consumption, and load balancing is disabled so as
> > not to break that energy-biased placement. If the system becomes
> > overutilized, i.e. one of the CPUs has too much utilization, energy-aware
> > placement is then disabled in favor of Capacity-Aware Scheduling
> > (CAS), including load balancing. This is the sole use of
> > rd->overutilized. Hence, there is no need to raise it for !EAS systems.
> >
> > Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> 
> I'm not sure a Fixes: tag is warranted; this does not fix any misbehaviour or
> performance regression (even if this might gain us a few extra IPS by not
> writing 1's to rd->overutilized on SMP systems; note that this still gives
> us writes of 0's).

I put the tag to make sure this patch would be taken for stable releases
with:

  [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems

Without the EAS requirement for cpu_overutilized() (patch 1/3), patch 2/3
could lead to useless newidle_balance() calls for !EAS systems.

Maybe that means in the end 2/3 and 1/3 should be squashed?

> 
> Regardless:
> 
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 095b0aa378df..e2f6fa14e5e7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5511,7 +5511,8 @@ static inline void hrtick_update(struct rq *rq)
> >  #ifdef CONFIG_SMP
> >  static inline bool cpu_overutilized(int cpu)
> >  {
> > -	return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
> > +	return sched_energy_enabled() &&
> > +	       !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
> >  }
> >
> >  static inline void update_overutilized_status(struct rq *rq)
> > --
> > 2.25.1


* Re: [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems
  2021-12-20 11:43 ` [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems Vincent Donnefort
  2021-12-20 17:17   ` Valentin Schneider
@ 2021-12-22  8:14   ` Vincent Guittot
  2022-01-10 16:29     ` Vincent Donnefort
  1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2021-12-22  8:14 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

On Mon, 20 Dec 2021 at 12:43, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> On Energy-Aware Scheduling systems, load balancing is disabled in favor of
> energy-based placement, until one of the CPUs is identified as being
> overutilized. Once the overutilization is resolved, two paths can lead to
> marking the system as non overutilized again:
>
>   * load_balance() triggered from newidle_balance().
>   * load_balance() triggered from the scheduler tick.
>
> However, there is a small caveat for each of those paths. newidle_balance()
> needs rd->overload set to run load_balance(), while the load_balance()
> triggered by the scheduler tick needs to run from the first idle CPU of the
> root domain (see should_we_balance()).
>
> Overutilized can be triggered without setting overload (this can happen
> for a CPU which had a misfit task but didn't have its util_avg updated
> yet). Then, only the scheduler tick could help to reset overutilized...
> but if most of the CPUs are idle, it is very unlikely load_balance() would
> run on the only CPU which can reset the flag. This means the root domain

AFAICT, this will happen, as you don't have to run on the only CPU but
rather on behalf of the only CPU, and this is what the idle load balance
(ilb) is doing. So your problem is not running on the only CPU that can
clear overutilized.

> can spuriously remain overutilized for a long period of time.
>
> We then need newidle_balance() to proceed with balancing if the system is
> overutilized.

Always triggering a costly newidle_balance when you are already
overutilized for the sole purpose of clearing overutilized seems to be
a bit overkill.

Furthermore, nothing prevents us from aborting newidle_balance() before
reaching the root domain.

So this doesn't seem like the right way to proceed.

>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2f6fa14e5e7..51f6f55abb37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10849,7 +10849,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>         rcu_read_lock();
>         sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> -       if (!READ_ONCE(this_rq->rd->overload) ||
> +       if ((!READ_ONCE(this_rq->rd->overload) &&
> +           !READ_ONCE(this_rq->rd->overutilized)) ||
>             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
>                 if (sd)
> --
> 2.25.1
>


* Re: [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2021-12-20 11:43 ` [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs Vincent Donnefort
  2021-12-20 17:17   ` Valentin Schneider
@ 2021-12-22  8:20   ` Vincent Guittot
  2022-01-10 16:40     ` Vincent Donnefort
  1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2021-12-22  8:20 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

On Mon, 20 Dec 2021 at 12:43, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> During a migration, the lock for the previous runqueue is not taken and
> hence, the task's contribution isn't directly removed from that runqueue's
> utilization but instead temporarily saved, until the next PELT signal
> update, where it is accounted. There is then a window in which a
> CPU can be idle but nonetheless overutilized.
>
> The load balancer wouldn't be able to do anything to help a sleeping CPU;
> raising overutilized there brings no gain, only the risk of doing it
> spuriously.

But how do you tell the difference between a very short idle time of
an overutilized CPU and an idle CPU with outdated utilization?

Being idle is not a good reason for not being overutilized (i.e. ~80% of
average utilization).

>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 51f6f55abb37..37f737c5f0b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
>                 nr_running = rq->nr_running;
>                 sgs->sum_nr_running += nr_running;
> -
> -               if (nr_running > 1)
> -                       *sg_status |= SG_OVERLOAD;
> -
> -               if (cpu_overutilized(i))
> -                       *sg_status |= SG_OVERUTILIZED;
> -
>  #ifdef CONFIG_NUMA_BALANCING
>                 sgs->nr_numa_running += rq->nr_numa_running;
>                 sgs->nr_preferred_running += rq->nr_preferred_running;
>  #endif
> +               if (nr_running > 1)
> +                       *sg_status |= SG_OVERLOAD;

Why do you move this code related to overload?

> +
>                 /*
>                  * No need to call idle_cpu() if nr_running is not 0
>                  */
>                 if (!nr_running && idle_cpu(i)) {
>                         sgs->idle_cpus++;
> -                       /* Idle cpu can't have misfit task */
> +                       /*
> +                        * Idle cpu can neither be overutilized nor have a
> +                        * misfit task.
> +                        */
>                         continue;
>                 }
>
> +               if (cpu_overutilized(i))
> +                       *sg_status |= SG_OVERUTILIZED;
> +
>                 if (local_group)
>                         continue;
>
> --
> 2.25.1
>


* Re: [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems
  2021-12-22  8:14   ` Vincent Guittot
@ 2022-01-10 16:29     ` Vincent Donnefort
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent Donnefort @ 2022-01-10 16:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

[...]

> 
> > can spuriously remain overutilized for a long period of time.
> >
> > We then need newidle_balance() to proceed with balancing if the system is
> > overutilized.
> 
> Always triggering a costly newidle_balance when you are already
> overutilized for the sole purpose of clearing overutilized seems to be
> a bit overkill.

But the only cases where newidle_balance() would now run while it previously
did not are when overutilized is set but overload is not. This is either a
transient state, for which we do not anticipate more than one stat update, or
the situation where one of the biggest CPUs is overutilized while having
nr_running < 2.

It can indeed add some additional costly calls to newidle_balance(), but they
will not be frequent, especially with the other patch from this series:

  "sched/fair: Do not raise overutilized for idle CPUs"

> 
> Furthermore, nothing prevents us to abort newidle_balance before
> reaching the root domain

should_we_balance() always returns true in the case of newidle. So I suppose you
refer to max_newidle_lb_cost?

> 
> So this doesn't seem like the good way to proceed

What are our other options?

Resolving it in the nohz balancer would need to change should_we_balance().

I also tried solely not raising overutilized when the CPU is idle, but this is
not a solution either: when a task migration is pending, you can end up with
a !idle CPU but with nr_running < 2, so once again overutilized is set but
overload is not.

> 
> >
> > Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e2f6fa14e5e7..51f6f55abb37 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10849,7 +10849,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >         rcu_read_lock();
> >         sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >
> > -       if (!READ_ONCE(this_rq->rd->overload) ||
> > +       if ((!READ_ONCE(this_rq->rd->overload) &&
> > +           !READ_ONCE(this_rq->rd->overutilized)) ||
> >             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> >
> >                 if (sd)
> > --
> > 2.25.1
> >


* Re: [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2021-12-22  8:20   ` Vincent Guittot
@ 2022-01-10 16:40     ` Vincent Donnefort
  2022-01-17 10:45       ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Donnefort @ 2022-01-10 16:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

On Wed, Dec 22, 2021 at 09:20:17AM +0100, Vincent Guittot wrote:
> On Mon, 20 Dec 2021 at 12:43, Vincent Donnefort
> <vincent.donnefort@arm.com> wrote:
> >
> > During a migration, the lock for the previous runqueue is not taken and
> > hence, the task's contribution isn't directly removed from that runqueue's
> > utilization but instead temporarily saved, until the next PELT signal
> > update, where it is accounted. There is then a window in which a
> > CPU can be idle but nonetheless overutilized.
> >
> > The load balancer wouldn't be able to do anything to help a sleeping CPU;
> > raising overutilized there brings no gain, only the risk of doing it
> > spuriously.
> 
> But how do you tell the difference between a very short idle time of
> an overutilized CPU and an idle CPU with outdated utilization?

No distinction here, but if the CPU is idle there's nothing to pull, so the
load balancer wouldn't do anything with this information.

> 
> Being idle is not a good reason for not being overutilized (i.e. ~80% of
> average utilization).
> 
> >
> > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 51f6f55abb37..37f737c5f0b8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >
> >                 nr_running = rq->nr_running;
> >                 sgs->sum_nr_running += nr_running;
> > -
> > -               if (nr_running > 1)
> > -                       *sg_status |= SG_OVERLOAD;
> > -
> > -               if (cpu_overutilized(i))
> > -                       *sg_status |= SG_OVERUTILIZED;
> > -
> >  #ifdef CONFIG_NUMA_BALANCING
> >                 sgs->nr_numa_running += rq->nr_numa_running;
> >                 sgs->nr_preferred_running += rq->nr_preferred_running;
> >  #endif
> > +               if (nr_running > 1)
> > +                       *sg_status |= SG_OVERLOAD;
> 
> Why do you move this code related to overload?

This was a cosmetic change to put the NUMA-related stats next to the other ones.

[...]


* Re: [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2022-01-10 16:40     ` Vincent Donnefort
@ 2022-01-17 10:45       ` Vincent Guittot
  2022-01-17 12:18         ` Vincent Donnefort
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2022-01-17 10:45 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

On Mon, 10 Jan 2022 at 17:40, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> On Wed, Dec 22, 2021 at 09:20:17AM +0100, Vincent Guittot wrote:
> > On Mon, 20 Dec 2021 at 12:43, Vincent Donnefort
> > <vincent.donnefort@arm.com> wrote:
> > >
> > > During a migration, the lock for the previous runqueue is not taken and
> > > hence, the task's contribution isn't directly removed from that runqueue's
> > > utilization but instead temporarily saved, until the next PELT signal
> > > update, where it is accounted. There is then a window in which a
> > > CPU can be idle but nonetheless overutilized.
> > >
> > > The load balancer wouldn't be able to do anything to help a sleeping CPU;
> > > raising overutilized there brings no gain, only the risk of doing it
> > > spuriously.
> >
> > But how do you tell the difference between a very short idle time of
> > an overutilized CPU and an idle CPU with outdated utilization?
>
> No distinction here, but if the CPU is idle there's nothing to pull, so the
> load balancer wouldn't do anything with this information.

The load balancer has never done anything with this information. It is
only used to disable load balancing for EAS and, as mentioned below,
being idle is not a good reason for not being overutilized.

Also, this patch seems to be there just to fix a problem created by the
previous one, which triggers the costly newidle load balance without a
good reason.

>
> >
> > Being idle is not a good reason for not being overutilized (i.e. ~80% of
> > average utilization).
> >
> > >
> > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 51f6f55abb37..37f737c5f0b8 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > >
> > >                 nr_running = rq->nr_running;
> > >                 sgs->sum_nr_running += nr_running;
> > > -
> > > -               if (nr_running > 1)
> > > -                       *sg_status |= SG_OVERLOAD;
> > > -
> > > -               if (cpu_overutilized(i))
> > > -                       *sg_status |= SG_OVERUTILIZED;
> > > -
> > >  #ifdef CONFIG_NUMA_BALANCING
> > >                 sgs->nr_numa_running += rq->nr_numa_running;
> > >                 sgs->nr_preferred_running += rq->nr_preferred_running;
> > >  #endif
> > > +               if (nr_running > 1)
> > > +                       *sg_status |= SG_OVERLOAD;
> >
> > Why do you move this code related to overload ?
>
> This was a cosmetic change to put the NUMA-related stats next to the other ones.

Please don't add unrelated cosmetic changes in a patch

>
> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs
  2022-01-17 10:45       ` Vincent Guittot
@ 2022-01-17 12:18         ` Vincent Donnefort
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent Donnefort @ 2022-01-17 12:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, dietmar.eggemann,
	Valentin.Schneider, Morten.Rasmussen, qperret

On Mon, Jan 17, 2022 at 11:45:33AM +0100, Vincent Guittot wrote:
> On Mon, 10 Jan 2022 at 17:40, Vincent Donnefort
> <vincent.donnefort@arm.com> wrote:
> >
> > On Wed, Dec 22, 2021 at 09:20:17AM +0100, Vincent Guittot wrote:
> > > On Mon, 20 Dec 2021 at 12:43, Vincent Donnefort
> > > <vincent.donnefort@arm.com> wrote:
> > > >
> > > > During a migration, the lock for the previous runqueue is not taken, so
> > > > the task's contribution isn't directly removed from that runqueue's
> > > > utilization; it is instead temporarily saved until the next PELT signal
> > > > update, where it is accounted. There is then a window in which a
> > > > CPU can be idle but nonetheless overutilized.
> > > >
> > > > The load balancer can't do anything to help a sleeping CPU, so raising
> > > > overutilized there brings no gain, only the risk of doing it
> > > > spuriously.
> > >
> > > But how do you distinguish between a very short idle time on an
> > > overutilized CPU and an idle CPU with outdated utilization?
> >
> > No distinction here, but if the CPU is idle there's nothing to pull, so the
> > load balancer wouldn't do anything with this information.
> 
> The load balancer has never done anything with this information. It is
> only used to disable load balancing for EAS and, as mentioned below,
> being idle is not a good reason for not being overutilized.

But what would then be the point of running the load balancer and wasting
time there?

We could alternatively keep OU (for the sake of signal continuity) and bail
out earlier if we know nothing can be done (i.e. OU but idle)?

But still, that doesn't solve the fact that EAS can stay disabled for a while
(until util_avg is properly decayed) and we would waste energy for that
duration (which might not represent a lot of energy, I agree).
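The stale-utilization window described in the commit message can be sketched
with a toy model (the toy_* names are hypothetical, for illustration only;
the real mechanism lives in the PELT removed-load handling in
kernel/sched/fair.c): the migrating task's contribution is stashed and only
folded in at the next update, so in between the CPU is idle yet still
reports its old utilization.

```c
#include <assert.h>

/*
 * Toy model of the window: on migration away, the task's contribution
 * is stashed in removed_util (the previous rq's lock is not taken) and
 * is only subtracted from util_avg at the next PELT update.
 */
struct toy_rq {
	unsigned long util_avg;     /* currently reported utilization */
	unsigned long removed_util; /* pending removals from migrations */
	int nr_running;
};

static void toy_migrate_away(struct toy_rq *rq, unsigned long task_util)
{
	rq->removed_util += task_util; /* deferred, not subtracted yet */
	rq->nr_running--;
}

static void toy_update_load_avg(struct toy_rq *rq)
{
	rq->util_avg -= rq->removed_util; /* catch up at the next update */
	rq->removed_util = 0;
}
```

Between toy_migrate_away() and toy_update_load_avg(), the runqueue is idle
(nr_running == 0) but util_avg still carries the departed task, which is the
window where an idle CPU can look overutilized.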

> 
> Also, this patch seems to be there just to fix a problem created by the
> previous one, which triggers the costly newidle load balance without
> good reason.

Not related. Even without the previous patch (sched/fair: Fix newidle_balance()
for overutilized systems), the load balancer would run when it is not
necessary.

Anyway, it was an attempt to maximize the time during which EAS is enabled, to
save energy. If you think the potential OU discontinuities are too risky, I'll
drop the idea.

> 
> >
> > >
> > > Being idle is not a good reason for not being overutilized (i.e.
> > > utilization above ~80% of the CPU capacity)
> > >
> > > >
> > > > Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 51f6f55abb37..37f737c5f0b8 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8641,26 +8641,28 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > > >
> > > >                 nr_running = rq->nr_running;
> > > >                 sgs->sum_nr_running += nr_running;
> > > > -
> > > > -               if (nr_running > 1)
> > > > -                       *sg_status |= SG_OVERLOAD;
> > > > -
> > > > -               if (cpu_overutilized(i))
> > > > -                       *sg_status |= SG_OVERUTILIZED;
> > > > -
> > > >  #ifdef CONFIG_NUMA_BALANCING
> > > >                 sgs->nr_numa_running += rq->nr_numa_running;
> > > >                 sgs->nr_preferred_running += rq->nr_preferred_running;
> > > >  #endif
> > > > +               if (nr_running > 1)
> > > > +                       *sg_status |= SG_OVERLOAD;
> > >
> > > Why do you move this code related to overload ?
> >
> > This was a cosmetic change to put the NUMA-related stats next to the other ones.
> 
> Please don't add unrelated cosmetic changes in a patch

My bad, I understood the policy was to make cosmetic changes only alongside "useful"
content.

> 
> >
> > [...]


end of thread, other threads:[~2022-01-17 12:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-20 11:43 [PATCH 0/3] Fix stuck overutilized Vincent Donnefort
2021-12-20 11:43 ` [PATCH 1/3] sched/fair: Make cpu_overutilized() EAS dependent Vincent Donnefort
2021-12-20 17:17   ` Valentin Schneider
2021-12-21  9:09     ` Vincent Donnefort
2021-12-20 11:43 ` [PATCH 2/3] sched/fair: Fix newidle_balance() for overutilized systems Vincent Donnefort
2021-12-20 17:17   ` Valentin Schneider
2021-12-22  8:14   ` Vincent Guittot
2022-01-10 16:29     ` Vincent Donnefort
2021-12-20 11:43 ` [PATCH 3/3] sched/fair: Do not raise overutilized for idle CPUs Vincent Donnefort
2021-12-20 17:17   ` Valentin Schneider
2021-12-22  8:20   ` Vincent Guittot
2022-01-10 16:40     ` Vincent Donnefort
2022-01-17 10:45       ` Vincent Guittot
2022-01-17 12:18         ` Vincent Donnefort
