From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>,
	mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, linux-kernel@vger.kernel.org,
	rickyiu@google.com, odin@uged.al
Cc: sachinp@linux.vnet.ibm.com, naresh.kamboju@linaro.org
Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
Date: Tue, 4 Jan 2022 12:47:10 +0100	[thread overview]
Message-ID: <9e526482-905c-e759-8aa6-1ff84bb5b2a3@arm.com> (raw)
In-Reply-To: <20211222093802.22357-2-vincent.guittot@linaro.org>

On 22/12/2021 10:38, Vincent Guittot wrote:

s/util_sum with uti_avg/util_sum with util_avg

[...]

> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)

Shouldn't this be in pelt.h?
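
For reference (quoting from memory, so please double check),
get_pelt_divider() in kernel/sched/pelt.h is currently:

  static inline u32 get_pelt_divider(struct sched_avg *avg)
  {
  	return LOAD_AVG_MAX - 1024 + avg->period_contrib;
  }

so MIN_DIVIDER is exactly the period_contrib == 0 case, which IMHO is
another argument for defining it next to get_pelt_divider() in pelt.h.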

[...]

> @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
>  	 */
>  	divider = get_pelt_divider(&cfs_rq->avg);
>  
> +
>  	/* Set new sched_entity's utilization */
>  	se->avg.util_avg = gcfs_rq->avg.util_avg;
> -	se->avg.util_sum = se->avg.util_avg * divider;
> +	new_sum = se->avg.util_avg * divider;
> +	delta_sum = (long)new_sum - (long)se->avg.util_sum;
> +	se->avg.util_sum = new_sum;
>  
>  	/* Update parent cfs_rq utilization */
> -	add_positive(&cfs_rq->avg.util_avg, delta);
> -	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> +	add_positive(&cfs_rq->avg.util_avg, delta_avg);
> +	add_positive(&cfs_rq->avg.util_sum, delta_sum);
> +
> +	/*
> +	 * Because of rounding, se->util_sum might ends up being +1 more than
> +	 * cfs->util_sum (XXX fix the rounding). Although this is not
> +	 * a problem by itself, detaching a lot of tasks with the rounding
> +	 * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> +	 * becoming null whereas cfs_util_avg is not.
> +	 * Check that util_sum is still above its lower bound for the new
> +	 * util_avg. Given that period_contrib might have moved since the last
> +	 * sync, we are only sure that util_sum must be above or equal to
> +	 *    util_avg * minimum possible divider
> +	 */
> +	cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> +					  cfs_rq->avg.util_avg * MIN_DIVIDER);
>  }
>
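
To make the new delta propagation concrete (made-up numbers, assuming
divider = 47000, i.e. period_contrib = 282): with gcfs_rq->avg.util_avg
= 100 and the se previously synced at util_avg = 60, util_sum =
60 * 46718 = 2803080, we get new_sum = 100 * 47000 = 4700000 and
delta_sum = 1896920. The cfs_rq then only sees the deltas, delta_avg =
40 (computed earlier in the function, not shown in this hunk) and
delta_sum = 1896920, instead of having its util_sum overwritten with
util_avg * divider as before.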

I still wonder whether the regression comes only from the changes in
update_cfs_rq_load_avg() introduced by 1c35b07e6d39, and whether it
could be fixed by this part of the patch-set alone (A):

@@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

    r = removed_load;
    sub_positive(&sa->load_avg, r);
-   sa->load_sum = sa->load_avg * divider;
+   sub_positive(&sa->load_sum, r * divider);
+   sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);

    r = removed_util;
    sub_positive(&sa->util_avg, r);
-   sa->util_sum = sa->util_avg * divider;
+   sub_positive(&sa->util_sum, r * divider);
+   sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

    r = removed_runnable;
    sub_positive(&sa->runnable_avg, r);
-   sa->runnable_sum = sa->runnable_avg * divider;
+   sub_positive(&sa->runnable_sum, r * divider);
+   sa->runnable_sum = max_t(u32, sa->runnable_sum,
+                                 sa->runnable_avg * MIN_DIVIDER);

i.e. w/o changing update_tg_cfs_X() (and
detach_entity_load_avg()/dequeue_load_avg()).

update_load_avg()
  update_cfs_rq_load_avg()    <---
  propagate_entity_load_avg()
    update_tg_cfs_X()         <---


I didn't see the SCHED_WARN_ON() in cfs_rq_is_decayed() when looping
hackbench across several different sched group levels on Hikey620
(Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H, >12h uptime).

Rick is probably in a position to test whether this would be sufficient
to cure the CPU frequency regression.

I can see that you want to use the same _avg/_sum sync in
detach_entity_load_avg()/dequeue_load_avg() as in
update_cfs_rq_load_avg(). (B)

And finally in update_tg_cfs_X() as well plus down-propagating _sum
independently from _avg. (C)

So rather than splitting the patchset by X (util, runnable, load), the
whole change might IMHO be easier to handle when split into (A), (B)
and (C) (obviously only in case (A) cures the regression).

>  static inline void
> @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  
>  		r = removed_util;
>  		sub_positive(&sa->util_avg, r);
> -		sa->util_sum = sa->util_avg * divider;
> +		sub_positive(&sa->util_sum, r * divider);
> +		/* See update_tg_cfs_util() */
> +		sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>  
>  		r = removed_runnable;
>  		sub_positive(&sa->runnable_avg, r);
> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  
>  	dequeue_load_avg(cfs_rq, se);
>  	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> -	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> +	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> +	/* See update_tg_cfs_util() */
> +	cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> +					  cfs_rq->avg.util_avg * MIN_DIVIDER);
> +

Maybe add a:

Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced with *_avg")

[...]

This max_t() should make sure that `_sum is always >= _avg *
MIN_DIVIDER`, which is sometimes not the case (see the short derivation
further below). Currently this is done in

(1) update_cfs_rq_load_avg()
(2) detach_entity_load_avg() and dequeue_load_avg()
(3) update_tg_cfs_X()

but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
reason for this?
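
As a short derivation of the bound (my understanding, please correct me
if I'm wrong): util_avg is maintained as util_sum / divider with
divider = MIN_DIVIDER + period_contrib and 0 <= period_contrib < 1024,
so at the last sync util_sum = util_avg * (MIN_DIVIDER + period_contrib)
>= util_avg * MIN_DIVIDER. E.g. for util_avg = 100, util_sum should stay
within [100 * 46718, 100 * 47741] = [4671800, 4774100]; the max_t()
restores the lower end when the sub_positive() undershoots it.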

Thread overview: 16+ messages
2021-12-22  9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot
2021-12-22  9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot
2022-01-04 11:47   ` Dietmar Eggemann [this message]
2022-01-04 13:42     ` Vincent Guittot
2022-01-05 13:15       ` Dietmar Eggemann
2022-01-05 13:57         ` Vincent Guittot
2022-01-07 11:43           ` Dietmar Eggemann
2022-01-07 15:21             ` Vincent Guittot
2022-01-11  7:54               ` Vincent Guittot
2022-01-11 12:37                 ` Dietmar Eggemann
2022-01-04 13:48     ` Vincent Guittot
2021-12-22  9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot
2022-01-04 11:47   ` Dietmar Eggemann
2021-12-22  9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot
2022-01-04 11:47   ` Dietmar Eggemann
2022-01-04 11:46 ` [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Dietmar Eggemann
