* [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2021-12-22 9:38 UTC
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju, Vincent Guittot
Rick reported performance regressions in bugzilla because the CPU frequency
was lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045
He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded to its lower bound and can lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.
Instead of always setting util_sum to the low bound of util_avg, which can
significantly lower the utilization of root cfs_rq after propagating the
change down into the hierarchy, we revert the commit and propagate the
difference in util_sum.
In addition, we also check that the cfs_rq's util_sum always stays above the
lower bound for a given util_avg, as it has been observed that a
sched_entity's util_sum is sometimes above the cfs_rq's one.
Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..9ac28f0f3645 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}
+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
@@ -3449,15 +3450,14 @@ void set_task_rq_fair(struct sched_entity *se,
* XXX: only do this for the part of runnable > running ?
*
*/
-
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
+ u32 new_sum, divider;
/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;
/*
@@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);
+
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
- se->avg.util_sum = se->avg.util_avg * divider;
+ new_sum = se->avg.util_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.util_sum;
+ se->avg.util_sum = new_sum;
/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ add_positive(&cfs_rq->avg.util_avg, delta_avg);
+ add_positive(&cfs_rq->avg.util_sum, delta_sum);
+
+ /*
+ * Because of rounding, se->util_sum might end up being +1 more than
+ * cfs->util_sum (XXX fix the rounding). Although this is not
+ * a problem by itself, detaching a lot of tasks with the rounding
+ * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
+ * become null whereas cfs->util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * the minimum possible divider.
+ */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
}
static inline void
@@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
@@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
+
sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
--
2.17.1
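To make the truncation described in the changelog concrete, here is a
minimal userspace sketch (not kernel code: LOAD_AVG_MAX matches the kernel
constant, all other values are made up):

#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX 47742

int main(void)
{
        /* divider = LOAD_AVG_MAX - 1024 + period_contrib; assume 0 here */
        uint32_t divider = LOAD_AVG_MAX - 1024;
        /* 40000 of contribution accrued in util_sum but not yet folded
         * into util_avg (util_avg only moves at period boundaries) */
        uint64_t util_sum = 20 * (uint64_t)divider + 40000;
        uint64_t util_avg = util_sum / divider; /* integer division: 20 */

        /* the hard sync of 1c35b07e6d39: snap util_sum to its lower bound */
        uint64_t synced = util_avg * divider;

        printf("contribution lost by the snap: %llu\n",
               (unsigned long long)(util_sum - synced)); /* 40000 */
        return 0;
}

Each such snap can drop up to almost a full divider (hence the "up to
LOAD_AVG_MAX-1" in the changelog), and nothing ever puts the lost
contribution back.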
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Dietmar Eggemann @ 2022-01-04 11:47 UTC
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju
On 22/12/2021 10:38, Vincent Guittot wrote:
s/util_sum with uti_avg/util_sum with util_avg
[...]
> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
Shouldn't this be in pelt.h?
[...]
> @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> */
> divider = get_pelt_divider(&cfs_rq->avg);
>
> +
> /* Set new sched_entity's utilization */
> se->avg.util_avg = gcfs_rq->avg.util_avg;
> - se->avg.util_sum = se->avg.util_avg * divider;
> + new_sum = se->avg.util_avg * divider;
> + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> + se->avg.util_sum = new_sum;
>
> /* Update parent cfs_rq utilization */
> - add_positive(&cfs_rq->avg.util_avg, delta);
> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> +
> + /*
> + * Because of rounding, se->util_sum might end up being +1 more than
> + * cfs->util_sum (XXX fix the rounding). Although this is not
> + * a problem by itself, detaching a lot of tasks with the rounding
> + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> + * become null whereas cfs->util_avg is not.
> + * Check that util_sum is still above its lower bound for the new
> + * util_avg. Given that period_contrib might have moved since the last
> + * sync, we are only sure that util_sum must be above or equal to
> + * util_avg * the minimum possible divider.
> + */
> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> }
>
I still wonder whether the regression only comes from the changes in
update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
And could be fixed only by this part of the patch-set (A):
@@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
*cfs_rq)
r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * MIN_DIVIDER);
i.e. w/o changing update_tg_cfs_X() (and
detach_entity_load_avg()/dequeue_load_avg()).
update_load_avg()
update_cfs_rq_load_avg() <---
propagate_entity_load_avg()
update_tg_cfs_X() <---
I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
hackbench in several different sched group levels on
[Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
Rick is probably in a position to test whether this would be sufficient
to cure the CPU frequency regression.
I can see that you want to use the same _avg/_sum sync in
detach_entity_load_avg()/dequeue_load_avg() as in
update_cfs_rq_load_avg(). (B)
And finally in update_tg_cfs_X() as well plus down-propagating _sum
independently from _avg. (C)
So rather than splitting the patchset into X (util, runnable, load), the whole
change might be easier to handle IMHO when split into (A), (B), (C)
(obviously only in case (A) cures the regression).
> static inline void
> @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> dequeue_load_avg(cfs_rq, se);
> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> +
Maybe add a:
Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
with *_avg")
[...]
This max_t() should make sure that `_sum is always >= _avg *
MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
(1) update_cfs_rq_load_avg()
(2) detach_entity_load_avg() and dequeue_load_avg()
(3) update_tg_cfs_X()
but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
reason for this?
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2022-01-04 13:42 UTC
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> s/util_sum with uti_avg/util_sum with util_avg
yes
>
> [...]
>
> > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
>
> Shouldn't this be in pelt.h?
>
> [...]
>
> > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> > */
> > divider = get_pelt_divider(&cfs_rq->avg);
> >
> > +
> > /* Set new sched_entity's utilization */
> > se->avg.util_avg = gcfs_rq->avg.util_avg;
> > - se->avg.util_sum = se->avg.util_avg * divider;
> > + new_sum = se->avg.util_avg * divider;
> > + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> > + se->avg.util_sum = new_sum;
> >
> > /* Update parent cfs_rq utilization */
> > - add_positive(&cfs_rq->avg.util_avg, delta);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> > + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> > +
> > + /*
> > + * Because of rounding, se->util_sum might end up being +1 more than
> > + * cfs->util_sum (XXX fix the rounding). Although this is not
> > + * a problem by itself, detaching a lot of tasks with the rounding
> > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> > + * become null whereas cfs->util_avg is not.
> > + * Check that util_sum is still above its lower bound for the new
> > + * util_avg. Given that period_contrib might have moved since the last
> > + * sync, we are only sure that util_sum must be above or equal to
> > + * util_avg * the minimum possible divider.
> > + */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > }
> >
>
> I still wonder whether the regression only comes from the changes in
> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> And could be fixed only by this part of the patch-set (A):
I have been able to trigger the warning even with (A), though it took
much more time.
And I have been able to catch wrong situations (with additional
traces) in the 3 places A, B and C
>
> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);
>
> i.e. w/o changing update_tg_cfs_X() (and
> detach_entity_load_avg()/dequeue_load_avg()).
>
> update_load_avg()
> update_cfs_rq_load_avg() <---
> propagate_entity_load_avg()
> update_tg_cfs_X() <---
>
>
> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> hackbench in several different sched group levels on
> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
IIRC, it was with hikey960 with cgroup v1
As a side note, I never trigger the problem with dragonboard845 and cgroup v2
>
> Rick is probably in a position to test whether this would be sufficient
> to cure the CPU frequency regression.
>
> I can see that you want to use the same _avg/_sum sync in
> detach_entity_load_avg()/dequeue_load_avg() as in
> update_cfs_rq_load_avg(). (B)
>
> And finally in update_tg_cfs_X() as well plus down-propagating _sum
> independently from _avg. (C)
>
> So rather than splitting the patchset into X (util, runnable, load), the whole
> change might be easier to handle IMHO when split into (A), (B), (C)
> (obviously only in case (A) cures the regression).
>
> > static inline void
> > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > r = removed_util;
> > sub_positive(&sa->util_avg, r);
> > - sa->util_sum = sa->util_avg * divider;
> > + sub_positive(&sa->util_sum, r * divider);
> > + /* See update_tg_cfs_util() */
> > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >
> > r = removed_runnable;
> > sub_positive(&sa->runnable_avg, r);
> > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > dequeue_load_avg(cfs_rq, se);
> > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > + /* See update_tg_cfs_util() */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > +
>
> Maybe add a:
>
> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> with *_avg")
I spent time thinking about adding a Fixes tag. There is no crash/warn
so far, so should we propagate it back in LTS for better performance?
>
> [...]
>
> This max_t() should make sure that `_sum is always >= _avg *
> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>
> (1) update_cfs_rq_load_avg()
> (2) detach_entity_load_avg() and dequeue_load_avg()
> (3) update_tg_cfs_X()
>
> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> reason for this?
The main reason is that I have never seen the problem there.
Also, the problem comes from subtracting the task's value, whereas here we
always add a positive value.
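(For readers following the thread: sub_positive()/add_positive() are small
fair.c helpers that clamp at zero instead of letting a value underflow.
A self-contained paraphrase below, not the verbatim typeof-based kernel
macros:)

#include <stdint.h>
#include <stdio.h>

/* sub_positive() clamps at 0 on unsigned underflow; add_positive()
 * only clamps when a negative delta would take the value below 0. */
static void sub_positive(uint64_t *ptr, uint64_t val)
{
        uint64_t res = *ptr - val;

        if (res > *ptr)                 /* wrapped around: clamp to 0 */
                res = 0;
        *ptr = res;
}

static void add_positive(uint64_t *ptr, int64_t val)
{
        uint64_t res = *ptr + (uint64_t)val;

        if (val < 0 && res > *ptr)      /* negative delta wrapped: clamp */
                res = 0;
        *ptr = res;
}

int main(void)
{
        uint64_t sum = 100;

        sub_positive(&sum, 250);        /* would go negative: clamped to 0 */
        add_positive(&sum, -5);         /* likewise clamped */
        printf("%llu\n", (unsigned long long)sum);      /* 0 */
        return 0;
}

So the risky direction is subtraction: a detach can clamp one of _sum/_avg
independently of the other, which is why the inconsistent pairs show up on
the remove/detach paths and not on attach/enqueue.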
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Dietmar Eggemann @ 2022-01-05 13:15 UTC
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On 04/01/2022 14:42, Vincent Guittot wrote:
> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 22/12/2021 10:38, Vincent Guittot wrote:
[...]
>> I still wonder whether the regression only comes from the changes in
>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
>> And could be fixed only by this part of the patch-set (A):
>
> I have been able to trigger the warning even with (A), though it took
> much more time.
> And I have been able to catch wrong situations (with additional
> traces) in the 3 places A, B and C
OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?
>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
>> *cfs_rq)
>>
>> r = removed_load;
>> sub_positive(&sa->load_avg, r);
>> - sa->load_sum = sa->load_avg * divider;
>> + sub_positive(&sa->load_sum, r * divider);
>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>>
>> r = removed_util;
>> sub_positive(&sa->util_avg, r);
>> - sa->util_sum = sa->util_avg * divider;
>> + sub_positive(&sa->util_sum, r * divider);
>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>>
>> r = removed_runnable;
>> sub_positive(&sa->runnable_avg, r);
>> - sa->runnable_sum = sa->runnable_avg * divider;
>> + sub_positive(&sa->runnable_sum, r * divider);
>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
>> + sa->runnable_avg * MIN_DIVIDER);
>>
>> i.e. w/o changing update_tg_cfs_X() (and
>> detach_entity_load_avg()/dequeue_load_avg()).
>>
>> update_load_avg()
>> update_cfs_rq_load_avg() <---
>> propagate_entity_load_avg()
>> update_tg_cfs_X() <---
>>
>>
>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
>> hackbench in several different sched group levels on
>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>
> IIRC, it was with hikey960 with cgroup v1
> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
[...]
>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>>
>>> dequeue_load_avg(cfs_rq, se);
>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>>> + /* See update_tg_cfs_util() */
>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
>>> +
>>
>> Maybe add a:
>>
>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
>> with *_avg")
>
> I spent time thinking about adding a Fixes tag. There is no crash/warn
> so far, so should we propagate it back in LTS for better performance?
Not sure I understand. What do you mean by 'should we propagate it back
in LTS'?
[...]
>> This max_t() should make sure that `_sum is always >= _avg *
>> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>>
>> (1) update_cfs_rq_load_avg()
>> (2) detach_entity_load_avg() and dequeue_load_avg()
>> (3) update_tg_cfs_X()
>>
>> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
>> reason for this?
>
> The main reason is that I have never seen the problem there.
> Also, the problem comes from subtracting the task's value, whereas here we
> always add a positive value.
OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values.
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2022-01-05 13:57 UTC
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 04/01/2022 14:42, Vincent Guittot wrote:
> > On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> [...]
>
> >> I still wonder whether the regression only comes from the changes in
> >> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> >> And could be fixed only by this part of the patch-set (A):
> >
> > I have been able to trigger the warning even with (A), though it took
> > much more time.
> > And I have been able to catch wrong situations (with additional
> > traces) in the 3 places A, B and C
>
> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?
Not only.
Also util_sum == 0 but util_avg != 0 in different places, although these
situations didn't trigger the sched_warn because some other sync happened
before the periodic call of __update_blocked_fair;
or if nr_running == 1 and the task's util_avg/sum > cfs' util_avg/sum
just before removing the task.
>
> >> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> >> *cfs_rq)
> >>
> >> r = removed_load;
> >> sub_positive(&sa->load_avg, r);
> >> - sa->load_sum = sa->load_avg * divider;
> >> + sub_positive(&sa->load_sum, r * divider);
> >> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> >>
> >> r = removed_util;
> >> sub_positive(&sa->util_avg, r);
> >> - sa->util_sum = sa->util_avg * divider;
> >> + sub_positive(&sa->util_sum, r * divider);
> >> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >>
> >> r = removed_runnable;
> >> sub_positive(&sa->runnable_avg, r);
> >> - sa->runnable_sum = sa->runnable_avg * divider;
> >> + sub_positive(&sa->runnable_sum, r * divider);
> >> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> >> + sa->runnable_avg * MIN_DIVIDER);
> >>
> >> i.e. w/o changing update_tg_cfs_X() (and
> >> detach_entity_load_avg()/dequeue_load_avg()).
> >>
> >> update_load_avg()
> >> update_cfs_rq_load_avg() <---
> >> propagate_entity_load_avg()
> >> update_tg_cfs_X() <---
> >>
> >>
> >> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> >> hackbench in several different sched group levels on
> >> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> >
> > IIRC, it was with hikey960 with cgroup v1
> > As a side note, I never trigger the problem with dragonboard845 and cgroup v2
>
> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
>
> [...]
>
> >>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>>
> >>> dequeue_load_avg(cfs_rq, se);
> >>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> >>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> >>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> >>> + /* See update_tg_cfs_util() */
> >>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> >>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> >>> +
> >>
> >> Maybe add a:
> >>
> >> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> >> with *_avg")
> >
> > I spent time thinking about adding a Fixes tag. There is no crash/warn
> > so far, so should we propagate it back in LTS for better performance?
>
> Not sure I understand. What do you mean by 'should we propagate it back
> in LTS'?
Sorry, I had all stables in mind and not only LTS.
Some of the changes done in PELT signal propagation, which replace
subtracting util_sum by using util_avg * divider instead, are related
to other problems with sched group hierarchy and
throttling/unthrottling. I'm not 100% confident that using a Fixes tag
to backport this to stables wouldn't require backporting more patches in
other areas in order to not resurrect old problems. So I wonder if
it's worth backporting this to stables.
>
> [...]
>
> >> This max_t() should make sure that `_sum is always >= _avg *
> >> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
> >>
> >> (1) update_cfs_rq_load_avg()
> >> (2) detach_entity_load_avg() and dequeue_load_avg()
> >> (3) update_tg_cfs_X()
> >>
> >> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> >> reason for this?
> >
> > Main reason is that I have never seen the problem.
> > Then, the problem comes from subtracting task's value whereas here we
> > always add positive value
>
> OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values.
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Dietmar Eggemann @ 2022-01-07 11:43 UTC
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On 05/01/2022 14:57, Vincent Guittot wrote:
> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 04/01/2022 14:42, Vincent Guittot wrote:
>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 22/12/2021 10:38, Vincent Guittot wrote:
>>
>> [...]
>>
>>>> I still wonder whether the regression only comes from the changes in
>>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
>>>> And could be fixed only by this part of the patch-set (A):
>>>
>>> I have been able to trigger the warning even with (A), though it took
>>> much more time.
>>> And I have been able to catch wrong situations (with additional
>>> traces) in the 3 places A, B and C
>>
>> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?
>
> Not only.
> Also util_sum == 0 but util_avg != 0 in different places, although these
Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.
> situations didn't trigger the sched_warn because some other sync happened
> before the periodic call of __update_blocked_fair;
> or if nr_running == 1 and the task's util_avg/sum > cfs' util_avg/sum
> just before removing the task.
I see.
>>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
>>>> *cfs_rq)
>>>>
>>>> r = removed_load;
>>>> sub_positive(&sa->load_avg, r);
>>>> - sa->load_sum = sa->load_avg * divider;
>>>> + sub_positive(&sa->load_sum, r * divider);
>>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>>>>
>>>> r = removed_util;
>>>> sub_positive(&sa->util_avg, r);
>>>> - sa->util_sum = sa->util_avg * divider;
>>>> + sub_positive(&sa->util_sum, r * divider);
>>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>>>>
>>>> r = removed_runnable;
>>>> sub_positive(&sa->runnable_avg, r);
>>>> - sa->runnable_sum = sa->runnable_avg * divider;
>>>> + sub_positive(&sa->runnable_sum, r * divider);
>>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
>>>> + sa->runnable_avg * MIN_DIVIDER);
>>>>
>>>> i.e. w/o changing update_tg_cfs_X() (and
>>>> detach_entity_load_avg()/dequeue_load_avg()).
>>>>
>>>> update_load_avg()
>>>> update_cfs_rq_load_avg() <---
>>>> propagate_entity_load_avg()
>>>> update_tg_cfs_X() <---
>>>>
>>>>
>>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
>>>> hackbench in several different sched group levels on
>>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>>>
>>> IIRC, it was with hikey960 with cgroup v1
>>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
>>
>> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
E/F G/H), >45h uptime.
>>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>>>>
>>>>> dequeue_load_avg(cfs_rq, se);
>>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
>>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>>>>> + /* See update_tg_cfs_util() */
>>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
>>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
>>>>> +
>>>>
>>>> Maybe add a:
>>>>
>>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
>>>> with *_avg")
>>>
>>> I spent time thinking about adding a Fixes tag. There is no crash/warn
>>> so far, so should we propagate it back in LTS for better performance?
>>
>> Not sure I understand. What do you mean by 'should we propagate it back
>> in LTS'?
>
> Sorry, I had all stables in mind and not only LTS.
>
> Some of the changes done in PELT signal propagation, which replace
> subtracting util_sum by using util_avg * divider instead, are related
> to other problems with sched group hierarchy and
> throttling/unthrottling. I'm not 100% confident that using a Fixes tag
> to backport this to stables wouldn't require backporting more patches in
> other areas in order to not resurrect old problems. So I wonder if
> it's worth backporting this to stables.
OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
was the reason why I initially suggested to split the patch-set
differently. But you said that you saw the issue also when (1) is fixed.
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2022-01-07 15:21 UTC
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 05/01/2022 14:57, Vincent Guittot wrote:
> > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 04/01/2022 14:42, Vincent Guittot wrote:
> >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 22/12/2021 10:38, Vincent Guittot wrote:
> >>
> >> [...]
> >>
> >>>> I still wonder whether the regression only comes from the changes in
> >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> >>>> And could be fixed only by this part of the patch-set (A):
> >>>
> >>> I have been able to trigger the warning even with (A), though it took
> >>> much more time.
> >>> And I have been able to catch wrong situations (with additional
> >>> traces) in the 3 places A, B and C
> >>
> >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?
> >
> > Not only.
> > Also util_sum == 0 but util_avg != 0 in different places, although these
>
> Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.
>
> > situations didn't trigger the sched_warn because some other sync happened
> > before the periodic call of __update_blocked_fair;
> > or if nr_running == 1 and the task's util_avg/sum > cfs' util_avg/sum
> > just before removing the task.
>
> I see.
>
> >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> >>>> *cfs_rq)
> >>>>
> >>>> r = removed_load;
> >>>> sub_positive(&sa->load_avg, r);
> >>>> - sa->load_sum = sa->load_avg * divider;
> >>>> + sub_positive(&sa->load_sum, r * divider);
> >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> >>>>
> >>>> r = removed_util;
> >>>> sub_positive(&sa->util_avg, r);
> >>>> - sa->util_sum = sa->util_avg * divider;
> >>>> + sub_positive(&sa->util_sum, r * divider);
> >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >>>>
> >>>> r = removed_runnable;
> >>>> sub_positive(&sa->runnable_avg, r);
> >>>> - sa->runnable_sum = sa->runnable_avg * divider;
> >>>> + sub_positive(&sa->runnable_sum, r * divider);
> >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> >>>> + sa->runnable_avg * MIN_DIVIDER);
> >>>>
> >>>> i.e. w/o changing update_tg_cfs_X() (and
> >>>> detach_entity_load_avg()/dequeue_load_avg()).
> >>>>
> >>>> update_load_avg()
> >>>> update_cfs_rq_load_avg() <---
> >>>> propagate_entity_load_avg()
> >>>> update_tg_cfs_X() <---
> >>>>
> >>>>
> >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> >>>> hackbench in several different sched group levels on
> >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> >>>
> >>> IIRC, it was with hikey960 with cgroup v1
> >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
> >>
> >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
>
> Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
> E/F G/H), >45h uptime.
>
> >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>>>>
> >>>>> dequeue_load_avg(cfs_rq, se);
> >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> >>>>> + /* See update_tg_cfs_util() */
> >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> >>>>> +
> >>>>
> >>>> Maybe add a:
> >>>>
> >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> >>>> with *_avg")
> >>>
> >>> I spent time thinking about adding a Fixes tag. There is no crash/warn
> >>> so far, so should we propagate it back in LTS for better performance?
> >>
> >> Not sure I understand. What do you mean by 'should we propagate it back
> >> in LTS'?
> >
> > Sorry, I had all stables in mind and not only LTS.
> >
> > Some of the changes done in PELT signal propagation, which replace
> > subtracting util_sum by using util_avg * divider instead, are related
> > to other problems with sched group hierarchy and
> > throttling/unthrottling. I'm not 100% confident that using a Fixes tag
> > to backport this to stables wouldn't require backporting more patches in
> > other areas in order to not resurrect old problems. So I wonder if
> > it's worth backporting this to stables.
>
> OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
> was the reason why I initially suggested to split the patch-set
> differently. But you said that you saw the issue also when (1) is fixed.
Ok, I think that we were not speaking about the same setup. I wrongly
read that you were saying that
sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
was only needed in update_cfs_rq_load_avg() but not in the other places.
But what you said is that we only need the below to fix the perf
regression raised by Rick?
r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
The WARN that I mentioned in my previous email was about not adding
the max_t in all 3 places. I reran some tests today and I triggered the
WARN after a detach without the max_t line.
I can probably isolate the code above in a dedicated patch for the
regression raised by Rick and we could consider adding a Fixes tag; I
will run more tests with only this part during the weekend.
That being said, we need to stay consistent in all 3 places where we
move or propagate some *_avg. In particular, using "sa->util_sum =
sa->util_avg * divider" has the drawback of filtering the contribution
not already accounted for in util_avg and the impact is much larger
than expected. It means that although fixing only
update_cfs_rq_load_avg() seems enough for Rick's case, some other
cases could be impacted by the 2 other places and we need to fix
them now that we have a better view of the root cause.
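To put rough numbers on "the impact is much larger than expected", a
userspace sketch (all values made up) comparing the two syncs after
removing a task's util_avg contribution r:

#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX 47742
#define MIN_DIVIDER  (LOAD_AVG_MAX - 1024)

int main(void)
{
        uint32_t divider = LOAD_AVG_MAX - 512;  /* some period_contrib */
        uint64_t util_avg = 100, r = 40;
        /* 30000 of contribution accrued but not yet folded into util_avg */
        uint64_t util_sum = util_avg * divider + 30000;

        /* hard sync: snap to the lower bound, the 30000 is filtered out */
        uint64_t hard = (util_avg - r) * divider;

        /* delta propagation: remove only the task's part, keep the rest,
         * then apply the minimum-divider floor as in the patch */
        uint64_t prop = util_sum - r * divider;
        if (prop < (util_avg - r) * MIN_DIVIDER)
                prop = (util_avg - r) * MIN_DIVIDER;

        printf("hard sync: %llu, delta propagation: %llu\n",
               (unsigned long long)hard, (unsigned long long)prop);
        return 0;
}

The 30000 of not-yet-folded contribution survives delta propagation but is
silently filtered by the hard sync; with many detaches between two util_avg
updates those losses accumulate.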
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2022-01-11 7:54 UTC
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 05/01/2022 14:57, Vincent Guittot wrote:
> > > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>
> > >> On 04/01/2022 14:42, Vincent Guittot wrote:
> > >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>>>
> > >>>> On 22/12/2021 10:38, Vincent Guittot wrote:
> > >>
> > >> [...]
> > >>
> > >>>> I still wonder whether the regression only comes from the changes in
> > >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> > >>>> And could be fixed only by this part of the patch-set (A):
> > >>>
> > >>> I have been able to trigger the warning even with (A), though it took
> > >>> much more time.
> > >>> And I have been able to catch wrong situations (with additional
> > >>> traces) in the 3 places A, B and C
> > >>
> > >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?
> > >
> > > Not only.
> > > Also util_sum == 0 but util_avg != 0 in different places, although these
> >
> > Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.
> >
> > > situations didn't trigger the sched_warn because some other sync happened
> > > before the periodic call of __update_blocked_fair;
> > > or if nr_running == 1 and the task's util_avg/sum > cfs' util_avg/sum
> > > just before removing the task.
> >
> > I see.
> >
> > >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> > >>>> *cfs_rq)
> > >>>>
> > >>>> r = removed_load;
> > >>>> sub_positive(&sa->load_avg, r);
> > >>>> - sa->load_sum = sa->load_avg * divider;
> > >>>> + sub_positive(&sa->load_sum, r * divider);
> > >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> > >>>>
> > >>>> r = removed_util;
> > >>>> sub_positive(&sa->util_avg, r);
> > >>>> - sa->util_sum = sa->util_avg * divider;
> > >>>> + sub_positive(&sa->util_sum, r * divider);
> > >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> > >>>>
> > >>>> r = removed_runnable;
> > >>>> sub_positive(&sa->runnable_avg, r);
> > >>>> - sa->runnable_sum = sa->runnable_avg * divider;
> > >>>> + sub_positive(&sa->runnable_sum, r * divider);
> > >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> > >>>> + sa->runnable_avg * MIN_DIVIDER);
> > >>>>
> > >>>> i.e. w/o changing update_tg_cfs_X() (and
> > >>>> detach_entity_load_avg()/dequeue_load_avg()).
> > >>>>
> > >>>> update_load_avg()
> > >>>> update_cfs_rq_load_avg() <---
> > >>>> propagate_entity_load_avg()
> > >>>> update_tg_cfs_X() <---
> > >>>>
> > >>>>
> > >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> > >>>> hackbench in several different sched group levels on
> > >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> > >>>
> > >>> IIRC, it was with hikey960 with cgroup v1
> > >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
> > >>
> > >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
> >
> > Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
> > E/F G/H), >45h uptime.
> >
> > >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >>>>>
> > >>>>> dequeue_load_avg(cfs_rq, se);
> > >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > >>>>> + /* See update_tg_cfs_util() */
> > >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > >>>>> +
> > >>>>
> > >>>> Maybe add a:
> > >>>>
> > >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> > >>>> with *_avg")
> > >>>
> > >>> I spent time thinking about adding a Fixes tag. There is no crash/warn
> > >>> so far, so should we propagate it back in LTS for better performance?
> > >>
> > >> Not sure I understand. What do you mean by 'should we propagate it back
> > >> in LTS'?
> > >
> > > Sorry, I had all stables in mind and not only LTS.
> > >
> > > Some of the changes done in PELT signal propagation, which replace
> > > subtracting util_sum by using util_avg * divider instead, are related
> > > to other problems with sched group hierarchy and
> > > throttling/unthrottling. I'm not 100% confident that using a Fixes tag
> > > to backport this to stables wouldn't require backporting more patches in
> > > other areas in order to not resurrect old problems. So I wonder if
> > > it's worth backporting this to stables.
> >
> > OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
> > update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
> > was the reason why I initially suggested to split the patch-set
> > differently. But you said that you saw the issue also when (1) is fixed.
>
> Ok, I think that we were not speaking about the same setup. I wrongly
> read that you were saying that
> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> was only needed in update_cfs_rq_load_avg() but not in the other places.
>
> But what you said is that we only need the below to fix the perf
> regression raised by Rick?
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
The test with the code above didn't trigger any SCHED_WARN over the
weekend, so it's probably OK to make a dedicated patch for this with a
Fixes tag.
I'm going to prepare a v2.
>
> The WARN that I mentioned in my previous email was about not adding
> the max_t in all 3 places. I reran some tests today and I triggered the
> WARN after a detach without the max_t line.
>
> I can probably isolate the code above in a dedicated patch for the
> regression raised by Rick and we could consider adding a Fixes tag; I
> will run more tests with only this part during the weekend.
> That being said, we need to stay consistent in all 3 places where we
> move or propagate some *_avg. In particular, using "sa->util_sum =
> sa->util_avg * divider" has the drawback of filtering the contribution
> not already accounted for in util_avg and the impact is much larger
> than expected. It means that although fixing only
> update_cfs_rq_load_avg() seems enough for Rick's case, some other
> cases could be impacted by the 2 other places and we need to fix
> them now that we have a better view of the root cause.
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Dietmar Eggemann @ 2022-01-11 12:37 UTC
To: Vincent Guittot
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On 11/01/2022 08:54, Vincent Guittot wrote:
> On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>
>> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 05/01/2022 14:57, Vincent Guittot wrote:
>>>> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>
>>>>> On 04/01/2022 14:42, Vincent Guittot wrote:
>>>>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>>>
>>>>>>> On 22/12/2021 10:38, Vincent Guittot wrote:
[...]
>>>> Some of the changes done in PELT signal propagation, which replace
>>>> subtracting util_sum by using util_avg * divider instead, are related
>>>> to other problems with sched group hierarchy and
>>>> throttling/unthrottling. I'm not 100% confident that using a Fixes tag
>>>> to backport this to stables wouldn't require backporting more patches in
>>>> other areas in order to not resurrect old problems. So I wonder if
>>>> it's worth backporting this to stables.
>>>
>>> OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
>>> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
>>> was the reason why I initially suggested to split the patch-set
>>> differently. But you said that you saw the issue also when (1) is fixed.
>>
>> Ok, I think that we were not speaking about the same setup. I wrongly
>> read that you were saying that
>> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>> was only needed in update_cfs_rq_load_avg() but not in the other places.
>>
>> But what you said is that we only need the below to fix the perf
>> regression raised by Rick?
>> r = removed_util;
>> sub_positive(&sa->util_avg, r);
>> - sa->util_sum = sa->util_avg * divider;
>> + sub_positive(&sa->util_sum, r * divider);
>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> The test with the code above didn't trigger any SCHED_WARN over the
> weekend, so it's probably OK to make a dedicated patch for this with a
> Fixes tag.
> I'm going to prepare a v2.
Yes, `sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER)` is
needed for all 3 X = [load, runnable, util] in update_cfs_rq_load_avg()
to not hit the SCHED_WARN_ON() in cfs_rq_is_decayed().
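For reference, the check being discussed looks roughly like this: a
paraphrase of cfs_rq_is_decayed() from memory, with minimal stand-in types
so the sketch compiles standalone (the real SCHED_WARN_ON() warns rather
than aborting like assert()):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-ins; the real kernel types are much richer. */
struct sched_avg {
        uint64_t load_sum;
        uint32_t util_sum, runnable_sum;
        unsigned long load_avg, util_avg, runnable_avg;
};
struct cfs_rq {
        struct { unsigned long weight; } load;
        struct sched_avg avg;
};
#define SCHED_WARN_ON(x) assert(!(x))

static bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
        if (cfs_rq->load.weight)
                return false;
        if (cfs_rq->avg.load_sum || cfs_rq->avg.util_sum ||
            cfs_rq->avg.runnable_sum)
                return false;
        /*
         * _avg must be null when _sum are, since _avg = _sum / divider.
         * This is what fires when e.g. util_sum has underflowed to 0
         * while util_avg is still non-zero.
         */
        SCHED_WARN_ON(cfs_rq->avg.load_avg ||
                      cfs_rq->avg.util_avg ||
                      cfs_rq->avg.runnable_avg);
        return true;
}

int main(void)
{
        struct cfs_rq rq = { .avg = { .util_avg = 3 } }; /* inconsistent */

        return cfs_rq_is_decayed(&rq); /* trips the warning/assert */
}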
>> The WARN that I mentioned in my previous email was about not adding
>> the max_t in all 3 places. I reran some tests today and I triggered the
>> WARN after a detach without the max_t line.
>>
>> I can probably isolate the code above in a dedicated patch for the
>> regression raised by Rick and we could consider adding a Fixes tag; I
>> will run more tests with only this part during the weekend.
>> That being said, we need to stay consistent in all 3 places where we
>> move or propagate some *_avg. In particular, using "sa->util_sum =
>> sa->util_avg * divider" has the drawback of filtering the contribution
>> not already accounted for in util_avg and the impact is much larger
>> than expected. It means that although fixing only
>> update_cfs_rq_load_avg() seems enough for Rick's case, some other
>> cases could be impacted by the 2 other places and we need to fix
>> them now that we have a better view of the root cause.
Agreed.
* Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg
From: Vincent Guittot @ 2022-01-04 13:48 UTC
To: Dietmar Eggemann
Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
linux-kernel, rickyiu, odin, sachinp, naresh.kamboju
On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> s/util_sum with uti_avg/util_sum with util_avg
>
> [...]
>
> > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
>
> Shouldn't this be in pelt.h?
It is only used in fair.c, so I kept it local like some other defines in fair.c.
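(Context for MIN_DIVIDER: get_pelt_divider() returns
LOAD_AVG_MAX - 1024 + avg->period_contrib, and period_contrib always stays
in [0, 1023], so LOAD_AVG_MAX - 1024 is the smallest divider that can ever
be in use. That lower bound is what MIN_DIVIDER captures. A standalone
sketch mirroring the kernel helper of this era, not the verbatim source:)

#include <stdint.h>

#define LOAD_AVG_MAX 47742
#define MIN_DIVIDER  (LOAD_AVG_MAX - 1024)

/* Stand-in for the kernel struct; only the field used here. */
struct sched_avg {
        uint32_t period_contrib;        /* always in [0, 1023] */
};

static inline uint32_t get_pelt_divider(struct sched_avg *avg)
{
        return LOAD_AVG_MAX - 1024 + avg->period_contrib;
}

int main(void)
{
        struct sched_avg sa = { .period_contrib = 0 };

        return get_pelt_divider(&sa) == MIN_DIVIDER ? 0 : 1;
}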
>
> [...]
>
> > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> > */
> > divider = get_pelt_divider(&cfs_rq->avg);
> >
> > +
> > /* Set new sched_entity's utilization */
> > se->avg.util_avg = gcfs_rq->avg.util_avg;
> > - se->avg.util_sum = se->avg.util_avg * divider;
> > + new_sum = se->avg.util_avg * divider;
> > + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> > + se->avg.util_sum = new_sum;
> >
> > /* Update parent cfs_rq utilization */
> > - add_positive(&cfs_rq->avg.util_avg, delta);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> > + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> > +
> > + /*
> > + * Because of rounding, se->util_sum might end up being +1 more than
> > + * cfs->util_sum (XXX fix the rounding). Although this is not
> > + * a problem by itself, detaching a lot of tasks with the rounding
> > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> > + * become null whereas cfs->util_avg is not.
> > + * Check that util_sum is still above its lower bound for the new
> > + * util_avg. Given that period_contrib might have moved since the last
> > + * sync, we are only sure that util_sum must be above or equal to
> > + * util_avg * the minimum possible divider.
> > + */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > }
> >
>
> I still wonder whether the regression only comes from the changes in
> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> And could be fixed only by this part of the patch-set (A):
>
> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);
>
> i.e. w/o changing update_tg_cfs_X() (and
> detach_entity_load_avg()/dequeue_load_avg()).
>
> update_load_avg()
> update_cfs_rq_load_avg() <---
> propagate_entity_load_avg()
> update_tg_cfs_X() <---
>
>
> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> hackbench in several different sched group levels on
> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>
> Rick is probably in a position to test whether this would be sufficient
> to cure the CPU frequency regression.
>
> I can see that you want to use the same _avg/_sum sync in
> detach_entity_load_avg()/dequeue_load_avg() as in
> update_cfs_rq_load_avg(). (B)
>
> And finally in update_tg_cfs_X() as well plus down-propagating _sum
> independently from _avg. (C)
>
> So rather than splitting the patchset into X (util, runnable, load), the whole
> change might be easier to handle IMHO when split into (A), (B), (C)
> (obviously only in case (A) cures the regression).
>
> > static inline void
> > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > r = removed_util;
> > sub_positive(&sa->util_avg, r);
> > - sa->util_sum = sa->util_avg * divider;
> > + sub_positive(&sa->util_sum, r * divider);
> > + /* See update_tg_cfs_util() */
> > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >
> > r = removed_runnable;
> > sub_positive(&sa->runnable_avg, r);
> > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > dequeue_load_avg(cfs_rq, se);
> > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > + /* See update_tg_cfs_util() */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > +
>
> Maybe add a:
>
> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> with *_avg")
>
> [...]
>
> This max_t() should make sure that `_sum is always >= _avg *
> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>
> (1) update_cfs_rq_load_avg()
> (2) detach_entity_load_avg() and dequeue_load_avg()
> (3) update_tg_cfs_X()
>
> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> reason for this?
* [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg
From: Vincent Guittot @ 2021-12-22 9:38 UTC
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju, Vincent Guittot
Similarly to util_avg and util_sum, don't sync runnable_sum with the lower
bound of runnable_avg but only ensure that runnable_sum stays in the
correct range.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ac28f0f3645..b4c350715c16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3495,11 +3495,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+ u32 new_sum, divider;
/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;
/*
@@ -3510,11 +3510,16 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
/* Set new sched_entity's runnable */
se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
- se->avg.runnable_sum = se->avg.runnable_avg * divider;
+ new_sum = se->avg.runnable_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
+ se->avg.runnable_sum = new_sum;
/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
+ add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);
}
static inline void
@@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * MIN_DIVIDER);
/*
* removed_runnable is the unweighted version of removed_load so we
@@ -3791,12 +3799,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*/
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
- * See ___update_load_avg() for details.
- */
- u32 divider = get_pelt_divider(&cfs_rq->avg);
-
dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
@@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
cfs_rq->avg.util_avg * MIN_DIVIDER);
sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
--
2.17.1
* Re: [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg
2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot
@ 2022-01-04 11:47 ` Dietmar Eggemann
0 siblings, 0 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2022-01-04 11:47 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju
On 22/12/2021 10:38, Vincent Guittot wrote:
[...]
> @@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);
Maybe add a:
Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
[...]
> @@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> cfs_rq->avg.util_avg * MIN_DIVIDER);
>
> sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
> - cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
> + sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
> + cfs_rq->avg.runnable_avg * MIN_DIVIDER);
Maybe add a:
Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced with *_avg")
[...]
* [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg
2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot
2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot
2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot
@ 2021-12-22 9:38 ` Vincent Guittot
2022-01-04 11:47 ` Dietmar Eggemann
2022-01-04 11:46 ` [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Dietmar Eggemann
3 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2021-12-22 9:38 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju, Vincent Guittot
Similarly to util_avg and util_sum, don't sync load_sum with the lower
bound of load_avg; only ensure that load_sum stays within the correct range.
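One point worth noting, visible in the first hunk below: unlike the util
and runnable signals, a cfs_rq's load_sum aggregates the entity sums
scaled by weight, so the subtraction and the lower-bound clamp operate on
the weighted quantity. A rough sketch (illustration only, not part of the
patch):
/*
 * Illustration only. Per enqueue_load_avg(), cfs_rq load aggregates
 * weighted entity contributions:
 *
 *   cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 *
 * so dequeue_load_avg() removes the weighted term before clamping:
 *
 *   sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 *   cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
 *                                cfs_rq->avg.load_avg * MIN_DIVIDER);
 */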
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4c350715c16..488213d98770 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}
+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u32 divider = get_pelt_divider(&se->avg);
sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
+
}
#else
static inline void
@@ -3381,8 +3386,6 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}
-#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
-
/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
* propagate its contribution. The key to this propagation is the invariant
@@ -3525,9 +3528,10 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+ long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
unsigned long load_avg;
u64 load_sum = 0;
+ s64 delta_sum;
u32 divider;
if (!runnable_sum)
@@ -3554,7 +3558,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
* assuming all tasks are equally runnable.
*/
if (scale_load_down(gcfs_rq->load.weight)) {
- load_sum = div_s64(gcfs_rq->avg.load_sum,
+ load_sum = div_u64(gcfs_rq->avg.load_sum,
scale_load_down(gcfs_rq->load.weight));
}
@@ -3571,19 +3575,22 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
runnable_sum = max(runnable_sum, running_sum);
- load_sum = (s64)se_weight(se) * runnable_sum;
- load_avg = div_s64(load_sum, divider);
-
- se->avg.load_sum = runnable_sum;
+ load_sum = se_weight(se) * runnable_sum;
+ load_avg = div_u64(load_sum, divider);
- delta = load_avg - se->avg.load_avg;
- if (!delta)
+ delta_avg = load_avg - se->avg.load_avg;
+ if (!delta_avg)
return;
- se->avg.load_avg = load_avg;
+ delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
- add_positive(&cfs_rq->avg.load_avg, delta);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ se->avg.load_sum = runnable_sum;
+ se->avg.load_avg = load_avg;
+ add_positive(&cfs_rq->avg.load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.load_sum, delta_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
}
static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
r = removed_util;
sub_positive(&sa->util_avg, r);
--
2.17.1
* Re: [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg
2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot
@ 2022-01-04 11:47 ` Dietmar Eggemann
0 siblings, 0 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2022-01-04 11:47 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju
On 22/12/2021 10:38, Vincent Guittot wrote:
> Similarly to util_avg and util_sum, don't sync load_sum with the lower
> bound of load_avg; only ensure that load_sum stays within the correct range.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++----------------
> 1 file changed, 25 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4c350715c16..488213d98770 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
> }
>
> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
> +
> static inline void
> dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - u32 divider = get_pelt_divider(&se->avg);
> sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
> + sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
> + cfs_rq->avg.load_avg * MIN_DIVIDER);
> +
Maybe add a:
Fixes: ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after dequeue")
[...]
> static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
> @@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
Since this max_t() is used 9 times in this patch-set, maybe a macro in
pelt.h is better:
+/* Because of rounding, se->util_sum might end up being +1 more than
+ * cfs->util_sum (XXX fix the rounding). Although this is not
+ * a problem by itself, detaching a lot of tasks with the rounding
+ * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
+ * become null whereas cfs->util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
+#define enforce_lower_bound_on_pelt_sum(sa, var) do { \
+ (sa)->var##_sum = max_t(u32, \
+ (sa)->var##_sum, \
+ (sa)->var##_avg * MIN_DIVIDER); \
+} while (0)
This way, you could also add the comment from update_tg_cfs_util()
there, and there would no longer be any need to point to it from the
other places where it is used.
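For illustration, the update_cfs_rq_load_avg() hunk quoted above would
then collapse to something like this (sketch only, assuming the macro as
proposed):
	r = removed_load;
	sub_positive(&sa->load_avg, r);
	sub_positive(&sa->load_sum, r * divider);
	/* replaces the open-coded max_t() lower-bound check */
	enforce_lower_bound_on_pelt_sum(sa, load);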
[...]
* Re: [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg
2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot
` (2 preceding siblings ...)
2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot
@ 2022-01-04 11:46 ` Dietmar Eggemann
3 siblings, 0 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2022-01-04 11:46 UTC (permalink / raw)
To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
mgorman, bristot, linux-kernel, rickyiu, odin
Cc: sachinp, naresh.kamboju
On 22/12/2021 10:37, Vincent Guittot wrote:
IMHO you mean s/hardly/hard here?
> Rick reported performance regressions in bugzilla because of cpu
> frequency being lower than before:
> https://bugzilla.kernel.org/show_bug.cgi?id=215045
>
> He bisected the problem to:
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
>
> More details are available in the commit message of patch 1.
>
> This patchset reverts the commit above and adds several checks when
> propagating the changes in the hierarchy to make sure that we still have
> coherent util_avg and util_sum.
>
> Dietmar found a simple way to reproduce the WARN fixed by
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> by looping on hackbench in several different sched group levels.
>
> This patchset has been run on the reproducer with success, but it probably needs
> more tests by people who faced the WARN before.
>
> The changes done on util_sum have also been applied to runnable_sum and
> load_sum, which face the same rounding problem, although this has not
> been reflected in a measurable performance impact.
I think the overall idea here is that:
[add|sub]_positive(&sa->X_avg, Y); (`add` in update_tg_cfs_X())
sa->X_sum = sa->X_avg * divider;
with X in (util, runnable, load)
changes to:
[add|sub]_positive(&sa->X_avg, Y);
[add|sub]_positive(&sa->X_sum, Z);
sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER);
This change is done in:
(1) update_cfs_rq_load_avg()
(2) detach_entity_load_avg() and dequeue_load_avg()
(3) update_tg_cfs_X() (+ down-propagating _sum independently from _avg)
Prior to:
1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced w/ *_avg")
ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after dequeue")
(i.e. the commits which get fixed by this patchset):
sub_positive(&sa->X_avg, Y);
sub_positive(&sa->X_sum, Z);
was used in (1) and (2).
(3) already used sa->util_sum = sa->util_avg * divider before (since
95d685935a2e ("sched/pelt: Sync util/runnable_sum with PELT window when
propagating")).
[...]