*[PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg@ 2021-12-22 9:37 Vincent Guittot2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot ` (3 more replies) 0 siblings, 4 replies; 16+ messages in thread From: Vincent Guittot @ 2021-12-22 9:37 UTC (permalink / raw) To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju, Vincent Guittot Rick reported performance regressions in bugzilla because of cpu frequency being lower than before: https://bugzilla.kernel.org/show_bug.cgi?id=215045 He bisected the problem to: commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") More details are available in commit message of patch 1. This patchset reverts the commit above and adds several checks when propagating the changes in the hierarchy to make sure that we still have coherent util_avg and util_sum. Dietmar found a simple way to reproduce the WARN fixed by commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") by looping on hackbench in several different sched group levels. This patchset as run on the reproducer with success but it probably needs more tests by people who faced the WARN before. The changes done on util_sum have been also applied to runnable_sum and load_sum which faces the same rounding problem although this has not been reflected in measurable performance impact. Changes for v2: - fix wrong update of load_sum - move a change from patch 3 to patch 2 - update patch 3 commit message Vincent Guittot (3): sched/pelt: Don't sync hardly util_sum with uti_avg sched/pelt: Don't sync hardly runnable_sum with runnable_avg sched/pelt: Don't sync hardly load_sum with load_avg kernel/sched/fair.c | 113 +++++++++++++++++++++++++++++--------------- 1 file changed, 75 insertions(+), 38 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread

*[PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot@ 2021-12-22 9:38 ` Vincent Guittot2022-01-04 11:47 ` Dietmar Eggemann 2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot ` (2 subsequent siblings) 3 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2021-12-22 9:38 UTC (permalink / raw) To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju, Vincent Guittot Rick reported performance regressions in bugzilla because of cpu frequency being lower than before: https://bugzilla.kernel.org/show_bug.cgi?id=215045 He bisected the problem to: commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") This commit forces util_sum to be synced with the new util_avg after removing the contribution of a task and before the next periodic sync. By doing so util_sum is rounded to its lower bound and might lost up to LOAD_AVG_MAX-1 of accumulated contribution which has not yet been reflected in util_avg. Instead of always setting util_sum to the low bound of util_avg, which can significantly lower the utilization of root cfs_rq after propagating the change down into the hierarchy, we revert the commit and propagate the difference in util_sum. In addition, we also check that cfs's util_sum always stays above the lower bound for a given util_avg as it has been observed that sched_entity's util_sum is sometimes above cfs one. Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++--------- 1 file changed, 32 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 095b0aa378df..9ac28f0f3645 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se, se->avg.last_update_time = n_last_update_time; } +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) /* * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to @@ -3449,15 +3450,14 @@ void set_task_rq_fair(struct sched_entity *se, * XXX: only do this for the part of runnable > running ? * */ - static inline void update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) { - long delta = gcfs_rq->avg.util_avg - se->avg.util_avg; - u32 divider; + long delta_sum, delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg; + u32 new_sum, divider; /* Nothing to update */ - if (!delta) + if (!delta_avg) return; /* @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq */ divider = get_pelt_divider(&cfs_rq->avg); + /* Set new sched_entity's utilization */ se->avg.util_avg = gcfs_rq->avg.util_avg; - se->avg.util_sum = se->avg.util_avg * divider; + new_sum = se->avg.util_avg * divider; + delta_sum = (long)new_sum - (long)se->avg.util_sum; + se->avg.util_sum = new_sum; /* Update parent cfs_rq utilization */ - add_positive(&cfs_rq->avg.util_avg, delta); - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; + add_positive(&cfs_rq->avg.util_avg, delta_avg); + add_positive(&cfs_rq->avg.util_sum, delta_sum); + + /* + * Because of rounding, se->util_sum might ends up being +1 more than + * cfs->util_sum (XXX fix the rounding). Although this is not + * a problem by itself, detaching a lot of tasks with the rounding + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum + * becoming null whereas cfs_util_avg is not. + * Check that util_sum is still above its lower bound for the new + * util_avg. Given that period_contrib might have moved since the last + * sync, we are only sure that util_sum must be above or equal to + * util_avg * minimum possible divider + */ + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, + cfs_rq->avg.util_avg * MIN_DIVIDER); } static inline void @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) r = removed_util; sub_positive(&sa->util_avg, r); - sa->util_sum = sa->util_avg * divider; + sub_positive(&sa->util_sum, r * divider); + /* See update_tg_cfs_util() */ + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); r = removed_runnable; sub_positive(&sa->runnable_avg, r); @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s dequeue_load_avg(cfs_rq, se); sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); + /* See update_tg_cfs_util() */ + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, + cfs_rq->avg.util_avg * MIN_DIVIDER); + sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider; -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread

*[PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot 2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot@ 2021-12-22 9:38 ` Vincent Guittot2022-01-04 11:47 ` Dietmar Eggemann 2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot 2022-01-04 11:46 ` [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Dietmar Eggemann 3 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2021-12-22 9:38 UTC (permalink / raw) To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju, Vincent Guittot Similarly to util_avg and util_sum, don't sync runnable_sum with the low bound of runnable_avg but only ensure that runnable_sum stays in the correct range. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9ac28f0f3645..b4c350715c16 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3495,11 +3495,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq static inline void update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) { - long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg; - u32 divider; + long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg; + u32 new_sum, divider; /* Nothing to update */ - if (!delta) + if (!delta_avg) return; /* @@ -3510,11 +3510,16 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf /* Set new sched_entity's runnable */ se->avg.runnable_avg = gcfs_rq->avg.runnable_avg; - se->avg.runnable_sum = se->avg.runnable_avg * divider; + new_sum = se->avg.runnable_avg * divider; + delta_sum = (long)new_sum - (long)se->avg.runnable_sum; + se->avg.runnable_sum = new_sum; /* Update parent cfs_rq runnable */ - add_positive(&cfs_rq->avg.runnable_avg, delta); - cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider; + add_positive(&cfs_rq->avg.runnable_avg, delta_avg); + add_positive(&cfs_rq->avg.runnable_sum, delta_sum); + /* See update_tg_cfs_util() */ + cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, + cfs_rq->avg.runnable_avg * MIN_DIVIDER); } static inline void @@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) r = removed_runnable; sub_positive(&sa->runnable_avg, r); - sa->runnable_sum = sa->runnable_avg * divider; + sub_positive(&sa->runnable_sum, r * divider); + /* See update_tg_cfs_util() */ + sa->runnable_sum = max_t(u32, sa->runnable_sum, + sa->runnable_avg * MIN_DIVIDER); /* * removed_runnable is the unweighted version of removed_load so we @@ -3791,12 +3799,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s */ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - /* - * cfs_rq->avg.period_contrib can be used for both cfs_rq and se. - * See ___update_load_avg() for details. - */ - u32 divider = get_pelt_divider(&cfs_rq->avg); - dequeue_load_avg(cfs_rq, se); sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); @@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s cfs_rq->avg.util_avg * MIN_DIVIDER); sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); - cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider; + sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum); + /* See update_tg_cfs_util() */ + cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, + cfs_rq->avg.runnable_avg * MIN_DIVIDER); add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread

*[PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot 2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot 2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot@ 2021-12-22 9:38 ` Vincent Guittot2022-01-04 11:47 ` Dietmar Eggemann 2022-01-04 11:46 ` [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Dietmar Eggemann 3 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2021-12-22 9:38 UTC (permalink / raw) To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju, Vincent Guittot Similarly to util_avg and util_sum, don't sync load_sum with the low bound of load_avg but only ensure that load_sum stays in the correct range. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 41 +++++++++++++++++++++++++---------------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b4c350715c16..488213d98770 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; } +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) + static inline void dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - u32 divider = get_pelt_divider(&se->avg); sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg); - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider; + sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum); + /* See update_tg_cfs_util() */ + cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, + cfs_rq->avg.load_avg * MIN_DIVIDER); + } #else static inline void @@ -3381,8 +3386,6 @@ void set_task_rq_fair(struct sched_entity *se, se->avg.last_update_time = n_last_update_time; } -#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) - /* * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to * propagate its contribution. The key to this propagation is the invariant @@ -3525,9 +3528,10 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf static inline void update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) { - long delta, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum; + long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum; unsigned long load_avg; u64 load_sum = 0; + s64 delta_sum; u32 divider; if (!runnable_sum) @@ -3554,7 +3558,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq * assuming all tasks are equally runnable. */ if (scale_load_down(gcfs_rq->load.weight)) { - load_sum = div_s64(gcfs_rq->avg.load_sum, + load_sum = div_u64(gcfs_rq->avg.load_sum, scale_load_down(gcfs_rq->load.weight)); } @@ -3571,19 +3575,22 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; runnable_sum = max(runnable_sum, running_sum); - load_sum = (s64)se_weight(se) * runnable_sum; - load_avg = div_s64(load_sum, divider); - - se->avg.load_sum = runnable_sum; + load_sum = se_weight(se) * runnable_sum; + load_avg = div_u64(load_sum, divider); - delta = load_avg - se->avg.load_avg; - if (!delta) + delta_avg = load_avg - se->avg.load_avg; + if (!delta_avg) return; - se->avg.load_avg = load_avg; + delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum; - add_positive(&cfs_rq->avg.load_avg, delta); - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider; + se->avg.load_sum = runnable_sum; + se->avg.load_avg = load_avg; + add_positive(&cfs_rq->avg.load_avg, delta_avg); + add_positive(&cfs_rq->avg.load_sum, delta_sum); + /* See update_tg_cfs_util() */ + cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, + cfs_rq->avg.load_avg * MIN_DIVIDER); } static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) @@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) r = removed_load; sub_positive(&sa->load_avg, r); - sa->load_sum = sa->load_avg * divider; + sub_positive(&sa->load_sum, r * divider); + /* See update_tg_cfs_util() */ + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); r = removed_util; sub_positive(&sa->util_avg, r); -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread

*Re: [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot ` (2 preceding siblings ...) 2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot@ 2022-01-04 11:46 ` Dietmar Eggemann3 siblings, 0 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-04 11:46 UTC (permalink / raw) To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju On 22/12/2021 10:37, Vincent Guittot wrote: INHO you mean s/hardly/hard here? > Rick reported performance regressions in bugzilla because of cpu > frequency being lower than before: > https://bugzilla.kernel.org/show_bug.cgi?id=215045 > > He bisected the problem to: > commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") > > More details are available in commit message of patch 1. > > This patchset reverts the commit above and adds several checks when > propagating the changes in the hierarchy to make sure that we still have > coherent util_avg and util_sum. > > Dietmar found a simple way to reproduce the WARN fixed by > commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") > by looping on hackbench in several different sched group levels. > > This patchset as run on the reproducer with success but it probably needs > more tests by people who faced the WARN before. > > The changes done on util_sum have been also applied to runnable_sum and > load_sum which faces the same rounding problem although this has not been > reflected in measurable performance impact. I think the overall idea here is that: [add\|sub]_positive(&sa->X_avg, Y); (`add` in update_tg_cfs_X()) sa->FOO_sum = sa->X_avg * divider; with X equal (util, runnable, load) changes to: [add\|sub]_positive(&sa->X_avg, Y); [add\|sub]_positive(&sa->X_sum, Z); sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER); This change is done in: (1) update_cfs_rq_load_avg() (2) detach_entity_load_avg() and dequeue_load_avg() (3) update_tg_cfs_X() (+ down-propagating _sum independently from _avg) Prior to: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced w/ *_avg") ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after dequeue") (i.e. the commits which get fixed by this patchset): sub_positive(&sa->X_avg, Y); sub_positive(&sa->X_sum, Z); was used in (1) and (2). (3) used sa->util_sum = sa->util_avg * divider already before (Since 95d685935a2e ("sched/pelt: Sync util/runnable_sum with PELT window when propagating"). [...] ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot@ 2022-01-04 11:47 ` Dietmar Eggemann2022-01-04 13:42 ` Vincent Guittot 2022-01-04 13:48 ` Vincent Guittot 0 siblings, 2 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-04 11:47 UTC (permalink / raw) To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju On 22/12/2021 10:38, Vincent Guittot wrote: s/util_sum with uti_avg/util_sum with util_avg [...] > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) Shouldn't this be in pelt.h? [...] > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq > */ > divider = get_pelt_divider(&cfs_rq->avg); > > + > /* Set new sched_entity's utilization */ > se->avg.util_avg = gcfs_rq->avg.util_avg; > - se->avg.util_sum = se->avg.util_avg * divider; > + new_sum = se->avg.util_avg * divider; > + delta_sum = (long)new_sum - (long)se->avg.util_sum; > + se->avg.util_sum = new_sum; > > /* Update parent cfs_rq utilization */ > - add_positive(&cfs_rq->avg.util_avg, delta); > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > + add_positive(&cfs_rq->avg.util_avg, delta_avg); > + add_positive(&cfs_rq->avg.util_sum, delta_sum); > + > + /* > + * Because of rounding, se->util_sum might ends up being +1 more than > + * cfs->util_sum (XXX fix the rounding). Although this is not > + * a problem by itself, detaching a lot of tasks with the rounding > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum > + * becoming null whereas cfs_util_avg is not. > + * Check that util_sum is still above its lower bound for the new > + * util_avg. Given that period_contrib might have moved since the last > + * sync, we are only sure that util_sum must be above or equal to > + * util_avg * minimum possible divider > + */ > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > + cfs_rq->avg.util_avg * MIN_DIVIDER); > } > I still wonder whether the regression only comes from the changes in update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. And could be fixed only by this part of the patch-set (A): @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) r = removed_load; sub_positive(&sa->load_avg, r); - sa->load_sum = sa->load_avg * divider; + sub_positive(&sa->load_sum, r * divider); + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); r = removed_util; sub_positive(&sa->util_avg, r); - sa->util_sum = sa->util_avg * divider; + sub_positive(&sa->util_sum, r * divider); + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); r = removed_runnable; sub_positive(&sa->runnable_avg, r); - sa->runnable_sum = sa->runnable_avg * divider; + sub_positive(&sa->runnable_sum, r * divider); + sa->runnable_sum = max_t(u32, sa->runnable_sum, + sa->runnable_avg * MIN_DIVIDER); i.e. w/o changing update_tg_cfs_X() (and detach_entity_load_avg()/dequeue_load_avg()). update_load_avg() update_cfs_rq_load_avg() <--- propagate_entity_load_avg() update_tg_cfs_X() <--- I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on hackbench in several different sched group levels on [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. Rick is probably in a position to test whether this would be sufficient to cure the CPU frequency regression. I can see that you want to use the same _avg/_sum sync in detach_entity_load_avg()/dequeue_load_avg() as in update_cfs_rq_load_avg(). (B) And finally in update_tg_cfs_X() as well plus down-propagating _sum independently from _avg. (C) So rather splitting the patchset into X (util, runnable, load) the whole change might be easier to handle IMHO when split into (A), (B), (C) (obviously only in case (A) cures the regression). > static inline void > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > r = removed_util; > sub_positive(&sa->util_avg, r); > - sa->util_sum = sa->util_avg * divider; > + sub_positive(&sa->util_sum, r * divider); > + /* See update_tg_cfs_util() */ > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > r = removed_runnable; > sub_positive(&sa->runnable_avg, r); > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > dequeue_load_avg(cfs_rq, se); > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > + /* See update_tg_cfs_util() */ > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > + cfs_rq->avg.util_avg * MIN_DIVIDER); > + Maybe add a: Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced with *_avg") [...] This max_t() should make sure that `_sum is always >= _avg * MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in (1) update_cfs_rq_load_avg() (2) detach_entity_load_avg() and dequeue_load_avg() (3) update_tg_cfs_X() but not in attach_entity_load_avg(), enqueue_load_avg(). What's the reason for this? ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot@ 2022-01-04 11:47 ` Dietmar Eggemann0 siblings, 0 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-04 11:47 UTC (permalink / raw) To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju On 22/12/2021 10:38, Vincent Guittot wrote: [...] > @@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > r = removed_runnable; > sub_positive(&sa->runnable_avg, r); > - sa->runnable_sum = sa->runnable_avg * divider; > + sub_positive(&sa->runnable_sum, r * divider); > + /* See update_tg_cfs_util() */ > + sa->runnable_sum = max_t(u32, sa->runnable_sum, > + sa->runnable_avg * MIN_DIVIDER); Maybe add a: Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") [...] > @@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > cfs_rq->avg.util_avg * MIN_DIVIDER); > > sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); > - cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider; > + sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum); > + /* See update_tg_cfs_util() */ > + cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum, > + cfs_rq->avg.runnable_avg * MIN_DIVIDER); Maybe add a: Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced with *_avg") [...] ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot@ 2022-01-04 11:47 ` Dietmar Eggemann0 siblings, 0 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-04 11:47 UTC (permalink / raw) To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin Cc: sachinp, naresh.kamboju On 22/12/2021 10:38, Vincent Guittot wrote: > Similarly to util_avg and util_sum, don't sync load_sum with the low > bound of load_avg but only ensure that load_sum stays in the correct range. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > kernel/sched/fair.c | 41 +++++++++++++++++++++++++---------------- > 1 file changed, 25 insertions(+), 16 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index b4c350715c16..488213d98770 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) > cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; > } > > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) > + > static inline void > dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > - u32 divider = get_pelt_divider(&se->avg); > sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg); > - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider; > + sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum); > + /* See update_tg_cfs_util() */ > + cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum, > + cfs_rq->avg.load_avg * MIN_DIVIDER); > + Maybe add a: Fixes: ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after dequeue") [...] > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) > @@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > r = removed_load; > sub_positive(&sa->load_avg, r); > - sa->load_sum = sa->load_avg * divider; > + sub_positive(&sa->load_sum, r * divider); > + /* See update_tg_cfs_util() */ > + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); Since this max_t() is used 9 times in this patch-set, maybe a macro in pelt.h is better: +/* Because of rounding, se->util_sum might ends up being +1 more than + * cfs->util_sum (XXX fix the rounding). Although this is not + * a problem by itself, detaching a lot of tasks with the rounding + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum + * becoming null whereas cfs_util_avg is not. + * Check that util_sum is still above its lower bound for the new + * util_avg. Given that period_contrib might have moved since the last + * sync, we are only sure that util_sum must be above or equal to + * util_avg * minimum possible divider + */ +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) + +#define enforce_lower_bound_on_pelt_sum(sa, var) do { \ + (sa)->var##_sum = max_t(u32, \ + (sa)->var##_sum, \ + (sa)->var##_avg * MIN_DIVIDER); \ +} while (0) This way, you could also add the comment from update_tg_cfs_util() there. And there would be no need anymore to point to it from the other places where it is used. [...] ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-04 11:47 ` Dietmar Eggemann@ 2022-01-04 13:42 ` Vincent Guittot2022-01-05 13:15 ` Dietmar Eggemann 2022-01-04 13:48 ` Vincent Guittot 1 sibling, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2022-01-04 13:42 UTC (permalink / raw) To: Dietmar Eggemann Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 22/12/2021 10:38, Vincent Guittot wrote: > > s/util_sum with uti_avg/util_sum with util_avg yes > > [...] > > > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) > > Shouldn't this be in pelt.h? > > [...] > > > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq > > */ > > divider = get_pelt_divider(&cfs_rq->avg); > > > > + > > /* Set new sched_entity's utilization */ > > se->avg.util_avg = gcfs_rq->avg.util_avg; > > - se->avg.util_sum = se->avg.util_avg * divider; > > + new_sum = se->avg.util_avg * divider; > > + delta_sum = (long)new_sum - (long)se->avg.util_sum; > > + se->avg.util_sum = new_sum; > > > > /* Update parent cfs_rq utilization */ > > - add_positive(&cfs_rq->avg.util_avg, delta); > > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > > + add_positive(&cfs_rq->avg.util_avg, delta_avg); > > + add_positive(&cfs_rq->avg.util_sum, delta_sum); > > + > > + /* > > + * Because of rounding, se->util_sum might ends up being +1 more than > > + * cfs->util_sum (XXX fix the rounding). Although this is not > > + * a problem by itself, detaching a lot of tasks with the rounding > > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum > > + * becoming null whereas cfs_util_avg is not. > > + * Check that util_sum is still above its lower bound for the new > > + * util_avg. Given that period_contrib might have moved since the last > > + * sync, we are only sure that util_sum must be above or equal to > > + * util_avg * minimum possible divider > > + */ > > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > > + cfs_rq->avg.util_avg * MIN_DIVIDER); > > } > > > > I still wonder whether the regression only comes from the changes in > update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. > And could be fixed only by this part of the patch-set (A): I have been able to trigger the warning even with (A) though It took much more time. And I have been able to catch wrong situations (with additional traces) in the 3 places A, B and C > > @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq > *cfs_rq) > > r = removed_load; > sub_positive(&sa->load_avg, r); > - sa->load_sum = sa->load_avg * divider; > + sub_positive(&sa->load_sum, r * divider); > + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); > > r = removed_util; > sub_positive(&sa->util_avg, r); > - sa->util_sum = sa->util_avg * divider; > + sub_positive(&sa->util_sum, r * divider); > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > r = removed_runnable; > sub_positive(&sa->runnable_avg, r); > - sa->runnable_sum = sa->runnable_avg * divider; > + sub_positive(&sa->runnable_sum, r * divider); > + sa->runnable_sum = max_t(u32, sa->runnable_sum, > + sa->runnable_avg * MIN_DIVIDER); > > i.e. w/o changing update_tg_cfs_X() (and > detach_entity_load_avg()/dequeue_load_avg()). > > update_load_avg() > update_cfs_rq_load_avg() <--- > propagate_entity_load_avg() > update_tg_cfs_X() <--- > > > I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on > hackbench in several different sched group levels on > [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. IIRC, it was with hikey960 with cgroup v1 As a side note, I never trigger the problem with dragonboard845 and cgroup v2 > > Rick is probably in a position to test whether this would be sufficient > to cure the CPU frequency regression. > > I can see that you want to use the same _avg/_sum sync in > detach_entity_load_avg()/dequeue_load_avg() as in > update_cfs_rq_load_avg(). (B) > > And finally in update_tg_cfs_X() as well plus down-propagating _sum > independently from _avg. (C) > > So rather splitting the patchset into X (util, runnable, load) the whole > change might be easier to handle IMHO when split into (A), (B), (C) > (obviously only in case (A) cures the regression). > > > static inline void > > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > r = removed_util; > > sub_positive(&sa->util_avg, r); > > - sa->util_sum = sa->util_avg * divider; > > + sub_positive(&sa->util_sum, r * divider); > > + /* See update_tg_cfs_util() */ > > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > > > r = removed_runnable; > > sub_positive(&sa->runnable_avg, r); > > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > dequeue_load_avg(cfs_rq, se); > > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > > + /* See update_tg_cfs_util() */ > > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > > + cfs_rq->avg.util_avg * MIN_DIVIDER); > > + > > Maybe add a: > > Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced > with *_avg") I spent time thinking about adding fixes tag. There is no crash/warn so far so should we propagate it back in LTS for better performance ? > > [...] > > This max_t() should make sure that `_sum is always >= _avg * > MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in > > (1) update_cfs_rq_load_avg() > (2) detach_entity_load_avg() and dequeue_load_avg() > (3) update_tg_cfs_X() > > but not in attach_entity_load_avg(), enqueue_load_avg(). What's the > reason for this? Main reason is that I have never seen the problem. Then, the problem comes from subtracting task's value whereas here we always add positive value ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-04 11:47 ` Dietmar Eggemann 2022-01-04 13:42 ` Vincent Guittot@ 2022-01-04 13:48 ` Vincent Guittot1 sibling, 0 replies; 16+ messages in thread From: Vincent Guittot @ 2022-01-04 13:48 UTC (permalink / raw) To: Dietmar Eggemann Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 22/12/2021 10:38, Vincent Guittot wrote: > > s/util_sum with uti_avg/util_sum with util_avg > > [...] > > > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024) > > Shouldn't this be in pelt.h? It is only used in fair.c so I kept it local like some other defines in fair.c > > [...] > > > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq > > */ > > divider = get_pelt_divider(&cfs_rq->avg); > > > > + > > /* Set new sched_entity's utilization */ > > se->avg.util_avg = gcfs_rq->avg.util_avg; > > - se->avg.util_sum = se->avg.util_avg * divider; > > + new_sum = se->avg.util_avg * divider; > > + delta_sum = (long)new_sum - (long)se->avg.util_sum; > > + se->avg.util_sum = new_sum; > > > > /* Update parent cfs_rq utilization */ > > - add_positive(&cfs_rq->avg.util_avg, delta); > > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > > + add_positive(&cfs_rq->avg.util_avg, delta_avg); > > + add_positive(&cfs_rq->avg.util_sum, delta_sum); > > + > > + /* > > + * Because of rounding, se->util_sum might ends up being +1 more than > > + * cfs->util_sum (XXX fix the rounding). Although this is not > > + * a problem by itself, detaching a lot of tasks with the rounding > > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum > > + * becoming null whereas cfs_util_avg is not. > > + * Check that util_sum is still above its lower bound for the new > > + * util_avg. Given that period_contrib might have moved since the last > > + * sync, we are only sure that util_sum must be above or equal to > > + * util_avg * minimum possible divider > > + */ > > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > > + cfs_rq->avg.util_avg * MIN_DIVIDER); > > } > > > > I still wonder whether the regression only comes from the changes in > update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. > And could be fixed only by this part of the patch-set (A): > > @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq > *cfs_rq) > > r = removed_load; > sub_positive(&sa->load_avg, r); > - sa->load_sum = sa->load_avg * divider; > + sub_positive(&sa->load_sum, r * divider); > + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); > > r = removed_util; > sub_positive(&sa->util_avg, r); > - sa->util_sum = sa->util_avg * divider; > + sub_positive(&sa->util_sum, r * divider); > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > r = removed_runnable; > sub_positive(&sa->runnable_avg, r); > - sa->runnable_sum = sa->runnable_avg * divider; > + sub_positive(&sa->runnable_sum, r * divider); > + sa->runnable_sum = max_t(u32, sa->runnable_sum, > + sa->runnable_avg * MIN_DIVIDER); > > i.e. w/o changing update_tg_cfs_X() (and > detach_entity_load_avg()/dequeue_load_avg()). > > update_load_avg() > update_cfs_rq_load_avg() <--- > propagate_entity_load_avg() > update_tg_cfs_X() <--- > > > I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on > hackbench in several different sched group levels on > [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. > > Rick is probably in a position to test whether this would be sufficient > to cure the CPU frequency regression. > > I can see that you want to use the same _avg/_sum sync in > detach_entity_load_avg()/dequeue_load_avg() as in > update_cfs_rq_load_avg(). (B) > > And finally in update_tg_cfs_X() as well plus down-propagating _sum > independently from _avg. (C) > > So rather splitting the patchset into X (util, runnable, load) the whole > change might be easier to handle IMHO when split into (A), (B), (C) > (obviously only in case (A) cures the regression). > > > static inline void > > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > r = removed_util; > > sub_positive(&sa->util_avg, r); > > - sa->util_sum = sa->util_avg * divider; > > + sub_positive(&sa->util_sum, r * divider); > > + /* See update_tg_cfs_util() */ > > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > > > r = removed_runnable; > > sub_positive(&sa->runnable_avg, r); > > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > dequeue_load_avg(cfs_rq, se); > > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > > + /* See update_tg_cfs_util() */ > > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > > + cfs_rq->avg.util_avg * MIN_DIVIDER); > > + > > Maybe add a: > > Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced > with *_avg") > > [...] > > This max_t() should make sure that `_sum is always >= _avg * > MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in > > (1) update_cfs_rq_load_avg() > (2) detach_entity_load_avg() and dequeue_load_avg() > (3) update_tg_cfs_X() > > but not in attach_entity_load_avg(), enqueue_load_avg(). What's the > reason for this? ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-04 13:42 ` Vincent Guittot@ 2022-01-05 13:15 ` Dietmar Eggemann2022-01-05 13:57 ` Vincent Guittot 0 siblings, 1 reply; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-05 13:15 UTC (permalink / raw) To: Vincent Guittot Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On 04/01/2022 14:42, Vincent Guittot wrote: > On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 22/12/2021 10:38, Vincent Guittot wrote: [...] >> I still wonder whether the regression only comes from the changes in >> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. >> And could be fixed only by this part of the patch-set (A): > > I have been able to trigger the warning even with (A) though It took > much more time. > And I have been able to catch wrong situations (with additional > traces) in the 3 places A, B and C OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ? >> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq >> *cfs_rq) >> >> r = removed_load; >> sub_positive(&sa->load_avg, r); >> - sa->load_sum = sa->load_avg * divider; >> + sub_positive(&sa->load_sum, r * divider); >> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); >> >> r = removed_util; >> sub_positive(&sa->util_avg, r); >> - sa->util_sum = sa->util_avg * divider; >> + sub_positive(&sa->util_sum, r * divider); >> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); >> >> r = removed_runnable; >> sub_positive(&sa->runnable_avg, r); >> - sa->runnable_sum = sa->runnable_avg * divider; >> + sub_positive(&sa->runnable_sum, r * divider); >> + sa->runnable_sum = max_t(u32, sa->runnable_sum, >> + sa->runnable_avg * MIN_DIVIDER); >> >> i.e. w/o changing update_tg_cfs_X() (and >> detach_entity_load_avg()/dequeue_load_avg()). >> >> update_load_avg() >> update_cfs_rq_load_avg() <--- >> propagate_entity_load_avg() >> update_tg_cfs_X() <--- >> >> >> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on >> hackbench in several different sched group levels on >> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. > > IIRC, it was with hikey960 with cgroup v1 > As a side note, I never trigger the problem with dragonboard845 and cgroup v2 OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it. [...] >>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s >>> >>> dequeue_load_avg(cfs_rq, se); >>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); >>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; >>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); >>> + /* See update_tg_cfs_util() */ >>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, >>> + cfs_rq->avg.util_avg * MIN_DIVIDER); >>> + >> >> Maybe add a: >> >> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced >> with *_avg") > > I spent time thinking about adding fixes tag. There is no crash/warn > so far so should we propagate it back in LTS for better performance ? Not sure I understand. What do you mean by 'should we propagate it back in LTS'? [...] >> This max_t() should make sure that `_sum is always >= _avg * >> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in >> >> (1) update_cfs_rq_load_avg() >> (2) detach_entity_load_avg() and dequeue_load_avg() >> (3) update_tg_cfs_X() >> >> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the >> reason for this? > > Main reason is that I have never seen the problem. > Then, the problem comes from subtracting task's value whereas here we > always add positive value OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values. ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-05 13:15 ` Dietmar Eggemann@ 2022-01-05 13:57 ` Vincent Guittot2022-01-07 11:43 ` Dietmar Eggemann 0 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2022-01-05 13:57 UTC (permalink / raw) To: Dietmar Eggemann Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 04/01/2022 14:42, Vincent Guittot wrote: > > On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >> > >> On 22/12/2021 10:38, Vincent Guittot wrote: > > [...] > > >> I still wonder whether the regression only comes from the changes in > >> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. > >> And could be fixed only by this part of the patch-set (A): > > > > I have been able to trigger the warning even with (A) though It took > > much more time. > > And I have been able to catch wrong situations (with additional > > traces) in the 3 places A, B and C > > OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ? not only. also util_sum == 0 but util_avg !=0 in different places although these situation didn't trigger sched_warn because some other sync happened before the periodic call of __update_blocked_fair or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum just before removing the task > > >> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq > >> *cfs_rq) > >> > >> r = removed_load; > >> sub_positive(&sa->load_avg, r); > >> - sa->load_sum = sa->load_avg * divider; > >> + sub_positive(&sa->load_sum, r * divider); > >> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); > >> > >> r = removed_util; > >> sub_positive(&sa->util_avg, r); > >> - sa->util_sum = sa->util_avg * divider; > >> + sub_positive(&sa->util_sum, r * divider); > >> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > >> > >> r = removed_runnable; > >> sub_positive(&sa->runnable_avg, r); > >> - sa->runnable_sum = sa->runnable_avg * divider; > >> + sub_positive(&sa->runnable_sum, r * divider); > >> + sa->runnable_sum = max_t(u32, sa->runnable_sum, > >> + sa->runnable_avg * MIN_DIVIDER); > >> > >> i.e. w/o changing update_tg_cfs_X() (and > >> detach_entity_load_avg()/dequeue_load_avg()). > >> > >> update_load_avg() > >> update_cfs_rq_load_avg() <--- > >> propagate_entity_load_avg() > >> update_tg_cfs_X() <--- > >> > >> > >> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on > >> hackbench in several different sched group levels on > >> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. > > > > IIRC, it was with hikey960 with cgroup v1 > > As a side note, I never trigger the problem with dragonboard845 and cgroup v2 > > OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it. > > [...] > > >>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > >>> > >>> dequeue_load_avg(cfs_rq, se); > >>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > >>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > >>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > >>> + /* See update_tg_cfs_util() */ > >>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > >>> + cfs_rq->avg.util_avg * MIN_DIVIDER); > >>> + > >> > >> Maybe add a: > >> > >> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced > >> with *_avg") > > > > I spent time thinking about adding fixes tag. There is no crash/warn > > so far so should we propagate it back in LTS for better performance ? > > Not sure I understand. What do you mean by 'should we propagate it back > in LTS'? Sorry I had any stables in mind and not only LTS. Some of the changes done in PELT signal propagation that replace subtracting util_sum by using util_avg * divider instead, are related to other problems with sched group hierarchy and throttling/unthrottling. I'm not 100% confident that using fixes tag to backport this on stables doesn't need to backport more patches on other areas in order to not resurrect old problems. So I wonder if it's worth backporting this on stables > > [...] > > >> This max_t() should make sure that `_sum is always >= _avg * > >> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in > >> > >> (1) update_cfs_rq_load_avg() > >> (2) detach_entity_load_avg() and dequeue_load_avg() > >> (3) update_tg_cfs_X() > >> > >> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the > >> reason for this? > > > > Main reason is that I have never seen the problem. > > Then, the problem comes from subtracting task's value whereas here we > > always add positive value > > OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values. ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-05 13:57 ` Vincent Guittot@ 2022-01-07 11:43 ` Dietmar Eggemann2022-01-07 15:21 ` Vincent Guittot 0 siblings, 1 reply; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-07 11:43 UTC (permalink / raw) To: Vincent Guittot Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On 05/01/2022 14:57, Vincent Guittot wrote: > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> >> On 04/01/2022 14:42, Vincent Guittot wrote: >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>> >>>> On 22/12/2021 10:38, Vincent Guittot wrote: >> >> [...] >> >>>> I still wonder whether the regression only comes from the changes in >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. >>>> And could be fixed only by this part of the patch-set (A): >>> >>> I have been able to trigger the warning even with (A) though It took >>> much more time. >>> And I have been able to catch wrong situations (with additional >>> traces) in the 3 places A, B and C >> >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ? > > not only. > also util_sum == 0 but util_avg !=0 in different places although these Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'. > situation didn't trigger sched_warn because some other sync happened > before the periodic call of __update_blocked_fair > or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum > just before removing the task I see. >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq >>>> *cfs_rq) >>>> >>>> r = removed_load; >>>> sub_positive(&sa->load_avg, r); >>>> - sa->load_sum = sa->load_avg * divider; >>>> + sub_positive(&sa->load_sum, r * divider); >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); >>>> >>>> r = removed_util; >>>> sub_positive(&sa->util_avg, r); >>>> - sa->util_sum = sa->util_avg * divider; >>>> + sub_positive(&sa->util_sum, r * divider); >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); >>>> >>>> r = removed_runnable; >>>> sub_positive(&sa->runnable_avg, r); >>>> - sa->runnable_sum = sa->runnable_avg * divider; >>>> + sub_positive(&sa->runnable_sum, r * divider); >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum, >>>> + sa->runnable_avg * MIN_DIVIDER); >>>> >>>> i.e. w/o changing update_tg_cfs_X() (and >>>> detach_entity_load_avg()/dequeue_load_avg()). >>>> >>>> update_load_avg() >>>> update_cfs_rq_load_avg() <--- >>>> propagate_entity_load_avg() >>>> update_tg_cfs_X() <--- >>>> >>>> >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on >>>> hackbench in several different sched group levels on >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. >>> >>> IIRC, it was with hikey960 with cgroup v1 >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2 >> >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it. Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D E/F G/H > 45h uptime >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s >>>>> >>>>> dequeue_load_avg(cfs_rq, se); >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); >>>>> + /* See update_tg_cfs_util() */ >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER); >>>>> + >>>> >>>> Maybe add a: >>>> >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced >>>> with *_avg") >>> >>> I spent time thinking about adding fixes tag. There is no crash/warn >>> so far so should we propagate it back in LTS for better performance ? >> >> Not sure I understand. What do you mean by 'should we propagate it back >> in LTS'? > > Sorry I had any stables in mind and not only LTS. > > Some of the changes done in PELT signal propagation that replace > subtracting util_sum by using util_avg * divider instead, are related > to other problems with sched group hierarchy and > throttling/unthrottling. I'm not 100% confident that using fixes tag > to backport this on stables doesn't need to backport more patches on > other areas in order to not resurrect old problems. So I wonder if > it's worth backporting this on stables OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That was the reason why I initially suggested to split the patch-set differently. But you said that you saw the issue also when (1) is fixed. ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-07 11:43 ` Dietmar Eggemann@ 2022-01-07 15:21 ` Vincent Guittot2022-01-11 7:54 ` Vincent Guittot 0 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2022-01-07 15:21 UTC (permalink / raw) To: Dietmar Eggemann Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 05/01/2022 14:57, Vincent Guittot wrote: > > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >> > >> On 04/01/2022 14:42, Vincent Guittot wrote: > >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > >>>> > >>>> On 22/12/2021 10:38, Vincent Guittot wrote: > >> > >> [...] > >> > >>>> I still wonder whether the regression only comes from the changes in > >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. > >>>> And could be fixed only by this part of the patch-set (A): > >>> > >>> I have been able to trigger the warning even with (A) though It took > >>> much more time. > >>> And I have been able to catch wrong situations (with additional > >>> traces) in the 3 places A, B and C > >> > >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ? > > > > not only. > > also util_sum == 0 but util_avg !=0 in different places although these > > Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'. > > > situation didn't trigger sched_warn because some other sync happened > > before the periodic call of __update_blocked_fair > > or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum > > just before removing the task > > I see. > > >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq > >>>> *cfs_rq) > >>>> > >>>> r = removed_load; > >>>> sub_positive(&sa->load_avg, r); > >>>> - sa->load_sum = sa->load_avg * divider; > >>>> + sub_positive(&sa->load_sum, r * divider); > >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); > >>>> > >>>> r = removed_util; > >>>> sub_positive(&sa->util_avg, r); > >>>> - sa->util_sum = sa->util_avg * divider; > >>>> + sub_positive(&sa->util_sum, r * divider); > >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > >>>> > >>>> r = removed_runnable; > >>>> sub_positive(&sa->runnable_avg, r); > >>>> - sa->runnable_sum = sa->runnable_avg * divider; > >>>> + sub_positive(&sa->runnable_sum, r * divider); > >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum, > >>>> + sa->runnable_avg * MIN_DIVIDER); > >>>> > >>>> i.e. w/o changing update_tg_cfs_X() (and > >>>> detach_entity_load_avg()/dequeue_load_avg()). > >>>> > >>>> update_load_avg() > >>>> update_cfs_rq_load_avg() <--- > >>>> propagate_entity_load_avg() > >>>> update_tg_cfs_X() <--- > >>>> > >>>> > >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on > >>>> hackbench in several different sched group levels on > >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. > >>> > >>> IIRC, it was with hikey960 with cgroup v1 > >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2 > >> > >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it. > > Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D > E/F G/H > 45h uptime > > >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > >>>>> > >>>>> dequeue_load_avg(cfs_rq, se); > >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > >>>>> + /* See update_tg_cfs_util() */ > >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER); > >>>>> + > >>>> > >>>> Maybe add a: > >>>> > >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced > >>>> with *_avg") > >>> > >>> I spent time thinking about adding fixes tag. There is no crash/warn > >>> so far so should we propagate it back in LTS for better performance ? > >> > >> Not sure I understand. What do you mean by 'should we propagate it back > >> in LTS'? > > > > Sorry I had any stables in mind and not only LTS. > > > > Some of the changes done in PELT signal propagation that replace > > subtracting util_sum by using util_avg * divider instead, are related > > to other problems with sched group hierarchy and > > throttling/unthrottling. I'm not 100% confident that using fixes tag > > to backport this on stables doesn't need to backport more patches on > > other areas in order to not resurrect old problems. So I wonder if > > it's worth backporting this on stables > > OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in > update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That > was the reason why I initially suggested to split the patch-set > differently. But you said that you saw the issue also when (1) is fixed. Ok, I think that we were not speaking about the same setup. I wrongly read that you were saying that sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); was only needed in update_cfs_rq_load_avg() but not in the other places. But what you said is that we only need the below to fix the perf regression raised by rick ? r = removed_util; sub_positive(&sa->util_avg, r); - sa->util_sum = sa->util_avg * divider; + sub_positive(&sa->util_sum, r * divider); + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); The WARN that I mentioned in my previous email was about not adding the max_t in all 3 places. I rerun some test today and I triggered the WARN after a detach without the max_t line. I can probably isolate the code above in a dedicated patch for the regression raised by Rick and we could consider adding a fixes tag; I will run more tests with only this part during the weekend. That being said, we need to stay consistent in all 3 places where we move or propagate some *_avg. In particular, using "sa->util_sum = sa->util_avg * divider" has the drawback of filtering the contribution not already accounted for in util_avg and the impact is much larger than expected. It means that although fixing only update_cfs_rq_load_avg() seems enough for rick's case, some other cases could be impacted by the 2 other places and we need to fixed them now that we have a better view of the root cause ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-07 15:21 ` Vincent Guittot@ 2022-01-11 7:54 ` Vincent Guittot2022-01-11 12:37 ` Dietmar Eggemann 0 siblings, 1 reply; 16+ messages in thread From: Vincent Guittot @ 2022-01-11 7:54 UTC (permalink / raw) To: Dietmar Eggemann Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > > > On 05/01/2022 14:57, Vincent Guittot wrote: > > > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > >> > > >> On 04/01/2022 14:42, Vincent Guittot wrote: > > >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > >>>> > > >>>> On 22/12/2021 10:38, Vincent Guittot wrote: > > >> > > >> [...] > > >> > > >>>> I still wonder whether the regression only comes from the changes in > > >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39. > > >>>> And could be fixed only by this part of the patch-set (A): > > >>> > > >>> I have been able to trigger the warning even with (A) though It took > > >>> much more time. > > >>> And I have been able to catch wrong situations (with additional > > >>> traces) in the 3 places A, B and C > > >> > > >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ? > > > > > > not only. > > > also util_sum == 0 but util_avg !=0 in different places although these > > > > Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'. > > > > > situation didn't trigger sched_warn because some other sync happened > > > before the periodic call of __update_blocked_fair > > > or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum > > > just before removing the task > > > > I see. > > > > >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq > > >>>> *cfs_rq) > > >>>> > > >>>> r = removed_load; > > >>>> sub_positive(&sa->load_avg, r); > > >>>> - sa->load_sum = sa->load_avg * divider; > > >>>> + sub_positive(&sa->load_sum, r * divider); > > >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER); > > >>>> > > >>>> r = removed_util; > > >>>> sub_positive(&sa->util_avg, r); > > >>>> - sa->util_sum = sa->util_avg * divider; > > >>>> + sub_positive(&sa->util_sum, r * divider); > > >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > >>>> > > >>>> r = removed_runnable; > > >>>> sub_positive(&sa->runnable_avg, r); > > >>>> - sa->runnable_sum = sa->runnable_avg * divider; > > >>>> + sub_positive(&sa->runnable_sum, r * divider); > > >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum, > > >>>> + sa->runnable_avg * MIN_DIVIDER); > > >>>> > > >>>> i.e. w/o changing update_tg_cfs_X() (and > > >>>> detach_entity_load_avg()/dequeue_load_avg()). > > >>>> > > >>>> update_load_avg() > > >>>> update_cfs_rq_load_avg() <--- > > >>>> propagate_entity_load_avg() > > >>>> update_tg_cfs_X() <--- > > >>>> > > >>>> > > >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on > > >>>> hackbench in several different sched group levels on > > >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime]. > > >>> > > >>> IIRC, it was with hikey960 with cgroup v1 > > >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2 > > >> > > >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it. > > > > Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D > > E/F G/H > 45h uptime > > > > >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > >>>>> > > >>>>> dequeue_load_avg(cfs_rq, se); > > >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); > > >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider; > > >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); > > >>>>> + /* See update_tg_cfs_util() */ > > >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum, > > >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER); > > >>>>> + > > >>>> > > >>>> Maybe add a: > > >>>> > > >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced > > >>>> with *_avg") > > >>> > > >>> I spent time thinking about adding fixes tag. There is no crash/warn > > >>> so far so should we propagate it back in LTS for better performance ? > > >> > > >> Not sure I understand. What do you mean by 'should we propagate it back > > >> in LTS'? > > > > > > Sorry I had any stables in mind and not only LTS. > > > > > > Some of the changes done in PELT signal propagation that replace > > > subtracting util_sum by using util_avg * divider instead, are related > > > to other problems with sched group hierarchy and > > > throttling/unthrottling. I'm not 100% confident that using fixes tag > > > to backport this on stables doesn't need to backport more patches on > > > other areas in order to not resurrect old problems. So I wonder if > > > it's worth backporting this on stables > > > > OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in > > update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That > > was the reason why I initially suggested to split the patch-set > > differently. But you said that you saw the issue also when (1) is fixed. > > Ok, I think that we were not speaking about the same setup. I wrongly > read that you were saying that > sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > was only needed in update_cfs_rq_load_avg() but not in the other places. > > But what you said is that we only need the below to fix the perf > regression raised by rick ? > r = removed_util; > sub_positive(&sa->util_avg, r); > - sa->util_sum = sa->util_avg * divider; > + sub_positive(&sa->util_sum, r * divider); > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); The test with the code above doesn't trigger any SCHEd_WARN over the weekend so it's probably ok to make a dedicated patch for this with tag. I'm going to prepare a v2 > > The WARN that I mentioned in my previous email was about not adding > the max_t in all 3 places. I rerun some test today and I triggered the > WARN after a detach without the max_t line. > > I can probably isolate the code above in a dedicated patch for the > regression raised by Rick and we could consider adding a fixes tag; I > will run more tests with only this part during the weekend. > That being said, we need to stay consistent in all 3 places where we > move or propagate some *_avg. In particular, using "sa->util_sum = > sa->util_avg * divider" has the drawback of filtering the contribution > not already accounted for in util_avg and the impact is much larger > than expected. It means that although fixing only > update_cfs_rq_load_avg() seems enough for rick's case, some other > cases could be impacted by the 2 other places and we need to fixed > them now that we have a better view of the root cause ^ permalink raw reply [flat|nested] 16+ messages in thread

*Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg2022-01-11 7:54 ` Vincent Guittot@ 2022-01-11 12:37 ` Dietmar Eggemann0 siblings, 0 replies; 16+ messages in thread From: Dietmar Eggemann @ 2022-01-11 12:37 UTC (permalink / raw) To: Vincent Guittot Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot, linux-kernel, rickyiu, odin, sachinp, naresh.kamboju On 11/01/2022 08:54, Vincent Guittot wrote: > On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <vincent.guittot@linaro.org> wrote: >> >> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>> >>> On 05/01/2022 14:57, Vincent Guittot wrote: >>>> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>>> >>>>> On 04/01/2022 14:42, Vincent Guittot wrote: >>>>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>>>>>> >>>>>>> On 22/12/2021 10:38, Vincent Guittot wrote: [...] >>>> Some of the changes done in PELT signal propagation that replace >>>> subtracting util_sum by using util_avg * divider instead, are related >>>> to other problems with sched group hierarchy and >>>> throttling/unthrottling. I'm not 100% confident that using fixes tag >>>> to backport this on stables doesn't need to backport more patches on >>>> other areas in order to not resurrect old problems. So I wonder if >>>> it's worth backporting this on stables >>> >>> OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in >>> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That >>> was the reason why I initially suggested to split the patch-set >>> differently. But you said that you saw the issue also when (1) is fixed. >> >> Ok, I think that we were not speaking about the same setup. I wrongly >> read that you were saying that >> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); >> was only needed in update_cfs_rq_load_avg() but not in the other places. >> >> But what you said is that we only need the below to fix the perf >> regression raised by rick ? >> r = removed_util; >> sub_positive(&sa->util_avg, r); >> - sa->util_sum = sa->util_avg * divider; >> + sub_positive(&sa->util_sum, r * divider); >> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER); > > The test with the code above doesn't trigger any SCHEd_WARN over the > weekend so it's probably ok to make a dedicated patch for this with > tag. > I'm going to prepare a v2 Yes, `sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER)` is needed for all 3 X = [load, runnable, util] in update_cfs_rq_load_avg() to not hit the SCHED_WARN_ON() in cfs_rq_is_decayed(). >> The WARN that I mentioned in my previous email was about not adding >> the max_t in all 3 places. I rerun some test today and I triggered the >> WARN after a detach without the max_t line. >> >> I can probably isolate the code above in a dedicated patch for the >> regression raised by Rick and we could consider adding a fixes tag; I >> will run more tests with only this part during the weekend. >> That being said, we need to stay consistent in all 3 places where we >> move or propagate some *_avg. In particular, using "sa->util_sum = >> sa->util_avg * divider" has the drawback of filtering the contribution >> not already accounted for in util_avg and the impact is much larger >> than expected. It means that although fixing only >> update_cfs_rq_load_avg() seems enough for rick's case, some other >> cases could be impacted by the 2 other places and we need to fixed >> them now that we have a better view of the root cause Agreed. ^ permalink raw reply [flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-01-11 12:37 UTC | newest]Thread overview:16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-12-22 9:37 [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Vincent Guittot 2021-12-22 9:38 ` [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg Vincent Guittot 2022-01-04 11:47 ` Dietmar Eggemann 2022-01-04 13:42 ` Vincent Guittot 2022-01-05 13:15 ` Dietmar Eggemann 2022-01-05 13:57 ` Vincent Guittot 2022-01-07 11:43 ` Dietmar Eggemann 2022-01-07 15:21 ` Vincent Guittot 2022-01-11 7:54 ` Vincent Guittot 2022-01-11 12:37 ` Dietmar Eggemann 2022-01-04 13:48 ` Vincent Guittot 2021-12-22 9:38 ` [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg Vincent Guittot 2022-01-04 11:47 ` Dietmar Eggemann 2021-12-22 9:38 ` [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg Vincent Guittot 2022-01-04 11:47 ` Dietmar Eggemann 2022-01-04 11:46 ` [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg Dietmar Eggemann

This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.