* [PATCH 0/3] sched/fair: Fix load decay issues related to throttling
@ 2021-05-18 12:51 Odin Ugedal
  2021-05-18 12:52 ` [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking Odin Ugedal
  ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Odin Ugedal @ 2021-05-18 12:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira
  Cc: cgroups, linux-kernel, Odin Ugedal

Here is a follow-up with some fairness fixes related to throttling, PELT
and load decay in general. It is related to the discussion in:
https://lore.kernel.org/lkml/20210425080902.11854-1-odin@uged.al
and
https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al

Tested on v5.13-rc2 (since that contains the fix from above^).

The patch descriptions should make sense on their own, and I have
attached some simple reproduction scripts at the end of this mail. I
also appended a patch fixing some ascii art that I have looked at
several times without understanding it, since it turns out it breaks
when tabs are not 8 spaces. I can submit that as a separate patch if
necessary.

Also, I have no idea what to call the "insert_on_unthrottle" var, so
feel free to come up with suggestions.

There are probably "better" and more reliable ways to reproduce this,
but these work for me "most of the time", and give an ok context imo.
Throttling is not deterministic, so keep that in mind. I have been
testing with CONFIG_HZ=250, so if you use =1000 (or anything else), you
might get other results, or find it harder to reproduce.

Reprod script for "Add tg_load_contrib cfs_rq decay checking":

--- bash start
CGROUP=/sys/fs/cgroup/slice

function run_sandbox {
  local CG="$1"
  local LCPU="$2"
  local SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"

  mkdir -p "$CG/sub"
  tee "$CG"/cgroup.subtree_control <<< "+cpuset +cpu"
  tee "$CG"/sub/cgroup.procs <<< "$TASK"
  tee "$CG"/sub/cpuset.cpus <<< "$LCPU"
  tee "$CG"/sub/cpu.weight <<< "$SHARES"
  tee "$CG"/cpu.max <<< "10000 100000"

  sleep .1
  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP"
tee "$CGROUP"/cgroup.subtree_control <<< "+cpuset +cpu"

run_sandbox "$CGROUP/cg-1" "0" 100 "stress --cpu 1"
run_sandbox "$CGROUP/cg-2" "3" 100 "stress --cpu 1"

sleep 1.02
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "1"
sleep 1.05
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "2"
sleep 1.07
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "3"
sleep 2

tee "$CGROUP"/cg-1/cpu.max <<< "max"
tee "$CGROUP"/cg-2/cpu.max <<< "max"

read
killall stress
sleep .2
rmdir /sys/fs/cgroup/slice/{cg-{1,2}{/sub,},}

# Often gives:
# odin@4670k # cat /sys/kernel/debug/sched/debug | grep ":/slice" -A 28 | egrep "(:/slice)|tg_load_avg"
# cfs_rq[3]:/slice/cg-2/sub
#   .tg_load_avg_contrib           : 1024
#   .tg_load_avg                   : 1024
# cfs_rq[3]:/slice/cg-1/sub
#   .tg_load_avg_contrib           : 1023
#   .tg_load_avg                   : 1023
# cfs_rq[3]:/slice/cg-1
#   .tg_load_avg_contrib           : 1040
#   .tg_load_avg                   : 2062
# cfs_rq[3]:/slice/cg-2
#   .tg_load_avg_contrib           : 1013
#   .tg_load_avg                   : 1013
# cfs_rq[3]:/slice
#   .tg_load_avg_contrib           : 1540
#   .tg_load_avg                   : 1540
--- bash end

Reprod for "sched/fair: Correctly insert cfs_rqs to list on unthrottle":

--- bash start
CGROUP=/sys/fs/cgroup/slice
TMP_CG=/sys/fs/cgroup/tmp
OLD_CG=/sys/fs/cgroup"$(cat /proc/self/cgroup | cut -c4-)"

function run_sandbox {
  local CG="$1"
  local LCPU="$2"
  local SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"

  mkdir -p "$CG/sub"
  tee "$CG"/cgroup.subtree_control <<< "+cpuset +cpu"
  tee "$CG"/sub/cpuset.cpus <<< "$LCPU"
  tee "$CG"/sub/cgroup.procs <<< "$TASK"
  tee "$CG"/sub/cpu.weight <<< "$SHARES"

  sleep .01
  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP"
mkdir -p "$TMP_CG"
tee "$CGROUP"/cgroup.subtree_control <<< "+cpuset +cpu"

echo $$ | tee "$TMP_CG"/cgroup.procs
tee "$TMP_CG"/cpuset.cpus <<< "0"
sleep .1
tee "$CGROUP"/cpu.max <<< "1000 4000"

run_sandbox "$CGROUP/cg-0" "0" 10000 "stress --cpu 1"
run_sandbox "$CGROUP/cg-3" "3" 1 "stress --cpu 1"

sleep 2

tee "$CGROUP"/cg-0/sub/cpuset.cpus <<< "3"
tee "$CGROUP"/cpu.max <<< "max"

read
killall stress
sleep .2
echo $$ | tee "$OLD_CG"/cgroup.procs
rmdir "$TMP_CG" /sys/fs/cgroup/slice/{cg-{0,3}{/sub,},}

# Often gives:
# odin@4670k # cat /sys/kernel/debug/sched/debug | grep ":/slice" -A 28 | egrep "(:/slice)|tg_load_avg"
# cfs_rq[3]:/slice/cg-3/sub
#   .tg_load_avg_contrib           : 1039
#   .tg_load_avg                   : 2036
# cfs_rq[3]:/slice/cg-0/sub
#   .tg_load_avg_contrib           : 1023
#   .tg_load_avg                   : 1023
# cfs_rq[3]:/slice/cg-0
#   .tg_load_avg_contrib           : 102225
#   .tg_load_avg                   : 102225
# cfs_rq[3]:/slice/cg-3
#   .tg_load_avg_contrib           : 4
#   .tg_load_avg                   : 1001
# cfs_rq[3]:/slice
#   .tg_load_avg_contrib           : 1038
#   .tg_load_avg                   : 1038
--- bash end

Thanks
Odin

Odin Ugedal (3):
  sched/fair: Add tg_load_contrib cfs_rq decay checking
  sched/fair: Correctly insert cfs_rq's to list on unthrottle
  sched/fair: Fix ascii art by relpacing tabs

 kernel/sched/fair.c  | 22 +++++++++++++---------
 kernel/sched/sched.h |  1 +
 2 files changed, 14 insertions(+), 9 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 24+ messages in thread
* [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-18 12:51 [PATCH 0/3] sched/fair: Fix load decay issues related to throttling Odin Ugedal
@ 2021-05-18 12:52 ` Odin Ugedal
  2021-05-25  9:58   ` Vincent Guittot
  2021-05-18 12:52 ` [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle Odin Ugedal
  2021-05-18 12:52 ` [PATCH 3/3] sched/fair: Fix ascii art by relpacing tabs Odin Ugedal
  2 siblings, 1 reply; 24+ messages in thread
From: Odin Ugedal @ 2021-05-18 12:52 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira
  Cc: cgroups, linux-kernel, Odin Ugedal

Make sure cfs_rq does not contribute to task group load avg when
checking if it is decayed. Due to how the pelt tracking works,
the divider can result in a situation where:

cfs_rq->avg.load_sum = 0
cfs_rq->avg.load_avg = 4
cfs_rq->avg.tg_load_avg_contrib = 4

If pelt tracking in this case does not cross a period, there is no
"change" in load_sum, and therefore load_avg is not recalculated, and
keeps its value.

If this cfs_rq is then removed from the leaf list, it results in a
situation where the load is never removed from the tg. If that happens,
the fairness is permanently skewed.

Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <odin@uged.al>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24a90b0..ceda53c2a87a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8004,6 +8004,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.runnable_sum)
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }
 
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-18 12:52 ` [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking Odin Ugedal
@ 2021-05-25  9:58   ` Vincent Guittot
  2021-05-25 10:33     ` Odin Ugedal
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-05-25  9:58 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
      Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
      open list:CONTROL GROUP (CGROUP), linux-kernel

On Tue, 18 May 2021 at 14:54, Odin Ugedal <odin@uged.al> wrote:
>
> Make sure cfs_rq does not contribute to task group load avg when
> checking if it is decayed. Due to how the pelt tracking works,
> the divider can result in a situation where:
>
> cfs_rq->avg.load_sum = 0
> cfs_rq->avg.load_avg = 4

Could you give more details about how cfs_rq->avg.load_avg = 4 but
cfs_rq->avg.load_sum = 0 ?

cfs_rq->avg.load_sum is decayed and can become null when crossing a
period, which implies an update of cfs_rq->avg.load_avg. This means
that your case is generated by something outside the pelt formula ...
like maybe the propagation of load in the tree. If this is the case,
we should find the error and fix it.

> cfs_rq->avg.tg_load_avg_contrib = 4
>
> If pelt tracking in this case does not cross a period, there is no
> "change" in load_sum, and therefore load_avg is not recalculated, and
> keeps its value.
>
> If this cfs_rq is then removed from the leaf list, it results in a
> situation where the load is never removed from the tg. If that happens,
> the fairness is permanently skewed.
>
> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> Signed-off-by: Odin Ugedal <odin@uged.al>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3248e24a90b0..ceda53c2a87a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8004,6 +8004,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (cfs_rq->avg.runnable_sum)
>                 return false;
>
> +       if (cfs_rq->tg_load_avg_contrib)
> +               return false;
> +
>         return true;
>  }
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-25  9:58   ` Vincent Guittot
@ 2021-05-25 10:33     ` Odin Ugedal
  2021-05-25 14:30       ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Odin Ugedal @ 2021-05-25 10:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
      linux-kernel

Hi,

tir. 25. mai 2021 kl. 11:58 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> Could you give more details about how cfs_rq->avg.load_avg = 4 but
> cfs_rq->avg.load_sum = 0 ?
>
> cfs_rq->avg.load_sum is decayed and can become null when crossing a
> period, which implies an update of cfs_rq->avg.load_avg. This means
> that your case is generated by something outside the pelt formula ...
> like maybe the propagation of load in the tree. If this is the case,
> we should find the error and fix it.

Ahh, yeah, that could probably be described better.

It is (as far as I have found out) because the pelt divider is changed,
and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
in a different value being removed than added.

Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
recalculates the load_avg based on load_sum, and not the delta, afaik.

And as you say, the "issue" is therefore (as I see it) outside of PELT.
Due to how the pelt divider is changed, I assume it is hard to pinpoint
where the issue is. I can try to find a clear path where we can see what
is added and what is removed from both cfs_rq->avg.load_sum and
cfs_rq->avg.load_avg, to better be able to pinpoint what is happening.

Previously I thought this was a result of precision loss due to division and
multiplication during load add/remove inside fair.c, but I am not sure that
is the issue, or is it?

If my above line of thought makes sense, do you still view this as an error
outside PELT, or do you see another possible/better solution?

Will investigate further.

Thanks
Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-25 10:33     ` Odin Ugedal
@ 2021-05-25 14:30       ` Vincent Guittot
  2021-05-26 10:50         ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-05-25 14:30 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
      Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
      open list:CONTROL GROUP (CGROUP), linux-kernel

On Tue, 25 May 2021 at 12:34, Odin Ugedal <odin@uged.al> wrote:
>
> Hi,
>
> tir. 25. mai 2021 kl. 11:58 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> > Could you give more details about how cfs_rq->avg.load_avg = 4 but
> > cfs_rq->avg.load_sum = 0 ?
> >
> > cfs_rq->avg.load_sum is decayed and can become null when crossing a
> > period, which implies an update of cfs_rq->avg.load_avg. This means
> > that your case is generated by something outside the pelt formula ...
> > like maybe the propagation of load in the tree. If this is the case,
> > we should find the error and fix it.
>
> Ahh, yeah, that could probably be described better.
>
> It is (as far as I have found out) because the pelt divider is changed,
> and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
> in a different value being removed than added.

ok so IIUC, it happens during the adding/removing/propagating of
entities' load in the cfs_rq.

>
> Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
> recalculates the load_avg based on load_sum, and not the delta, afaik.
>
> And as you say, the "issue" is therefore (as I see it) outside of PELT.
> Due to how the pelt divider is changed, I assume it is hard to pinpoint
> where the issue is. I can try to find a clear path where we can see what
> is added and what is removed from both cfs_rq->avg.load_sum and
> cfs_rq->avg.load_avg, to better be able to pinpoint what is happening.
>
> Previously I thought this was a result of precision loss due to division and
> multiplication during load add/remove inside fair.c, but I am not sure that
> is the issue, or is it?

I don't think the precision loss is the problem, because
adding/removing load in fair.c could truncate load_sum but it stays in
sync with load_avg. I will have a look to see if I can see something
weird.

>
> If my above line of thought makes sense, do you still view this as an error
> outside PELT, or do you see another possible/better solution?
>
> Will investigate further.
>
> Thanks
> Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-25 14:30       ` Vincent Guittot
@ 2021-05-26 10:50         ` Vincent Guittot
  2021-05-27  7:50           ` Odin Ugedal
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-05-26 10:50 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
      Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
      open list:CONTROL GROUP (CGROUP), linux-kernel

On Tue, 25 May 2021 at 16:30, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 25 May 2021 at 12:34, Odin Ugedal <odin@uged.al> wrote:
> >
> > Hi,
> >
> > tir. 25. mai 2021 kl. 11:58 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> > > Could you give more details about how cfs_rq->avg.load_avg = 4 but
> > > cfs_rq->avg.load_sum = 0 ?
> > >
> > > cfs_rq->avg.load_sum is decayed and can become null when crossing a
> > > period, which implies an update of cfs_rq->avg.load_avg. This means
> > > that your case is generated by something outside the pelt formula ...
> > > like maybe the propagation of load in the tree. If this is the case,
> > > we should find the error and fix it.
> >
> > Ahh, yeah, that could probably be described better.
> >
> > It is (as far as I have found out) because the pelt divider is changed,
> > and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
> > in a different value being removed than added.
>
> ok so IIUC, it happens during the adding/removing/propagating of
> entities' load in the cfs_rq.
>
> >
> > Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
> > recalculates the load_avg based on load_sum, and not the delta, afaik.
> >
> > And as you say, the "issue" is therefore (as I see it) outside of PELT.
> > Due to how the pelt divider is changed, I assume it is hard to pinpoint
> > where the issue is. I can try to find a clear path where we can see what
> > is added and what is removed from both cfs_rq->avg.load_sum and
> > cfs_rq->avg.load_avg, to better be able to pinpoint what is happening.
> >
> > Previously I thought this was a result of precision loss due to division and
> > multiplication during load add/remove inside fair.c, but I am not sure that
> > is the issue, or is it?
>
> I don't think the precision loss is the problem, because
> adding/removing load in fair.c could truncate load_sum but it stays in
> sync with load_avg. I will have a look to see if I can see something
> weird.

I have added a trace in cfs_rq_is_decayed() but I'm not able to
reproduce a situation where load_sum == 0 but not load_avg, even with
the script in the cover letter.

> >
> > If my above line of thought makes sense, do you still view this as an error
> > outside PELT, or do you see another possible/better solution?
> >
> > Will investigate further.
> >
> > Thanks
> > Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-26 10:50         ` Vincent Guittot
@ 2021-05-27  7:50           ` Odin Ugedal
  2021-05-27  9:35             ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Odin Ugedal @ 2021-05-27  7:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
      linux-kernel

Hi,

Did some quick testing with bpftrace, and it looks like it is
"update_tg_cfs_load" that does this;

kretfunc:attach_entity_load_avg: cfs_rq: 0xffff8a3e6773e400,
se.load_sum: 0, se.load_avg: 0, se->load.weight: 1048576 diff.sum: 0,
diff.load: 0, new_sum: 0, new_load: 0, period_contrib: 821

// quite some time passes

kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
0, old load_avg: 0, new load_sum: 47876096, new load_avg: 1022
tg_load_avg_contrib: 0, period_contrib: 82, on_list: 0, throttled: 0/0
stack:
        bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
        bpf_get_stackid_raw_tp+115
        bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
        bpf_trampoline_6442466486_1+144
        update_tg_cfs_load+5
        update_load_avg+450
        enqueue_entity+91
        enqueue_task_fair+134
        move_queued_task+172
        affine_move_task+1282
        __set_cpus_allowed_ptr+317
        update_tasks_cpumask+70
        update_cpumasks_hier+448
        update_cpumask+345
        cpuset_write_resmask+796
        cgroup_file_write+162
        kernfs_fop_write_iter+284
        new_sync_write+345
        vfs_write+511
        ksys_write+103
        do_syscall_64+51
        entry_SYSCALL_64_after_hwframe+68

// There might be more stuff happening here
// __update_load_avg_cfs_rq runs way too often to be able to
// trace it without filtering, and didn't look into that.

kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
48117265, old load_avg: 1025, new load_sum: 0, new load_avg: 2
tg_load_avg_contrib: 1022, period_contrib: 223, on_list: 1, throttled:
1/1 stack:
        bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
        bpf_get_stackid_raw_tp+115
        bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
        bpf_trampoline_6442466486_1+144
        update_tg_cfs_load+5
        update_load_avg+450
        update_blocked_averages+723
        newidle_balance+533
        pick_next_task_fair+57
        __schedule+376
        schedule+91
        schedule_hrtimeout_range_clock+164
        do_sys_poll+1043
        __x64_sys_poll+179
        do_syscall_64+51
        entry_SYSCALL_64_after_hwframe+68

This particular example resulted in a 75/25 share of cpu time for the
two groups.

From this, it also looks like it doesn't matter if new load_avg is 0 or not
(when new load_sum is 0), since tg_load_avg_contrib will not be properly
removed either way (without this patch). There might be some other precision

> I have added a trace in cfs_rq_is_decayed() but I'm not able to
> reproduce a situation where load_sum == 0 but not load_avg, even with
> the script in the cover letter.

Hmm, strange. I am able to reproduce on both v5.12 and v5.13. I am
running on a non-SMT 4 core CPU (4670k), with CONFIG_HZ=250. Since this
is timing sensitive, you might need to tweak the timings for either the
CFS bw period/quota, or the time waiting between changing what cpu it
runs on. If you have more than 4 cpus, you can also try to start on CPU
0 and move it through all cores and then onto CPU n. Maybe that makes
it possible to reproduce.

Since bpftrace doesn't work properly on v5.13, this particular test was
on v5.12 with a cherry-pick of 0258bdfaff5bd13c4d2383150b7097aecd6b6d82
and the other patch in this patchset.

testscript.bt:

kfunc:attach_entity_load_avg {@a_sum[tid] = args->cfs_rq->avg.load_sum; @a_load[tid] = args->cfs_rq->avg.load_avg;}

kretfunc:attach_entity_load_avg {printf("%s: cfs_rq: %p, se.load_sum: %llu, se.load_avg: %llu, se->load.weight: %llu diff.sum: %llu, diff.load: %llu, new_sum: %llu, new_load: %llu, period_contrib: %llu\n", probe, args->cfs_rq, args->se->avg.load_sum, args->se->avg.load_avg, args->se->load.weight, args->cfs_rq->avg.load_sum - @a_sum[tid], args->cfs_rq->avg.load_avg - @a_load[tid], args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg, args->cfs_rq->avg.period_contrib)}

kfunc:update_tg_cfs_load {@beg_load_avg[tid] = args->cfs_rq->avg.load_avg; @beg_load_sum[tid] = args->cfs_rq->avg.load_sum}

kretfunc:update_tg_cfs_load {printf("%s: cfs_rq: %p, old load_sum %llu, old load_avg: %llu, new load_sum: %llu, new load_avg: %llu tg_load_avg_contrib: %llu, period_contrib: %llu, on_list: %d, throttled: %d/%d stack: %s\n", probe, args->cfs_rq, @beg_load_sum[tid], @beg_load_avg[tid], args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg, args->cfs_rq->tg_load_avg_contrib, args->cfs_rq->avg.period_contrib, args->cfs_rq->on_list, args->cfs_rq->throttled, args->cfs_rq->throttle_count, kstack)}

Thanks
Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-27  7:50           ` Odin Ugedal
@ 2021-05-27  9:35             ` Vincent Guittot
  2021-05-27  9:45               ` Odin Ugedal
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-05-27  9:35 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
      Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
      open list:CONTROL GROUP (CGROUP), linux-kernel

On Thu, 27 May 2021 at 09:51, Odin Ugedal <odin@uged.al> wrote:
>
> Hi,
>
> Did some quick testing with bpftrace, and it looks like it is
> "update_tg_cfs_load" that does this;
>
> kretfunc:attach_entity_load_avg: cfs_rq: 0xffff8a3e6773e400,
> se.load_sum: 0, se.load_avg: 0, se->load.weight: 1048576 diff.sum: 0,
> diff.load: 0, new_sum: 0, new_load: 0, period_contrib: 821
>
> // quite some time passes
>
> kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
> 0, old load_avg: 0, new load_sum: 47876096, new load_avg: 1022
> tg_load_avg_contrib: 0, period_contrib: 82, on_list: 0, throttled: 0/0
> stack:
>         bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
>         bpf_get_stackid_raw_tp+115
>         bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
>         bpf_trampoline_6442466486_1+144
>         update_tg_cfs_load+5
>         update_load_avg+450
>         enqueue_entity+91
>         enqueue_task_fair+134
>         move_queued_task+172
>         affine_move_task+1282
>         __set_cpus_allowed_ptr+317
>         update_tasks_cpumask+70
>         update_cpumasks_hier+448
>         update_cpumask+345
>         cpuset_write_resmask+796
>         cgroup_file_write+162
>         kernfs_fop_write_iter+284
>         new_sync_write+345
>         vfs_write+511
>         ksys_write+103
>         do_syscall_64+51
>         entry_SYSCALL_64_after_hwframe+68
>
> // There might be more stuff happening here
> // __update_load_avg_cfs_rq runs way too often to be able to
> // trace it without filtering, and didn't look into that.
>
> kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
> 48117265, old load_avg: 1025, new load_sum: 0, new load_avg: 2
> tg_load_avg_contrib: 1022, period_contrib: 223, on_list: 1, throttled:
> 1/1 stack:
>         bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
>         bpf_get_stackid_raw_tp+115
>         bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
>         bpf_trampoline_6442466486_1+144
>         update_tg_cfs_load+5
>         update_load_avg+450
>         update_blocked_averages+723
>         newidle_balance+533
>         pick_next_task_fair+57
>         __schedule+376
>         schedule+91
>         schedule_hrtimeout_range_clock+164
>         do_sys_poll+1043
>         __x64_sys_poll+179
>         do_syscall_64+51
>         entry_SYSCALL_64_after_hwframe+68
>
> This particular example resulted in a 75/25 share of cpu time for the
> two groups.
>
> From this, it also looks like it doesn't matter if new load_avg is 0 or not
> (when new load_sum is 0), since tg_load_avg_contrib will not be properly
> removed either way (without this patch). There might be some other precision
>
> > I have added a trace in cfs_rq_is_decayed() but I'm not able to
> > reproduce a situation where load_sum == 0 but not load_avg, even with
> > the script in the cover letter.
>
> Hmm, strange. I am able to reproduce on both v5.12 and v5.13. I am
> running on a non-SMT 4 core CPU (4670k), with CONFIG_HZ=250. Since this
> is timing sensitive, you might need to tweak the timings for either the
> CFS bw period/quota, or the time waiting between changing what cpu it
> runs on. If you have more than 4 cpus, you can also try to start on CPU
> 0 and move it through all cores and then onto CPU n. Maybe that makes
> it possible to reproduce.

I finally got it this morning with your script, and I confirm that the
problem of load_sum == 0 but load_avg != 0 comes from
update_tg_cfs_load(). Then, it seems that we don't call
update_tg_load_avg() for this cfs_rq in __update_blocked_fair() because
of a recent update while propagating the child's load changes. At the
end we remove the cfs_rq from the list without updating its
contribution.

I'm going to prepare a patch to fix this.

>
> Since bpftrace doesn't work properly on v5.13, this particular test was
> on v5.12 with a cherry-pick of 0258bdfaff5bd13c4d2383150b7097aecd6b6d82
> and the other patch in this patchset.
>
> testscript.bt:
>
> kfunc:attach_entity_load_avg {@a_sum[tid] = args->cfs_rq->avg.load_sum; @a_load[tid] = args->cfs_rq->avg.load_avg;}
>
> kretfunc:attach_entity_load_avg {printf("%s: cfs_rq: %p, se.load_sum: %llu, se.load_avg: %llu, se->load.weight: %llu diff.sum: %llu, diff.load: %llu, new_sum: %llu, new_load: %llu, period_contrib: %llu\n", probe, args->cfs_rq, args->se->avg.load_sum, args->se->avg.load_avg, args->se->load.weight, args->cfs_rq->avg.load_sum - @a_sum[tid], args->cfs_rq->avg.load_avg - @a_load[tid], args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg, args->cfs_rq->avg.period_contrib)}
>
> kfunc:update_tg_cfs_load {@beg_load_avg[tid] = args->cfs_rq->avg.load_avg; @beg_load_sum[tid] = args->cfs_rq->avg.load_sum}
>
> kretfunc:update_tg_cfs_load {printf("%s: cfs_rq: %p, old load_sum %llu, old load_avg: %llu, new load_sum: %llu, new load_avg: %llu tg_load_avg_contrib: %llu, period_contrib: %llu, on_list: %d, throttled: %d/%d stack: %s\n", probe, args->cfs_rq, @beg_load_sum[tid], @beg_load_avg[tid], args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg, args->cfs_rq->tg_load_avg_contrib, args->cfs_rq->avg.period_contrib, args->cfs_rq->on_list, args->cfs_rq->throttled, args->cfs_rq->throttle_count, kstack)}
>
> Thanks
> Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
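[Editor's note: the direction Vincent describes can be sketched roughly as follows. This is a hypothetical, simplified sketch in kernel style, not the actual patch he later posted; the function name `__update_blocked_fair_sketch` is invented, and the helpers (`for_each_leaf_cfs_rq_safe`, `update_cfs_rq_load_avg`, `update_tg_load_avg`, `cfs_rq_is_decayed`, `list_del_leaf_cfs_rq`) are real fair.c symbols used here out of context.]

```c
/*
 * Sketch only: before a fully decayed cfs_rq is deleted from the leaf
 * list, force a tg_load_avg_contrib sync so the task group sum cannot
 * keep a stale contribution from this cfs_rq.
 */
static void __update_blocked_fair_sketch(struct rq *rq)
{
	struct cfs_rq *cfs_rq, *pos;

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
			update_tg_load_avg(cfs_rq);

		if (cfs_rq_is_decayed(cfs_rq)) {
			/* sync the contribution before the list removal */
			update_tg_load_avg(cfs_rq);
			list_del_leaf_cfs_rq(cfs_rq);
		}
	}
}
```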
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-27  9:35             ` Vincent Guittot
@ 2021-05-27  9:45               ` Odin Ugedal
  2021-05-27 10:49                 ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Odin Ugedal @ 2021-05-27  9:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
      linux-kernel

Hi,

> I finally got it this morning with your script, and I confirm that the
> problem of load_sum == 0 but load_avg != 0 comes from
> update_tg_cfs_load(). Then, it seems that we don't call
> update_tg_load_avg() for this cfs_rq in __update_blocked_fair() because
> of a recent update while propagating the child's load changes. At the
> end we remove the cfs_rq from the list without updating its
> contribution.
>
> I'm going to prepare a patch to fix this.

Yeah, that is another way to look at it. I have not verified it, but
wouldn't update_tg_load_avg() in this case just remove the diff
(load_avg - tg_load_avg_contrib)? Wouldn't we still see some
tg_load_avg_contrib after the cfs_rq is removed from the list then?
E.g. in my example above, the cfs_rq will be removed from the list
while tg_load_avg_contrib=2, or am I missing something? That was my
thought when I looked at it last week at least.

Thanks
Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-27  9:45               ` Odin Ugedal
@ 2021-05-27 10:49                 ` Vincent Guittot
  2021-05-27 11:04                   ` Odin Ugedal
  2021-05-27 12:37                   ` Odin Ugedal
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Guittot @ 2021-05-27 10:49 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
      Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
      open list:CONTROL GROUP (CGROUP), linux-kernel

On Thu, 27 May 2021 at 11:46, Odin Ugedal <odin@uged.al> wrote:
>
> Hi,
>
> > I finally got it this morning with your script, and I confirm that the
> > problem of load_sum == 0 but load_avg != 0 comes from
> > update_tg_cfs_load(). Then, it seems that we don't call
> > update_tg_load_avg() for this cfs_rq in __update_blocked_fair() because
> > of a recent update while propagating the child's load changes. At the
> > end we remove the cfs_rq from the list without updating its
> > contribution.
> >
> > I'm going to prepare a patch to fix this.
>
> Yeah, that is another way to look at it. I have not verified it, but
> wouldn't update_tg_load_avg() in this case just remove the diff
> (load_avg - tg_load_avg_contrib)? Wouldn't we still see some
> tg_load_avg_contrib after the cfs_rq is removed from the list then?
> E.g. in my example above, the cfs_rq will be removed from the list
> while tg_load_avg_contrib=2, or am I missing something? That was my
> thought when I looked at it last week at least.

1st: ensure that cfs_rq->avg.load_sum is not null if cfs_rq->avg.load_avg
isn't either
2nd: call update_tg_load_avg() during the child update, so we will be
sure to update tg_load_avg_contrib before removing the cfs_rq from the
list

>
> Thanks
> Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
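[Editor's note: the "1st" point can be sketched as keeping the two fields synced whenever load is subtracted outside of PELT. Again a hypothetical simplification, not the actual patch: the snippet assumes a `PELT_MIN_DIVIDER` constant equal to `LOAD_AVG_MAX - 1024`, and uses fair.c's existing `sub_positive()`/`max_t()` helpers out of context.]

```c
/*
 * Sketch only: when removing a detached entity's contribution in
 * fair.c, clamp load_sum from below using the truncated load_avg.
 * Since load_avg is roughly load_sum / divider, keeping
 * load_sum >= load_avg * PELT_MIN_DIVIDER guarantees that
 * load_sum == 0 implies load_avg == 0 as well.
 */
sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
/* keep load_sum consistent with the truncated load_avg */
cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
			     cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
```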
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking
  2021-05-27 10:49                 ` Vincent Guittot
@ 2021-05-27 11:04                   ` Odin Ugedal
  2021-05-27 12:37                     ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Odin Ugedal @ 2021-05-27 11:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
      Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
      Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
      linux-kernel

> 1st: ensure that cfs_rq->avg.load_sum is not null if cfs_rq->avg.load_avg
> isn't either
> 2nd: call update_tg_load_avg() during the child update, so we will be
> sure to update tg_load_avg_contrib before removing the cfs_rq from the
> list

Ahh, yeah, with "1st" that would work. That was my initial
implementation of the change, but I thought that it was better to keep
the logic away from the "hot path". We can verify this in
update_tg_cfs_load(), and then force update_tg_load_avg() inside
__update_blocked_fair() when avg.load_avg is 0 (given that this is the
only place where we can end up in this situation). I can update this
patch to do that instead.

Another solution is to update avg.load_avg inside
__update_blocked_fair() when load_sum is 0, and then propagate that
with update_tg_load_avg(). This removes the logic from the hot path
altogether.

Not sure what the preferred way is. I have not found any other places
where this situation _should_ occur, but who knows..

Odin

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking 2021-05-27 11:04 ` Odin Ugedal @ 2021-05-27 12:37 ` Vincent Guittot 0 siblings, 0 replies; 24+ messages in thread From: Vincent Guittot @ 2021-05-27 12:37 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Thu, 27 May 2021 at 13:04, Odin Ugedal <odin@uged.al> wrote: > > > 1st: ensure that cfs_rq->load_sum is not null whenever cfs_rq->load_avg is not null > > 2nd: call update_tg_load_avg() during the child update so we will be sure > > to update tg_load_avg_contrib before removing the cfs_rq from the list > > Ahh, yeah, with "1st" that would work. Yeah, that was my initial > implementation of the change, but I thought that it was better to keep > the logic away from the "hot path". We can verify this in > update_tg_cfs_load(), and then force update_tg_load_avg() inside For the 1st problem, the way we were updating load_avg and load_sum, we were losing the sync between both values > __update_blocked_fair() when avg.load_avg is 0 (given that this is > the only place where we can end up in this situation). I can update > this patch to do that instead. In fact, the update was already there but not always called (see the patchset I just sent) > > Another solution is to update avg.load_avg > inside __update_blocked_fair() when load_sum is 0, and then propagate > that with update_tg_load_avg(). This removes the logic from the hot > path altogether. > > Not sure what the preferred way is. I have not found any other places > where this situation _should_ occur, but who knows.. > > Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking 2021-05-27 10:49 ` Vincent Guittot 2021-05-27 11:04 ` Odin Ugedal @ 2021-05-27 12:37 ` Odin Ugedal 2021-05-27 12:39 ` Odin Ugedal 2021-05-27 12:49 ` Vincent Guittot 1 sibling, 2 replies; 24+ messages in thread From: Odin Ugedal @ 2021-05-27 12:37 UTC (permalink / raw) To: Vincent Guittot Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel Hi again, Saw your patchset now, and that is essentially the same as I did. I guess you want to keep that patchset instead of me updating this series then? Also, could you take a look at patch 2 and 3 in this series? Should I resend those in a new series, or? Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking 2021-05-27 12:37 ` Odin Ugedal @ 2021-05-27 12:39 ` Odin Ugedal 2021-05-27 12:49 ` Vincent Guittot 1 sibling, 0 replies; 24+ messages in thread From: Odin Ugedal @ 2021-05-27 12:39 UTC (permalink / raw) To: Odin Ugedal Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel > For 1st problem, the way we were updating load_avg and load_sum, we > were losing the sync between both value Yeah, that would be a natural way to fix that. > In fact, the update was already there but not always called (see the > patchset i just sent) Yeah, that was my exact point. Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking 2021-05-27 12:37 ` Odin Ugedal 2021-05-27 12:39 ` Odin Ugedal @ 2021-05-27 12:49 ` Vincent Guittot 1 sibling, 0 replies; 24+ messages in thread From: Vincent Guittot @ 2021-05-27 12:49 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Thu, 27 May 2021 at 14:38, Odin Ugedal <odin@uged.al> wrote: > > Hi again, > > Saw your patchset now, and that is essentially the same as I did. I > guess you want to keep that patchset instead of me updating this > series then? yes > > Also, could you take a look at patch 2 and 3 in this series? Should I Yes, I'm going to have a look at patch 2 and 3 > resend those in a new series, or? > > Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-18 12:51 [PATCH 0/3] sched/fair: Fix load decay issues related to throttling Odin Ugedal 2021-05-18 12:52 ` [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking Odin Ugedal @ 2021-05-18 12:52 ` Odin Ugedal 2021-05-28 14:24 ` Vincent Guittot 2021-05-18 12:52 ` [PATCH 3/3] sched/fair: Fix ascii art by replacing tabs Odin Ugedal 2 siblings, 1 reply; 24+ messages in thread From: Odin Ugedal @ 2021-05-18 12:52 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira Cc: cgroups, linux-kernel, Odin Ugedal This fixes an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When the initial cfs_rq is unthrottled, it will currently only re-add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. This fix changes this behavior to record which cfs_rq's were removed from the list, and re-adds them properly on unthrottle.
This can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") Signed-off-by: Odin Ugedal <odin@uged.al> --- kernel/sched/fair.c | 11 ++++++----- kernel/sched/sched.h | 1 + 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ceda53c2a87a..e7423d658389 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) return false; } -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) +/* Returns 1 if cfs_rq was present in the list and removed */ +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { struct rq *rq = rq_of(cfs_rq); @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; + return 1; } + return 0; } static inline void assert_list_leaf_cfs_rq(struct rq *rq) @@ -4742,9 +4745,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; - - /* Add cfs_rq with already running entity in the list */ - if (cfs_rq->nr_running >= 1) + if (cfs_rq->insert_on_unthrottle) list_add_leaf_cfs_rq(cfs_rq); } @@ -4759,7 +4760,7 @@ static int tg_throttle_down(struct task_group *tg, void *data) /* group is entering throttled state, stop time */ if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); - list_del_leaf_cfs_rq(cfs_rq); + cfs_rq->insert_on_unthrottle = list_del_leaf_cfs_rq(cfs_rq); } cfs_rq->throttle_count++; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
a189bec13729..12a707d99ee6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -602,6 +602,7 @@ struct cfs_rq { u64 throttled_clock_task_time; int throttled; int throttle_count; + int insert_on_unthrottle; struct list_head throttled_list; #endif /* CONFIG_CFS_BANDWIDTH */ #endif /* CONFIG_FAIR_GROUP_SCHED */ -- 2.31.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-18 12:52 ` [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle Odin Ugedal @ 2021-05-28 14:24 ` Vincent Guittot 2021-05-28 15:06 ` Odin Ugedal 0 siblings, 1 reply; 24+ messages in thread From: Vincent Guittot @ 2021-05-28 14:24 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Tue, 18 May 2021 at 14:54, Odin Ugedal <odin@uged.al> wrote: > > This fixes an issue where fairness is decreased since cfs_rq's can > end up not being decayed properly. For two sibling control groups with > the same priority, this can often lead to a load ratio of 99/1 (!!). > > This happen because when a cfs_rq is throttled, all the descendant cfs_rq's > will be removed from the leaf list. When they initial cfs_rq is > unthrottled, it will currently only re add descendant cfs_rq's if they > have one or more entities enqueued. This is not a perfect heuristic. What would be the other condition in addition to the current one :cfs_rq->nr_running >= 1 ? We need to add a cfs_rq in the list if it still contributes to the tg->load_avg and the split of the share. Can't we add a condition for this instead of adding a new field ? > > This fix change this behavior to save what cfs_rq's was removed from the > list, and readds them properly on unthrottle. 
> > Can often lead to sutiations like this for equally weighted control > groups: > > $ ps u -C stress > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 > root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 > > Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") > Signed-off-by: Odin Ugedal <odin@uged.al> > --- > kernel/sched/fair.c | 11 ++++++----- > kernel/sched/sched.h | 1 + > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ceda53c2a87a..e7423d658389 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > return false; > } > > -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > +/* Returns 1 if cfs_rq was present in the list and removed */ > +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > { > if (cfs_rq->on_list) { > struct rq *rq = rq_of(cfs_rq); > @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > > list_del_rcu(&cfs_rq->leaf_cfs_rq_list); > cfs_rq->on_list = 0; > + return 1; > } > + return 0; > } > > static inline void assert_list_leaf_cfs_rq(struct rq *rq) > @@ -4742,9 +4745,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > if (!cfs_rq->throttle_count) { > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > cfs_rq->throttled_clock_task; > - > - /* Add cfs_rq with already running entity in the list */ > - if (cfs_rq->nr_running >= 1) > + if (cfs_rq->insert_on_unthrottle) > list_add_leaf_cfs_rq(cfs_rq); > } > > @@ -4759,7 +4760,7 @@ static int tg_throttle_down(struct task_group *tg, void *data) > /* group is entering throttled state, stop time */ > if (!cfs_rq->throttle_count) { > cfs_rq->throttled_clock_task = rq_clock_task(rq); > - list_del_leaf_cfs_rq(cfs_rq); > + cfs_rq->insert_on_unthrottle = 
list_del_leaf_cfs_rq(cfs_rq); > } > cfs_rq->throttle_count++; > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index a189bec13729..12a707d99ee6 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -602,6 +602,7 @@ struct cfs_rq { > u64 throttled_clock_task_time; > int throttled; > int throttle_count; > + int insert_on_unthrottle; > struct list_head throttled_list; > #endif /* CONFIG_CFS_BANDWIDTH */ > #endif /* CONFIG_FAIR_GROUP_SCHED */ > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-28 14:24 ` Vincent Guittot @ 2021-05-28 15:06 ` Odin Ugedal 2021-05-28 15:27 ` Vincent Guittot 0 siblings, 1 reply; 24+ messages in thread From: Odin Ugedal @ 2021-05-28 15:06 UTC (permalink / raw) To: Vincent Guittot Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel Hi, > What would be the other condition in addition to the current one > :cfs_rq->nr_running >= 1 ? The condition is that if it has load, we should add it (I don't have 100% control on util_avg and runnable_avg tho.). Using "!cfs_rq_is_decayed()" is another way, but imo. that is a bit overkill. > We need to add a cfs_rq in the list if it still contributes to the > tg->load_avg and the split of the share. Can't we add a condition for > this instead of adding a new field ? Yes, using cfs_rq->tg_load_avg_contrib as below would also work the same way. I still think being explicit that we insert it if we have removed it is cleaner in a way, as it makes it consistent with the other use of list_add_leaf_cfs_rq() and list_del_leaf_cfs_rq(), but that is about preference I guess. I do however think that using tg_load_avg_contrib will work just fine, as it should always be positive in case the cfs_rq has some load. I can resend v2 of this patch using this instead; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ad7556f99b4a..969ae7f930f5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4720,7 +4720,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) cfs_rq->throttled_clock_task; /* Add cfs_rq with already running entity in the list */ - if (cfs_rq->nr_running >= 1) + if (cfs_rq->tg_load_avg_contrib) list_add_leaf_cfs_rq(cfs_rq); } ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-28 15:06 ` Odin Ugedal @ 2021-05-28 15:27 ` Vincent Guittot 2021-05-29 9:33 ` Odin Ugedal 0 siblings, 1 reply; 24+ messages in thread From: Vincent Guittot @ 2021-05-28 15:27 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Fri, 28 May 2021 at 17:07, Odin Ugedal <odin@uged.al> wrote: > > Hi, > > > What would be the other condition in addition to the current one > > :cfs_rq->nr_running >= 1 ? > > The condition is that if it has load, we should add it (I don't have > 100% control on util_avg and runnable_avg tho.). Using > "!cfs_rq_is_decayed()" is another way, but imo. that is a bit > overkill. normally tg_load_avg_contrib should be null when cfs_rq_is_decayed() > > > We need to add a cfs_rq in the list if it still contributes to the > > tg->load_avg and the split of the share. Can't we add a condition for > > this instead of adding a new field ? > > Yes, using cfs_rq->tg_load_avg_contrib as below would also work the > same way. I still think being explicit that we insert it if we have > removed it is cleaner in a way, as it makes it consistent with the > other use of list_add_leaf_cfs_rq() and list_del_leaf_cfs_rq(), but The reason of this list is to ensure that the load of all cfs_rq are periodically updated as it is then used to share the runtime between groups so we should keep to use the rule whenever possible. > that is about preference I guess. I do however think that using > tg_load_avg_contrib will work just fine, as it should always be > positive in case the cfs_rq has some load. 
I can resend v2 of this > patch using this instead; > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ad7556f99b4a..969ae7f930f5 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4720,7 +4720,7 @@ static int tg_unthrottle_up(struct task_group > *tg, void *data) > cfs_rq->throttled_clock_task; > > /* Add cfs_rq with already running entity in the list */ > - if (cfs_rq->nr_running >= 1) > + if (cfs_rq->tg_load_avg_contrib) we probably need to keep (cfs_rq->nr_running >= 1) as we can have a case where tg_load_avg_contrib is null but a task is enqueued > list_add_leaf_cfs_rq(cfs_rq); > } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-28 15:27 ` Vincent Guittot @ 2021-05-29 9:33 ` Odin Ugedal 2021-05-31 12:14 ` Vincent Guittot 0 siblings, 1 reply; 24+ messages in thread From: Odin Ugedal @ 2021-05-29 9:33 UTC (permalink / raw) To: Vincent Guittot Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel Hi, > normally tg_load_avg_contrib should be null when cfs_rq_is_decayed() Yeah, I think that is an ok assumption of how it _should_ work (given the other patches in flight are merged). > The reason of this list is to ensure that the load of all cfs_rq are > periodically updated as it is then used to share the runtime between > groups so we should keep to use the rule whenever possible. Yeah, right. > we probably need to keep (cfs_rq->nr_running >= 1) as we can have case > where tg_load_avg_contrib is null but a task is enqueued Yeah, there is probably a chance of enqueuing a task without any load, and then a parent gets throttled. So (cfs_rq->tg_load_avg_contrib || cfs_rq->nr_running) is probably the way to go if we want to avoid a new field. Will resend a patch with that instead. In case the new field is the main issue with the original solution, we could also change the on_list int to have three modes like; NO, YES, THROTTLED/PAUSED, but that would require a bigger rewrite of the other logic, so probably outside the scope of this patch. Thanks Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle 2021-05-29 9:33 ` Odin Ugedal @ 2021-05-31 12:14 ` Vincent Guittot 0 siblings, 0 replies; 24+ messages in thread From: Vincent Guittot @ 2021-05-31 12:14 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Sat, 29 May 2021 at 11:34, Odin Ugedal <odin@uged.al> wrote: > > Hi, > > > normally tg_load_avg_contrib should be null when cfs_rq_is_decayed() > > Yeah, I think that is an ok assumption of how it _should_ work (given > the other patches in flight are merged). > > > The reason of this list is to ensure that the load of all cfs_rq are > > periodically updated as it is then used to share the runtime between > > groups so we should keep to use the rule whenever possible. > > Yeah, right. > > > we probably need to keep (cfs_rq->nr_running >= 1) as we can have case > > where tg_load_avg_contrib is null but a task is enqueued > > Yeah, there is probably a chance of enqueuing a task without any load, > and then a parent gets throttled. > So (cfs_rq->tg_load_avg_contrib || cfs_rq->nr_running) is probably the > way to go if we want to avoid > a new field. Will resend a patch with that instead. Thanks > > In case the new field is the main issue with the original solution, we > could also change the on_list int to have three modes like; NO, YES, > THROTTLED/PAUSED, but that would require a bigger rewrite of the other > logic, so probably outside the scope of this patch. > > Thanks > Odin ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/3] sched/fair: Fix ascii art by replacing tabs 2021-05-18 12:51 [PATCH 0/3] sched/fair: Fix load decay issues related to throttling Odin Ugedal 2021-05-18 12:52 ` [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking Odin Ugedal 2021-05-18 12:52 ` [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle Odin Ugedal @ 2021-05-18 12:52 ` Odin Ugedal 2021-05-27 13:27 ` Vincent Guittot 2021-06-01 14:04 ` [tip: sched/core] " tip-bot2 for Odin Ugedal 2 siblings, 2 replies; 24+ messages in thread From: Odin Ugedal @ 2021-05-18 12:52 UTC (permalink / raw) To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira Cc: cgroups, linux-kernel, Odin Ugedal When using something other than 8 spaces per tab, this ascii art makes no sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> --- kernel/sched/fair.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e7423d658389..c872e38ec32b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3142,7 +3142,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (1) - * \Sum grq->load.weight + * \Sum grq->load.weight * * Now, because computing that sum is prohibitively expensive to compute (been * there, done that) we approximate it with this average stuff.
The average @@ -3156,7 +3156,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->avg.load_avg * ge->load.weight = ------------------------------ (3) - * tg->load_avg + * tg->load_avg * * Where: tg->load_avg ~= \Sum grq->avg.load_avg * @@ -3172,7 +3172,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- = tg->weight (4) - * grp->load.weight + * grp->load.weight * * That is, the sum collapses because all other CPUs are idle; the UP scenario. * @@ -3191,7 +3191,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (6) - * tg_load_avg' + * tg_load_avg' * * Where: * -- 2.31.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] sched/fair: Fix ascii art by relpacing tabs 2021-05-18 12:52 ` [PATCH 3/3] sched/fair: Fix ascii art by relpacing tabs Odin Ugedal @ 2021-05-27 13:27 ` Vincent Guittot 2021-06-01 14:04 ` [tip: sched/core] " tip-bot2 for Odin Ugedal 1 sibling, 0 replies; 24+ messages in thread From: Vincent Guittot @ 2021-05-27 13:27 UTC (permalink / raw) To: Odin Ugedal Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP), linux-kernel On Tue, 18 May 2021 at 14:55, Odin Ugedal <odin@uged.al> wrote: > > When using something other than 8 spaces per tab, this ascii art > makes not sense, and the reader might end up wondering what this > advanced equation "is". Documentation/process/coding-style.rst says to use 8 characters for tab so we should not really consider other tab value That being said the numerators and other parts of the equation use spaces whereas denominators use tabs. so using space everywhere looks good for me Acked-by: Vincent Guittot <vincent.guittot@linaro.org> > > Signed-off-by: Odin Ugedal <odin@uged.al> > --- > kernel/sched/fair.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e7423d658389..c872e38ec32b 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3142,7 +3142,7 @@ void reweight_task(struct task_struct *p, int prio) > * > * tg->weight * grq->load.weight > * ge->load.weight = ----------------------------- (1) > - * \Sum grq->load.weight > + * \Sum grq->load.weight > * > * Now, because computing that sum is prohibitively expensive to compute (been > * there, done that) we approximate it with this average stuff. 
The average > @@ -3156,7 +3156,7 @@ void reweight_task(struct task_struct *p, int prio) > * > * tg->weight * grq->avg.load_avg > * ge->load.weight = ------------------------------ (3) > - * tg->load_avg > + * tg->load_avg > * > * Where: tg->load_avg ~= \Sum grq->avg.load_avg > * > @@ -3172,7 +3172,7 @@ void reweight_task(struct task_struct *p, int prio) > * > * tg->weight * grq->load.weight > * ge->load.weight = ----------------------------- = tg->weight (4) > - * grp->load.weight > + * grp->load.weight > * > * That is, the sum collapses because all other CPUs are idle; the UP scenario. > * > @@ -3191,7 +3191,7 @@ void reweight_task(struct task_struct *p, int prio) > * > * tg->weight * grq->load.weight > * ge->load.weight = ----------------------------- (6) > - * tg_load_avg' > + * tg_load_avg' > * > * Where: > * > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 24+ messages in thread
* [tip: sched/core] sched/fair: Fix ascii art by relpacing tabs 2021-05-18 12:52 ` [PATCH 3/3] sched/fair: Fix ascii art by relpacing tabs Odin Ugedal 2021-05-27 13:27 ` Vincent Guittot @ 2021-06-01 14:04 ` tip-bot2 for Odin Ugedal 1 sibling, 0 replies; 24+ messages in thread From: tip-bot2 for Odin Ugedal @ 2021-06-01 14:04 UTC (permalink / raw) To: linux-tip-commits Cc: Odin Ugedal, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 08f7c2f4d0e9f4283f5796b8168044c034a1bfcb Gitweb: https://git.kernel.org/tip/08f7c2f4d0e9f4283f5796b8168044c034a1bfcb Author: Odin Ugedal <odin@uged.al> AuthorDate: Tue, 18 May 2021 14:52:02 +02:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Tue, 01 Jun 2021 16:00:11 +02:00 sched/fair: Fix ascii art by relpacing tabs When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al --- kernel/sched/fair.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 161b92a..a2c30e5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3093,7 +3093,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (1) - * \Sum grq->load.weight + * \Sum grq->load.weight * * Now, because computing that sum is prohibitively expensive to compute (been * there, done that) we approximate it with this average stuff. 
The average @@ -3107,7 +3107,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->avg.load_avg * ge->load.weight = ------------------------------ (3) - * tg->load_avg + * tg->load_avg * * Where: tg->load_avg ~= \Sum grq->avg.load_avg * @@ -3123,7 +3123,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- = tg->weight (4) - * grp->load.weight + * grp->load.weight * * That is, the sum collapses because all other CPUs are idle; the UP scenario. * @@ -3142,7 +3142,7 @@ void reweight_task(struct task_struct *p, int prio) * * tg->weight * grq->load.weight * ge->load.weight = ----------------------------- (6) - * tg_load_avg' + * tg_load_avg' * * Where: * ^ permalink raw reply related [flat|nested] 24+ messages in thread
end of thread, newest: ~2021-06-01 14:05 UTC Thread overview: 24+ messages -- 2021-05-18 12:51 [PATCH 0/3] sched/fair: Fix load decay issues related to throttling Odin Ugedal 2021-05-18 12:52 ` [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking Odin Ugedal 2021-05-25 9:58 ` Vincent Guittot 2021-05-25 10:33 ` Odin Ugedal 2021-05-25 14:30 ` Vincent Guittot 2021-05-26 10:50 ` Vincent Guittot 2021-05-27 7:50 ` Odin Ugedal 2021-05-27 9:35 ` Vincent Guittot 2021-05-27 9:45 ` Odin Ugedal 2021-05-27 10:49 ` Vincent Guittot 2021-05-27 11:04 ` Odin Ugedal 2021-05-27 12:37 ` Vincent Guittot 2021-05-27 12:37 ` Odin Ugedal 2021-05-27 12:39 ` Odin Ugedal 2021-05-27 12:49 ` Vincent Guittot 2021-05-18 12:52 ` [PATCH 2/3] sched/fair: Correctly insert cfs_rq's to list on unthrottle Odin Ugedal 2021-05-28 14:24 ` Vincent Guittot 2021-05-28 15:06 ` Odin Ugedal 2021-05-28 15:27 ` Vincent Guittot 2021-05-29 9:33 ` Odin Ugedal 2021-05-31 12:14 ` Vincent Guittot 2021-05-18 12:52 ` [PATCH 3/3] sched/fair: Fix ascii art by replacing tabs Odin Ugedal 2021-05-27 13:27 ` Vincent Guittot 2021-06-01 14:04 ` [tip: sched/core] " tip-bot2 for Odin Ugedal