Hello.

(Getting back to this after some more analysis.)

On Wed, Oct 13, 2021 at 04:26:43PM +0200, Michal Koutný wrote:
> On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot wrote:
> > This seems to closes the race window in your case but this could still
> > happen AFAICT.
>
> You seem to be right.
> Hopefully, I'll be able to collect more data evaluating this.

I've observed that the window between unregister_fair_sched_group() and
free_fair_sched_group() is commonly around 15 ms (based on kprobe
tracing). I have a reproducer (attached) that can hit this window quite
easily after tuning. I can observe consequences of it even with a recent
5.15 kernel. (And I also have reports of real-world workloads failing due
to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle").)

My original patch was really an uninformed attempt given the length of
the window.

[snip]

On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal wrote:
> Ref. your comment about reverting a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), I
> think that is fine as long as we revert the commit it fixes as well,
> to avoid a regression of that (but yeah, that regression itself is
> less bad than your discovery).

I say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
update_blocked_averages()") (it solves reported performance issues, it's
way too old :-).

> set cfs_rq->on_list=2 inside that lock under your code change? If we
> then treat on_list=2 as "not on list, and do not add"?

The possibilities for the current problem:

1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list
   on unthrottle") and its fixups.
   (Not exclusive with the other suggestions, rather a stop-gap for the
   time being.)

2) Don't add offlined task_groups into the undecayed list.
   - Your proposal with the overloaded on_list=2 could serve as a mark
     of that, but it's a hack IMO.
   - The proper way (tm) would be to use css_tryget_online() and
     css_put() when dealing with the list (my favorite at the moment).

3) Narrow the race window dramatically,
   - either by moving the list removal from unregister_fair_sched_group()
     to free_fair_sched_group(),
   - or by using list_empty(tg->list) as an indicator of whether we're
     working with an onlined task_group. (Won't work for an RCU list.)

4) Rework how throttled load is handled (hand waving).
   - There is remove_entity_load_avg() that moves the load to the parent
     upon final removal. Maybe it could be generalized for temporary
     removals by throttling (so that unthrottling could again add only
     non-empty cfs_rqs to the list and undecayed load won't skew
     fairness).
   - Or the way of [1].

5) Opinions?

Thanks,
Michal

[1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@mail.gmail.com/
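P.S. For reference, the ~15 ms window was measured roughly along these
lines with tracefs kprobes (assuming root and tracefs at
/sys/kernel/tracing; the probe labels "unreg" and "freefsg" are
arbitrary, and free_fair_sched_group may need an offset or a different
symbol if it gets inlined on your config):

```
# cd /sys/kernel/tracing
# echo 'p:unreg unregister_fair_sched_group' >> kprobe_events
# echo 'p:freefsg free_fair_sched_group'     >> kprobe_events
# echo 1 > events/kprobes/enable
# ... exercise cgroup creation/removal (e.g. the attached reproducer),
#     then compare per-event timestamps:
# cat trace
```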