* [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  Date: 2017-05-09 16:17 UTC
  From: Tejun Heo <tj@kernel.org>
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(total number of CPU cgroups)
which isn't scalable at all.

The next patch will make rq->leaf_cfs_rq_list only contain the cfs_rqs
which are currently active.  In preparation, this patch converts users
which need to traverse all cfs_rqs to use the task_groups list instead.

The task_groups list is protected by its own lock and allows RCU
protected traversal, and the order of operations guarantees that all
online cfs_rqs will be visited; however, holding rq->lock won't protect
against iterating an already unregistered cfs_rq.  The operations of
the two users that get converted - update_runtime_enabled() and
unthrottle_offline_cfs_rqs() - should be safe to perform on already
dead cfs_rqs, so adding rcu read protection around them should be
enough.

Note that print_cfs_stats() is not converted.  The next patch will
change its behavior to print out only active cfs_rqs, which is
intended as there's not much point in printing out idle cfs_rqs.

v2: Dropped strong synchronization around removal and left
    print_cfs_stats() unchanged as suggested by Peterz.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Peter,

Updated the patch - pasted in your version and updated the description
accordingly.  Please feel free to use this one or whichever you like.
Based on mainline and stable is cc'd.

Thanks.

 kernel/sched/fair.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4644,22 +4644,32 @@ static void destroy_cfs_bandwidth(struct
 static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 		raw_spin_lock(&cfs_b->lock);
 		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
 		raw_spin_unlock(&cfs_b->lock);
 	}
+	rcu_read_unlock();
 }
 
 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		if (!cfs_rq->runtime_enabled)
 			continue;
 
@@ -4677,6 +4687,7 @@ static void __maybe_unused unthrottle_of
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
 	}
+	rcu_read_unlock();
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
* [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-09 16:18 UTC
  From: Tejun Heo <tj@kernel.org>
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.

This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with the CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume several tens of
percent of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Just refreshed on top of the first patch.
 kernel/sched/fair.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
 }
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+				 leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
 {
 }
 
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
 	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
 			update_load_avg(se, 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-10 6:50 UTC
  From: Vincent Guittot
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> live cfs_rqs which have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(# total cgroups) which isn't
> scalable at all.

Dietmar raised a similar optimization in the past. The only question was:
what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on the
wake up path? Have you done some measurements?

>
> This shows up as a small CPU consumption and scheduling latency
> increase in the load balancing path in systems with the CPU controller
> enabled across most cgroups.  In an edge case where temporary cgroups
> were leaking, this caused the kernel to consume several tens of
> percent of CPU cycles running update_blocked_averages(), each run
> taking multiple millisecs.
>
> This patch fixes the issue by taking empty and fully decayed cfs_rqs
> off the rq->leaf_cfs_rq_list.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: stable@vger.kernel.org
> ---
> Just refreshed on top of the first patch.
>
>  kernel/sched/fair.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
>  }
>
>  /* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> -	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> +	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
> +				 leaf_cfs_rq_list)
>
>  /* Do the two (enqueued) entities belong to the same group ? */
>  static inline struct cfs_rq *
> @@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
>  {
>  }
>
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
>  	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
> @@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
>  static void update_blocked_averages(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	struct cfs_rq *cfs_rq;
> +	struct cfs_rq *cfs_rq, *pos;
>  	struct rq_flags rf;
>
>  	rq_lock_irqsave(rq, &rf);
> @@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
>  	 * Iterates the task_group tree in a bottom up fashion, see
>  	 * list_add_leaf_cfs_rq() for details.
>  	 */
> -	for_each_leaf_cfs_rq(rq, cfs_rq) {
> +	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
>  		struct sched_entity *se;
>
>  		/* throttled entities do not contribute to load */
> @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
>  		se = cfs_rq->tg->se[cpu];
>  		if (se && !skip_blocked_update(se))
>  			update_load_avg(se, 0);
> +
> +		/*
> +		 * There can be a lot of idle CPU cgroups.  Don't let fully
> +		 * decayed cfs_rqs linger on the list.
> +		 */
> +		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> +		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> +			list_del_leaf_cfs_rq(cfs_rq);

list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
By removing a cfs_rq, can't we break this assumption in some cases?

Regards,
Vincent

>  	}
>  	rq_unlock_irqrestore(rq, &rf);
>  }
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-10 14:44 UTC
  From: Tejun Heo
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
> On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> > live cfs_rqs which have ever been active on the CPU; unfortunately,
> > this makes update_blocked_averages() O(# total cgroups) which isn't
> > scalable at all.
>
> Dietmar raised a similar optimization in the past. The only question was:
> what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on the
> wake up path? Have you done some measurements?

Didn't do a perf test yet but it's several more branches and a local
list operation on enqueue, which is already pretty expensive, vs. load
balance being O(total number of cgroups on the system).

Anyways, I'll do some hackbench tests with several levels of layering.

> > @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
> >  		se = cfs_rq->tg->se[cpu];
> >  		if (se && !skip_blocked_update(se))
> >  			update_load_avg(se, 0);
> > +
> > +		/*
> > +		 * There can be a lot of idle CPU cgroups.  Don't let fully
> > +		 * decayed cfs_rqs linger on the list.
> > +		 */
> > +		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> > +		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> > +			list_del_leaf_cfs_rq(cfs_rq);
>
> list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
> By removing a cfs_rq, can't we break this assumption in some cases?
We queue a cfs_rq on the leaf list when an se is queued on that cfs_rq
for the first time, so queueing can happen in any order; otherwise,
we'd simply be doing list_add_tail().  AFAICS, removing and re-adding
shouldn't break anything if the code wasn't broken before.

Thanks.

-- 
tejun
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-10 15:55 UTC
  From: Tejun Heo
  To: ying.huang, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 10, 2017 at 10:44:14AM -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
> > On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> > > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> > > live cfs_rqs which have ever been active on the CPU; unfortunately,
> > > this makes update_blocked_averages() O(# total cgroups) which isn't
> > > scalable at all.
> >
> > Dietmar raised a similar optimization in the past. The only question was:
> > what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on the
> > wake up path? Have you done some measurements?
>
> Didn't do a perf test yet but it's several more branches and a local
> list operation on enqueue, which is already pretty expensive, vs. load
> balance being O(total number of cgroups on the system).
>
> Anyways, I'll do some hackbench tests with several levels of layering.

Oh, BTW, do we still need bc4278987e38 ("sched/fair: Fix FTQ noise bench
regression") with this patch?  That patch looks like it's just cutting
down on the body of the O(#cgroups) loop.

Ying, can you verify whether the following patch on top of the current
linus#master 56868a460b83 also fixes the regression that you saw?
Index: work/kernel/sched/fair.c
===================================================================
--- work.orig/kernel/sched/fair.c
+++ work/kernel/sched/fair.c
@@ -372,6 +372,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+				 leaf_cfs_rq_list)
+
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
@@ -466,6 +470,9 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return NULL;
@@ -3170,36 +3177,6 @@ static inline int propagate_entity_load_
 	return 1;
 }
 
-/*
- * Check if we need to update the load and the utilization of a blocked
- * group_entity:
- */
-static inline bool skip_blocked_update(struct sched_entity *se)
-{
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-
-	/*
-	 * If sched_entity still have not zero load or utilization, we have to
-	 * decay it:
-	 */
-	if (se->avg.load_avg || se->avg.util_avg)
-		return false;
-
-	/*
-	 * If there is a pending propagation, we have to update the load and
-	 * the utilization of the sched_entity:
-	 */
-	if (gcfs_rq->propagate_avg)
-		return false;
-
-	/*
-	 * Otherwise, the load and the utilization of the sched_entity is
-	 * already zero and there is no pending propagation, so it will be a
-	 * waste of time to try to decay it:
-	 */
-	return true;
-}
-
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
@@ -4644,22 +4621,32 @@ static void destroy_cfs_bandwidth(struct
 static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 		raw_spin_lock(&cfs_b->lock);
 		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
 		raw_spin_unlock(&cfs_b->lock);
 	}
+	rcu_read_unlock();
 }
 
 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		if (!cfs_rq->runtime_enabled)
 			continue;
 
@@ -4677,6 +4664,7 @@ static void __maybe_unused unthrottle_of
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
 	}
+	rcu_read_unlock();
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
@@ -6973,7 +6961,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6983,9 +6971,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct sched_entity *se;
-
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		/* throttled entities do not contribute to load */
 		if (throttled_hierarchy(cfs_rq))
 			continue;
@@ -6993,10 +6979,17 @@ static void update_blocked_averages(int
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 
-		/* Propagate pending load changes to the parent, if any: */
-		se = cfs_rq->tg->se[cpu];
-		if (se && !skip_blocked_update(se))
-			update_load_avg(se, 0);
+		/* Propagate pending load changes to the parent */
+		if (cfs_rq->tg->se[cpu])
+			update_load_avg(cfs_rq->tg->se[cpu], 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-11 7:02 UTC
  From: Vincent Guittot
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 10 May 2017 at 16:44, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
>> On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
>> > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
>> > live cfs_rqs which have ever been active on the CPU; unfortunately,
>> > this makes update_blocked_averages() O(# total cgroups) which isn't
>> > scalable at all.
>>
>> Dietmar raised a similar optimization in the past. The only question was:
>> what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on the
>> wake up path? Have you done some measurements?
>
> Didn't do a perf test yet but it's several more branches and a local
> list operation on enqueue, which is already pretty expensive, vs. load
> balance being O(total number of cgroups on the system).
>
> Anyways, I'll do some hackbench tests with several levels of layering.
>
>> > @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
>> > 		se = cfs_rq->tg->se[cpu];
>> > 		if (se && !skip_blocked_update(se))
>> > 			update_load_avg(se, 0);
>> > +
>> > +		/*
>> > +		 * There can be a lot of idle CPU cgroups.  Don't let fully
>> > +		 * decayed cfs_rqs linger on the list.
>> > +		 */
>> > +		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
>> > +		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
>> > +			list_del_leaf_cfs_rq(cfs_rq);
>>
>> list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
>> By removing a cfs_rq, can't we break this assumption in some cases?
>
> We queue a cfs_rq on the leaf list when an se is queued on that
> cfs_rq for the first time, so queueing can happen in any order;

Sorry, what I mean is:
When the group entity of a cfs_rq is enqueued, we are sure that either
the parent is already enqueued or it will be enqueued in the same
sequence.  We must be sure that no other branch will be enqueued in the
middle of the sequence and reset tmp_alone_branch.
This is true with the current implementation, but I wondered whether it
could happen if we del/add the cfs_rq out of order.

That said, I haven't found a use case that breaks the sequence.

> otherwise, we'd simply be doing list_add_tail().  AFAICS, removing and
> re-adding shouldn't break anything if the code wasn't broken before.
>
> Thanks.
>
> -- 
> tejun
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-12 13:16 UTC
  From: Tejun Heo
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, May 11, 2017 at 09:02:22AM +0200, Vincent Guittot wrote:
> Sorry, what I mean is:
> When the group entity of a cfs_rq is enqueued, we are sure that either
> the parent is already enqueued or it will be enqueued in the same
> sequence.  We must be sure that no other branch will be enqueued in the
> middle of the sequence and reset tmp_alone_branch.
> This is true with the current implementation, but I wondered whether it
> could happen if we del/add the cfs_rq out of order.
>
> That said, I haven't found a use case that breaks the sequence.

Hmm... a cfs_rq can be removed from the leaf list iff it's empty and
dequeued, and enqueueing is always from top down.  If an ancestor is
already enqueued, it's guaranteed to be on the leaf list; otherwise,
it's guaranteed to be enqueued beforehand and thus put on the leaf
list too.  I think it should be fine.

Thanks.

-- 
tejun
* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-12 14:36 UTC
  From: Vincent Guittot
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 12 May 2017 at 15:16, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, May 11, 2017 at 09:02:22AM +0200, Vincent Guittot wrote:
>> Sorry, what I mean is:
>> When the group entity of a cfs_rq is enqueued, we are sure that either
>> the parent is already enqueued or it will be enqueued in the same
>> sequence.  We must be sure that no other branch will be enqueued in the
>> middle of the sequence and reset tmp_alone_branch.
>> This is true with the current implementation, but I wondered whether it
>> could happen if we del/add the cfs_rq out of order.
>>
>> That said, I haven't found a use case that breaks the sequence.
>
> Hmm... a cfs_rq can be removed from the leaf list iff it's empty and
> dequeued, and enqueueing is always from top down.  If an ancestor is
> already enqueued, it's guaranteed to be on the leaf list; otherwise,
> it's guaranteed to be enqueued beforehand and thus put on the leaf
> list too.  I think it should be fine.

I agree

FWIW Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Thanks.
>
> -- 
> tejun
* [PATCH v3 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  Date: 2017-05-10 17:36 UTC
  From: Tejun Heo
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups) which isn't
scalable at all.

This shows up as a small CPU consumption and scheduling latency
increase in the load balancing path in systems with the CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume several tens of
percent of CPU cycles running update_blocked_averages(), each run
taking multiple millisecs.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

v3: The debug path still uses for_each_leaf_cfs_rq().  Keep it around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Hello,

The previous patch caused a build error if SCHED_DEBUG is set, as that
path still uses for_each_leaf_cfs_rq().  Patch updated to keep it
around.

Thanks.
 kernel/sched/fair.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -372,6 +372,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+				 leaf_cfs_rq_list)
+
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
@@ -466,6 +470,9 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return NULL;
@@ -6984,7 +6991,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6994,7 +7001,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7008,6 +7015,14 @@ static void update_blocked_averages(int
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
 			update_load_avg(se, 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }
* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  Date: 2017-05-24 23:40 UTC
  From: Tim Chen
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team, mohini.narkhede

On 05/09/2017 09:17 AM, Tejun Heo wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> live cfs_rqs which have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(total number of CPU cgroups)
> which isn't scalable at all.
>
> The next patch will make rq->leaf_cfs_rq_list only contain the cfs_rqs
> which are currently active.  In preparation, this patch converts users
> which need to traverse all cfs_rqs to use the task_groups list instead.
>
> The task_groups list is protected by its own lock and allows RCU
> protected traversal, and the order of operations guarantees that all
> online cfs_rqs will be visited; however, holding rq->lock won't protect
> against iterating an already unregistered cfs_rq.  The operations of
> the two users that get converted - update_runtime_enabled() and
> unthrottle_offline_cfs_rqs() - should be safe to perform on already
> dead cfs_rqs, so adding rcu read protection around them should be
> enough.
>
> Note that print_cfs_stats() is not converted.  The next patch will
> change its behavior to print out only active cfs_rqs, which is
> intended as there's not much point in printing out idle cfs_rqs.
>
> v2: Dropped strong synchronization around removal and left
>     print_cfs_stats() unchanged as suggested by Peterz.
>

Tejun,

We did some preliminary testing of this patchset for a well-known
database benchmark on a 4-socket Skylake server system.  It provides a
3.7% throughput boost, which is significant for this benchmark.

Thanks.

Tim
* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  Date: 2017-05-25 14:39 UTC
  From: Tejun Heo
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team, mohini.narkhede

On Wed, May 24, 2017 at 04:40:34PM -0700, Tim Chen wrote:
> We did some preliminary testing of this patchset for a well-known
> database benchmark on a 4-socket Skylake server system.  It provides a
> 3.7% throughput boost, which is significant for this benchmark.

That's great to hear.  Yeah, the walk can be noticeably expensive even
with a moderate number of cgroups.  Thanks for sharing the result.

-- 
tejun
* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  2017-05-25 14:39 ` Tejun Heo
@ 2017-05-26 23:04   ` Tim Chen
  0 siblings, 0 replies; 12+ messages in thread

From: Tim Chen @ 2017-05-26 23:04 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
    kernel-team, mohini.narkhede

On 05/25/2017 07:39 AM, Tejun Heo wrote:
> On Wed, May 24, 2017 at 04:40:34PM -0700, Tim Chen wrote:
>> We did some preliminary testing of this patchset for a well
>> known database benchmark on a 4 socket Skylake server system.
>> It provides a 3.7% throughput boost which is significant for
>> this benchmark.
>
> That's great to hear.  Yeah, the walk can be noticeably expensive even
> with moderate number of cgroups.  Thanks for sharing the result.

Yes, the walk in update_blocked_averages has bad scaling properties as
it iterates over *all* of the cfs_rq leaf entries, making it very
expensive.  It consumes 11.7% of our cpu cycles for this benchmark when
CGROUP is on.  Your patchset skips unused cgroups and reduces the
overhead to 10.4%.  A CPU cycles profile is attached below for your
reference.

The scheduler's frequent update of cgroup load averages, and having to
iterate all the leaf entries for each load balance, make
update_blocked_averages one of the most expensive functions in the
system and CGROUP costly.  Without CGROUP, schedule only costs 3.3% of
cpu cycles vs 16.4% with CGROUP turned on.  Your patchset does reduce
it to 14.9%.  This benchmark has thousands of running tasks, so it puts
a good deal of stress on the scheduler.
Tim

CPU cycles profile:

4.11 Before your patchset with CGROUP:
---------------------------------------
    16.42%  0.03%  280  [kernel.vmlinux]  [k] schedule
            |
            --16.39%--schedule
                      |
                      --16.31%--__sched_text_start
                                |
                                |--12.85%--pick_next_task_fair
                                |          |
                                |          --11.71%--update_blocked_averages
                                |                    |
                                |                    --5.00%--update_load_avg
                                |
                                |--2.04%--finish_task_switch
                                |          |
                                |          |--0.85%--ret_from_intr
                                |          |         |
                                |          |         --0.85%--do_IRQ
                                |          |
                                |          --0.75%--apic_timer_interrupt
                                |                    |
                                |                    --0.75%--smp_apic_timer_interrupt
                                |                              |
                                |                              --0.55%--irq_exit
                                |                                        |
                                |                                        --0.55%--__do_softirq
                                |
                                --0.51%--deactivate_task

4.11 After your patchset with CGROUP:
-------------------------------------
    14.90%  0.04%  337  [kernel.vmlinux]  [k] schedule
            |
            --14.86%--schedule
                      |
                      --14.78%--__sched_text_start
                                |
                                |--11.51%--pick_next_task_fair
                                |          |
                                |          --10.37%--update_blocked_averages
                                |                    |
                                |                    --4.55%--update_load_avg
                                |
                                |--1.79%--finish_task_switch
                                |          |
                                |          |--0.77%--ret_from_intr
                                |          |         |
                                |          |         --0.77%--do_IRQ
                                |          |
                                |          --0.65%--apic_timer_interrupt
                                |                    |
                                |                    --0.65%--smp_apic_timer_interrupt
                                |
                                --0.53%--deactivate_task

4.11 with No CGROUP:
--------------------
    3.33%  0.04%  336  [kernel.vmlinux]  [k] schedule
           |
           --3.29%--schedule
                    |
                    --3.19%--__sched_text_start
                             |
                             --1.45%--pick_next_task_fair
                                      |
                                      --1.15%--load_balance
                                               |
                                               --0.87%--find_busiest_group
                                                        |
                                                        --0.82%--update_sd_lb_stats

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2017-05-27  1:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-09 16:17 [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tejun Heo
2017-05-09 16:18 ` [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path Tejun Heo
2017-05-10  6:50   ` Vincent Guittot
2017-05-10 14:44     ` Tejun Heo
2017-05-10 15:55       ` Tejun Heo
2017-05-11  7:02         ` Vincent Guittot
2017-05-12 13:16           ` Tejun Heo
2017-05-12 14:36             ` Vincent Guittot
2017-05-10 17:36   ` [PATCH v3 " Tejun Heo
2017-05-24 23:40 ` [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tim Chen
2017-05-25 14:39   ` Tejun Heo
2017-05-26 23:04     ` Tim Chen