* [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path @ 2019-02-06 16:14 Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot This patchset adds the missing pieces in the management of leaf_cfs_rq_list to ensure that cfs_rqs are always correctly ordered before re-enabling commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path"). Vincent Guittot (2): sched/fair: optimization of update_blocked_averages() sched/fair: Fix O(nr_cgroups) in load balance path kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 53 insertions(+), 14 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-06 16:14 ` Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra ` (2 more replies) 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2 siblings, 3 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too so they will be added back in the right order next time. With a normal decay of pelt, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the pelt of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ffd1ae7..badf8173 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + + /* + * With cfs_rq being unthrottled/throttled during an enqueue, + * it can happen the tmp_alone_branch points the a leaf that + * we finally want to del. In this case, tmp_alone_branch moves + * to the prev element but it will point to rq->leaf_cfs_rq_list + * at the end of the enqueue. 
+ */ + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; + list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; } @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) /* adjust cfs_rq_clock_task() */ cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; + + /* Add cfs_rq with already running entity in the list */ + if (cfs_rq->nr_running >= 1) + list_add_leaf_cfs_rq(cfs_rq); } return 0; @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; /* group is entering throttled state, stop time */ - if (!cfs_rq->throttle_count) + if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); + list_del_leaf_cfs_rq(cfs_rq); + } cfs_rq->throttle_count++; return 0; @@ -7699,10 +7717,6 @@ static void update_blocked_averages(int cpu) for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; - /* throttled entities do not contribute to load */ - if (throttled_hierarchy(cfs_rq)) - continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); -- 2.7.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
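For reference, the ordering invariant the changelog above relies on — every child cfs_rq appears on rq->leaf_cfs_rq_list before its parent, so a single pass can propagate blocked load bottom-up — can be pictured with a small userspace sketch. The toy_cfs_rq structure and the check below are illustrative only and are not part of the kernel:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a cfs_rq: only the parent link matters here. */
struct toy_cfs_rq {
	const char *name;
	struct toy_cfs_rq *parent;	/* NULL for the root group */
};

/*
 * The leaf_cfs_rq_list invariant: a child is always visited before its
 * parent, so load decayed in a child has already been accounted for when
 * the parent is processed in the same pass.
 */
static bool list_is_bottom_up(struct toy_cfs_rq *list[], size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < i; j++)
			if (list[j] == list[i]->parent)
				return false;	/* parent seen before child */
	return true;
}

int main(void)
{
	struct toy_cfs_rq root = { "root", NULL };
	struct toy_cfs_rq a = { "A", &root }, b = { "B", &a };

	struct toy_cfs_rq *good[] = { &b, &a, &root };	/* child first: OK */
	struct toy_cfs_rq *bad[]  = { &a, &b, &root };	/* parent before child */

	printf("%d %d\n", list_is_bottom_up(good, 3), list_is_bottom_up(bad, 3));
	return 0;
}

In the kernel the ordering is maintained incrementally by list_add_leaf_cfs_rq() and tmp_alone_branch rather than checked after the fact; the sketch only shows the property the patch preserves when cfs_rqs are removed and later re-added.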
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot @ 2019-02-08 15:40 ` Peter Zijlstra 2019-02-08 15:44 ` Vincent Guittot 2019-02-08 16:30 ` Peter Zijlstra 2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 15:40 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds Argh head hurts!! On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > /* adjust cfs_rq_clock_task() */ > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > cfs_rq->throttled_clock_task; > + > + /* Add cfs_rq with already running entity in the list */ > + if (cfs_rq->nr_running >= 1) > + list_add_leaf_cfs_rq(cfs_rq); > } > > return 0; Do we want the below to go with the above change? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 38d4669aa2ef..0bd80a891025 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) /* update hierarchical throttle state */ walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); + assert_list_leaf_cfs_rq(rq); + if (!cfs_rq->load.weight) return; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 15:40 ` Peter Zijlstra @ 2019-02-08 15:44 ` Vincent Guittot 2019-02-08 16:30 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 15:44 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > Argh head hurts!! > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > /* adjust cfs_rq_clock_task() */ > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > cfs_rq->throttled_clock_task; > > + > > + /* Add cfs_rq with already running entity in the list */ > > + if (cfs_rq->nr_running >= 1) > > + list_add_leaf_cfs_rq(cfs_rq); > > } > > > > return 0; > > Do we want the below to go with the above change? > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 38d4669aa2ef..0bd80a891025 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > /* update hierarchical throttle state */ > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > + assert_list_leaf_cfs_rq(rq); > + Good point but this should go after the for_each_sched_entity() loop > if (!cfs_rq->load.weight) > return; > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 15:44 ` Vincent Guittot @ 2019-02-08 16:30 ` Peter Zijlstra 2019-02-08 16:47 ` Vincent Guittot 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Argh head hurts!! > > > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > > /* adjust cfs_rq_clock_task() */ > > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > > cfs_rq->throttled_clock_task; > > > + > > > + /* Add cfs_rq with already running entity in the list */ > > > + if (cfs_rq->nr_running >= 1) > > > + list_add_leaf_cfs_rq(cfs_rq); > > > } > > > > > > return 0; > > > > Do we want the below to go with the above change? > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 38d4669aa2ef..0bd80a891025 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > /* update hierarchical throttle state */ > > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > > > + assert_list_leaf_cfs_rq(rq); > > + > > Good point but this should go after the for_each_sched_entity() loop Indeed, but that loop does enqueue and can throttle again, should that not also get that additional list_add_leaf_cfs_rq() loop we added to enqueue_task_fair() to finish the add? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-08 16:47 ` Vincent Guittot 2019-02-08 16:51 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:47 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > > Argh head hurts!! > > > > > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > > > /* adjust cfs_rq_clock_task() */ > > > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > > > cfs_rq->throttled_clock_task; > > > > + > > > > + /* Add cfs_rq with already running entity in the list */ > > > > + if (cfs_rq->nr_running >= 1) > > > > + list_add_leaf_cfs_rq(cfs_rq); > > > > } > > > > > > > > return 0; > > > > > > Do we want the below to go with the above change? > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 38d4669aa2ef..0bd80a891025 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > > /* update hierarchical throttle state */ > > > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > > > > > + assert_list_leaf_cfs_rq(rq); > > > + > > > > Good point but this should go after the for_each_sched_entity() loop > > Indeed, but that loop does enqueue and can throttle again, should that > not also get that additional list_add_leaf_cfs_rq() loop we added to > enqueue_task_fair() to finish the add? Initially, I added this additional loop but finally removed it because I didn't hit it during my tests. IIRC, we are protected by throttle_count in such case, which is not the case when we enqueue a task ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:47 ` Vincent Guittot @ 2019-02-08 16:51 ` Peter Zijlstra 2019-02-08 16:51 ` Vincent Guittot 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:51 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote: > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > Good point but this should go after the for_each_sched_entity() loop > > > > Indeed, but that loop does enqueue and can throttle again, should that > > not also get that additional list_add_leaf_cfs_rq() loop we added to > > enqueue_task_fair() to finish the add? > > Initially, I added this additional loop but finally removed it because > I didn't hit it during my tests. IIRC, we are protected by > throttle_count in such case, which is not the case when we enqueue a > task Fair enough; and the to-be added assert will notify us if we got that wrong :-) I'll add the assert, no need to re-send. Thanks! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:51 ` Peter Zijlstra @ 2019-02-08 16:51 ` Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:51 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:51, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote: > > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > Good point but this should go after the for_each_sched_entity() loop > > > > > > Indeed, but that loop does enqueue and can throttle again, should that > > > not also get that additional list_add_leaf_cfs_rq() loop we added to > > > enqueue_task_fair() to finish the add? > > > > Initially, I added this additional loop but finally removed it because > > I didn't hit it during my tests. IIRC, we are protected by > > throttle_count in such case, which is not the case when we enqueue a > > task > > Fair enough; and the to-be added assert will notify us if we got that > wrong :-) > > I'll add the assert, no need to re-send. Thanks > > Thanks! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra @ 2019-02-08 16:30 ` Peter Zijlstra 2019-02-08 16:48 ` Vincent Guittot 2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > { > if (cfs_rq->on_list) { > + struct rq *rq = rq_of(cfs_rq); > + > + /* > + * With cfs_rq being unthrottled/throttled during an enqueue, > + * it can happen the tmp_alone_branch points the a leaf that > + * we finally want to del. In this case, tmp_alone_branch moves > + * to the prev element but it will point to rq->leaf_cfs_rq_list > + * at the end of the enqueue. > + */ > + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) > + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; > + > list_del_rcu(&cfs_rq->leaf_cfs_rq_list); > cfs_rq->on_list = 0; > } So that is: enqueue_task_fair() enqueue_entity() list_add_lead_cfs_rq() check_enqueue_throttle() throttle_cfs_rq() walk_tg_tree_from() tg_throttle_down() list_del_leaf_cfs_rq() Which can try and remove a cfs_rq which we just added. And because the list is a bottom-up order, and the deletion is a downward operation, we must go back (prev) in the list. So far so good I suppose. > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) > struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; > > /* group is entering throttled state, stop time */ > - if (!cfs_rq->throttle_count) > + if (!cfs_rq->throttle_count) { > cfs_rq->throttled_clock_task = rq_clock_task(rq); > + list_del_leaf_cfs_rq(cfs_rq); > + } > cfs_rq->throttle_count++; > > return 0; ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-08 16:48 ` Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:48 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:31, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > > static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > > { > > if (cfs_rq->on_list) { > > + struct rq *rq = rq_of(cfs_rq); > > + > > + /* > > + * With cfs_rq being unthrottled/throttled during an enqueue, > > + * it can happen the tmp_alone_branch points the a leaf that > > + * we finally want to del. In this case, tmp_alone_branch moves > > + * to the prev element but it will point to rq->leaf_cfs_rq_list > > + * at the end of the enqueue. > > + */ > > + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) > > + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; > > + > > list_del_rcu(&cfs_rq->leaf_cfs_rq_list); > > cfs_rq->on_list = 0; > > } > > So that is: > > enqueue_task_fair() > enqueue_entity() > list_add_lead_cfs_rq() > check_enqueue_throttle() > throttle_cfs_rq() > walk_tg_tree_from() > tg_throttle_down() > list_del_leaf_cfs_rq() > > Which can try and remove a cfs_rq which we just added. > > And because the list is a bottom-up order, and the deletion is a > downward operation, we must go back (prev) in the list. yes exactly > > So far so good I suppose. > > > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) > > struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; > > > > /* group is entering throttled state, stop time */ > > - if (!cfs_rq->throttle_count) > > + if (!cfs_rq->throttle_count) { > > cfs_rq->throttled_clock_task = rq_clock_task(rq); > > + list_del_leaf_cfs_rq(cfs_rq); > > + } > > cfs_rq->throttle_count++; > > > > return 0; > > ^ permalink raw reply [flat|nested] 14+ messages in thread
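The tmp_alone_branch handling this exchange converges on amounts to a cursor-maintenance problem: when the element a saved list position points at is about to be unlinked, the position must first be stepped back to the previous element. A minimal sketch with a toy circular doubly linked list with a dedicated head node, in the style of the kernel's struct list_head but not using its RCU primitives (names are illustrative):

#include <stddef.h>

struct node {
	struct node *prev, *next;
};

/*
 * Unlink 'n' while keeping an external cursor valid: if the cursor points
 * at 'n', move it to the element before 'n' first, mirroring what
 * list_del_leaf_cfs_rq() does with rq->tmp_alone_branch.
 */
static void unlink_keeping_cursor(struct node *n, struct node **cursor)
{
	if (*cursor == n)
		*cursor = n->prev;	/* step back before 'n' disappears */

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

Whether stepping back one element is sufficient depends on the caller guaranteeing that the cursor is re-pointed at rq->leaf_cfs_rq_list by the end of the enqueue, which is exactly what the comment added in the patch notes.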
* [tip:sched/core] sched/fair: Optimize update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-11 10:47 ` tip-bot for Vincent Guittot 2 siblings, 0 replies; 14+ messages in thread From: tip-bot for Vincent Guittot @ 2019-02-11 10:47 UTC (permalink / raw) To: linux-tip-commits Cc: vincent.guittot, linux-kernel, hpa, mingo, peterz, torvalds, tglx Commit-ID: 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c Gitweb: https://git.kernel.org/tip/31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c Author: Vincent Guittot <vincent.guittot@linaro.org> AuthorDate: Wed, 6 Feb 2019 17:14:21 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 11 Feb 2019 08:02:12 +0100 sched/fair: Optimize update_blocked_averages() Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 38d4669aa2ef..027f8e1b5b66 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + + /* + * With cfs_rq being unthrottled/throttled during an enqueue, + * it can happen the tmp_alone_branch points the a leaf that + * we finally want to del. In this case, tmp_alone_branch moves + * to the prev element but it will point to rq->leaf_cfs_rq_list + * at the end of the enqueue. 
+ */ + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; + list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; } @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) /* adjust cfs_rq_clock_task() */ cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; + + /* Add cfs_rq with already running entity in the list */ + if (cfs_rq->nr_running >= 1) + list_add_leaf_cfs_rq(cfs_rq); } return 0; @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; /* group is entering throttled state, stop time */ - if (!cfs_rq->throttle_count) + if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); + list_del_leaf_cfs_rq(cfs_rq); + } cfs_rq->throttle_count++; return 0; @@ -4553,6 +4571,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) break; } + assert_list_leaf_cfs_rq(rq); + if (!se) add_nr_running(rq, task_delta); @@ -7700,10 +7720,6 @@ static void update_blocked_averages(int cpu) for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; - /* throttled entities do not contribute to load */ - if (throttled_hierarchy(cfs_rq)) - continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot @ 2019-02-06 16:14 ` Vincent Guittot 2019-02-11 10:48 ` [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path tip-bot for Vincent Guittot 2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot This reverts: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") Now that cfs_rq can be safely removed/added in the list, we can re-apply commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index badf8173..c6167bb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); } -/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) +/* Iterate thr' all leaf cfs_rq's on a runqueue */ +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ + leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline struct cfs_rq * @@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) { } -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7699,10 +7700,27 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->load.weight) + return false; + + if (cfs_rq->avg.load_sum) + return false; + + if (cfs_rq->avg.util_sum) + return false; + + if (cfs_rq->avg.runnable_load_sum) + return false; + + return true; +} + static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7714,7 +7732,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq(rq, cfs_rq) { + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { struct sched_entity *se; if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) @@ -7725,6 +7743,13 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); + /* + * There can be a lot of idle CPU cgroups. Don't let fully + * decayed cfs_rqs linger on the list. 
+ */ + if (cfs_rq_is_decayed(cfs_rq)) + list_del_leaf_cfs_rq(cfs_rq); + /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10606,10 +10631,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; rcu_read_lock(); - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } -- 2.7.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
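One detail worth spelling out: the re-applied patch switches from for_each_leaf_cfs_rq() to a *_safe variant because update_blocked_averages() may now unlink the cfs_rq it is currently visiting. The idiom — cache the successor before the loop body runs — is shown below on a toy singly linked list; the item structure and prune_decayed() are illustrative only, not kernel code:

#include <stdlib.h>

struct item {
	struct item *next;
	int fully_decayed;	/* stand-in for cfs_rq_is_decayed() */
};

/*
 * "Safe" traversal: the successor is cached in 'pos' before the body runs,
 * so the current element can be unlinked (and here freed) without breaking
 * the walk — the same reason the patch uses for_each_leaf_cfs_rq_safe().
 */
static void prune_decayed(struct item **head)
{
	struct item **link = head;
	struct item *it, *pos;

	for (it = *head; it; it = pos) {
		pos = it->next;			/* cache before possible delete */
		if (it->fully_decayed) {
			*link = pos;		/* unlink 'it' */
			free(it);
		} else {
			link = &it->next;
		}
	}
}

The kernel variant only unlinks the cfs_rq — it stays allocated and is put back on the list by tg_unthrottle_up() or at the next enqueue; the free() here just completes the toy.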
* [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-11 10:48 ` tip-bot for Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: tip-bot for Vincent Guittot @ 2019-02-11 10:48 UTC (permalink / raw) To: linux-tip-commits Cc: hpa, tglx, peterz, vincent.guittot, torvalds, mingo, linux-kernel Commit-ID: 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 Gitweb: https://git.kernel.org/tip/039ae8bcf7a5f4476f4487e6bf816885fb3fb617 Author: Vincent Guittot <vincent.guittot@linaro.org> AuthorDate: Wed, 6 Feb 2019 17:14:22 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 11 Feb 2019 08:02:13 +0100 sched/fair: Fix O(nr_cgroups) in the load balancing path This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 027f8e1b5b66..17a961522d1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); } -/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) +/* Iterate thr' all leaf cfs_rq's on a runqueue */ +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ + leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? 
*/ static inline struct cfs_rq * @@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) { } -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7702,10 +7703,27 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->load.weight) + return false; + + if (cfs_rq->avg.load_sum) + return false; + + if (cfs_rq->avg.util_sum) + return false; + + if (cfs_rq->avg.runnable_load_sum) + return false; + + return true; +} + static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7717,7 +7735,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq(rq, cfs_rq) { + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { struct sched_entity *se; if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) @@ -7728,6 +7746,13 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); + /* + * There can be a lot of idle CPU cgroups. Don't let fully + * decayed cfs_rqs linger on the list. + */ + if (cfs_rq_is_decayed(cfs_rq)) + list_del_leaf_cfs_rq(cfs_rq); + /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10609,10 +10634,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; rcu_read_lock(); - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-06 16:23 ` Vincent Guittot 2 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:23 UTC (permalink / raw) To: linux-kernel, Ingo Molnar, Peter Zijlstra Cc: Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Wed, 6 Feb 2019 at 17:14, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > This patchset adds missing pieces in the management of leaf_cfs_rq_list > to ensure that cfs_rqs are always correctly ordered before > re-enabling ("sched/fair: Fix O(nr_cgroups) in load balance path") I should have mentioned that this patchset applies on the latest tip/sched/core branch > > Vincent Guittot (2): > sched/fair: optimization of update_blocked_averages() > sched/fair: Fix O(nr_cgroups) in load balance path > > kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 53 insertions(+), 14 deletions(-) > > -- > 2.7.4 > ^ permalink raw reply [flat|nested] 14+ messages in thread