* [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path
@ 2019-02-06 16:14 Vincent Guittot
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
To: linux-kernel, mingo, peterz
Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot
This patchset adds the missing pieces in the management of leaf_cfs_rq_list
to ensure that cfs_rqs are always correctly ordered, before re-enabling
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path").
Vincent Guittot (2):
sched/fair: optimization of update_blocked_averages()
sched/fair: Fix O(nr_cgroups) in load balance path
kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 14 deletions(-)
--
2.7.4
* [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-06 16:14 ` Vincent Guittot
2019-02-08 15:40 ` Peter Zijlstra
` (2 more replies)
2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
2 siblings, 3 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
To: linux-kernel, mingo, peterz
Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot
Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it is added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.
With a normal decay of PELT, a parent will be empty and fully decayed
if all of its children are empty and fully decayed too. In such a case, we
just have to ensure that the whole branch will be added back when a new
task is enqueued. This is the default behavior since:
commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")
In case of throttling, the PELT of a throttled cfs_rq will not be updated
whereas the parent's will. This breaks the assumption made above unless we
also remove the children of a cfs_rq that is throttled. They will then be
added back when the cfs_rq is unthrottled and a sched_entity is enqueued.
As throttled cfs_rqs are now removed from the list, we can remove the
associated test in update_blocked_averages().
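As a purely illustrative aside: the ordering requirement can be seen in a
stand-alone C sketch. This is not kernel code; the struct, field names and
arithmetic below are invented, and only the visiting order mirrors what
update_blocked_averages() relies on, namely that a child is visited, and its
contribution folded into its parent, before the parent itself is visited.

#include <stdio.h>

struct group {
        const char *name;
        struct group *parent;
        int blocked_load;       /* toy stand-in for the PELT averages */
};

/* Visit order mimics rq->leaf_cfs_rq_list: children before parents. */
static void update_bottom_up(struct group **list, int n)
{
        for (int i = 0; i < n; i++) {
                struct group *g = list[i];

                g->blocked_load /= 2;           /* decay this group */
                if (g->parent)                  /* propagate upwards */
                        g->parent->blocked_load += g->blocked_load;
                printf("%-6s -> %d\n", g->name, g->blocked_load);
        }
}

int main(void)
{
        struct group root  = { "root",  NULL,  0 };
        struct group child = { "child", &root, 32 };

        /* child listed before its parent: the invariant to preserve */
        struct group *ordered[] = { &child, &root };

        update_bottom_up(ordered, 2);
        return 0;
}

If a parent were visited before one of its children, it would be decayed
before the child's contribution had been folded in, and the walk would carry
a stale value until the next pass; that is the kind of inconsistency an
out-of-order re-insertion into the list would cause.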
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffd1ae7..badf8173 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (cfs_rq->on_list) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ /*
+ * With cfs_rq being unthrottled/throttled during an enqueue,
+ * it can happen that tmp_alone_branch points to a leaf that
+ * we finally want to delete. In this case, tmp_alone_branch moves
+ * to the prev element but it will point to rq->leaf_cfs_rq_list
+ * at the end of the enqueue.
+ */
+ if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
+ rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
cfs_rq->on_list = 0;
}
@@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
/* adjust cfs_rq_clock_task() */
cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
cfs_rq->throttled_clock_task;
+
+ /* Add cfs_rq with already running entity in the list */
+ if (cfs_rq->nr_running >= 1)
+ list_add_leaf_cfs_rq(cfs_rq);
}
return 0;
@@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
/* group is entering throttled state, stop time */
- if (!cfs_rq->throttle_count)
+ if (!cfs_rq->throttle_count) {
cfs_rq->throttled_clock_task = rq_clock_task(rq);
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
cfs_rq->throttle_count++;
return 0;
@@ -7699,10 +7717,6 @@ static void update_blocked_averages(int cpu)
for_each_leaf_cfs_rq(rq, cfs_rq) {
struct sched_entity *se;
- /* throttled entities do not contribute to load */
- if (throttled_hierarchy(cfs_rq))
- continue;
-
if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
update_tg_load_avg(cfs_rq, 0);
--
2.7.4
* [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path
2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
@ 2019-02-06 16:14 ` Vincent Guittot
2019-02-11 10:48 ` [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path tip-bot for Vincent Guittot
2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
2 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
To: linux-kernel, mingo, peterz
Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot
This reverts:
commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
Now that cfs_rq can be safely removed/added in the list, we can re-apply
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
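As background, the reason the iterator has to change at all is the classic
"do not delete the entry you are standing on" problem: the next position
must be saved before the current entry may be unlinked, which is what
list_for_each_entry_safe() (wrapped by for_each_leaf_cfs_rq_safe() in this
patch) provides. The stand-alone C sketch below shows the same pattern on a
deliberately simplified singly-linked list; it is not the kernel's list API,
and every name in it is invented.

#include <stdio.h>
#include <stdlib.h>

struct node {
        int decayed;            /* 1 if this entry should be dropped */
        struct node *next;
};

/* Walk the list and unlink fully "decayed" entries while iterating. */
static void walk_and_prune(struct node **head)
{
        struct node **link = head;
        struct node *n, *pos;

        for (n = *head; n; n = pos) {
                pos = n->next;          /* save next before touching n */
                if (n->decayed) {
                        *link = pos;    /* unlink n; walk resumes at pos */
                        free(n);
                } else {
                        link = &n->next;
                }
        }
}

static struct node *push(struct node *next, int decayed)
{
        struct node *n = malloc(sizeof(*n));

        n->decayed = decayed;
        n->next = next;
        return n;
}

int main(void)
{
        /* build a three-entry list: decayed -> live -> decayed */
        struct node *head = push(push(push(NULL, 1), 0), 1);

        walk_and_prune(&head);
        printf("surviving head is %s\n", head ? "live" : "empty");
        free(head);             /* only the live entry remains */
        return 0;
}

Without the saved position, removing the current entry would leave the loop
iterating through freed memory.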
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++---------
1 file changed, 34 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index badf8173..c6167bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
}
-/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+ list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+ leaf_cfs_rq_list)
/* Do the two (enqueued) entities belong to the same group ? */
static inline struct cfs_rq *
@@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
{
}
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+ for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
@@ -7699,10 +7700,27 @@ static inline bool others_have_blocked(struct rq *rq)
#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->load.weight)
+ return false;
+
+ if (cfs_rq->avg.load_sum)
+ return false;
+
+ if (cfs_rq->avg.util_sum)
+ return false;
+
+ if (cfs_rq->avg.runnable_load_sum)
+ return false;
+
+ return true;
+}
+
static void update_blocked_averages(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq, *pos;
const struct sched_class *curr_class;
struct rq_flags rf;
bool done = true;
@@ -7714,7 +7732,7 @@ static void update_blocked_averages(int cpu)
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
- for_each_leaf_cfs_rq(rq, cfs_rq) {
+ for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
struct sched_entity *se;
if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
@@ -7725,6 +7743,13 @@ static void update_blocked_averages(int cpu)
if (se && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, 0);
+ /*
+ * There can be a lot of idle CPU cgroups. Don't let fully
+ * decayed cfs_rqs linger on the list.
+ */
+ if (cfs_rq_is_decayed(cfs_rq))
+ list_del_leaf_cfs_rq(cfs_rq);
+
/* Don't need periodic decay once load/util_avg are null */
if (cfs_rq_has_blocked(cfs_rq))
done = false;
@@ -10606,10 +10631,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SCHED_DEBUG
void print_cfs_stats(struct seq_file *m, int cpu)
{
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq, *pos;
rcu_read_lock();
- for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+ for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
print_cfs_rq(m, cpu, cfs_rq);
rcu_read_unlock();
}
--
2.7.4
* Re: [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path
2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-06 16:23 ` Vincent Guittot
2 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:23 UTC (permalink / raw)
To: linux-kernel, Ingo Molnar, Peter Zijlstra
Cc: Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds
On Wed, 6 Feb 2019 at 17:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> This patchset adds the missing pieces in the management of leaf_cfs_rq_list
> to ensure that cfs_rqs are always correctly ordered, before re-enabling
> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path").
I should have mentioned that this patchset applies on top of the latest
tip/sched/core branch.
>
> Vincent Guittot (2):
> sched/fair: optimization of update_blocked_averages()
> sched/fair: Fix O(nr_cgroups) in load balance path
>
> kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 53 insertions(+), 14 deletions(-)
>
> --
> 2.7.4
>
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
@ 2019-02-08 15:40 ` Peter Zijlstra
2019-02-08 15:44 ` Vincent Guittot
2019-02-08 16:30 ` Peter Zijlstra
2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot
2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 15:40 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds
Argh head hurts!!
On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> /* adjust cfs_rq_clock_task() */
> cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> cfs_rq->throttled_clock_task;
> +
> + /* Add cfs_rq with already running entity in the list */
> + if (cfs_rq->nr_running >= 1)
> + list_add_leaf_cfs_rq(cfs_rq);
> }
>
> return 0;
Do we want the below to go with the above change?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38d4669aa2ef..0bd80a891025 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/* update hierarchical throttle state */
walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+ assert_list_leaf_cfs_rq(rq);
+
if (!cfs_rq->load.weight)
return;
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 15:40 ` Peter Zijlstra
@ 2019-02-08 15:44 ` Vincent Guittot
2019-02-08 16:30 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 15:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Argh head hurts!!
>
> On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > /* adjust cfs_rq_clock_task() */
> > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> > cfs_rq->throttled_clock_task;
> > +
> > + /* Add cfs_rq with already running entity in the list */
> > + if (cfs_rq->nr_running >= 1)
> > + list_add_leaf_cfs_rq(cfs_rq);
> > }
> >
> > return 0;
>
> Do we want the below to go with the above change?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 38d4669aa2ef..0bd80a891025 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> /* update hierarchical throttle state */
> walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
>
> + assert_list_leaf_cfs_rq(rq);
> +
Good point but this should go after the for_each_sched_entity() loop
> if (!cfs_rq->load.weight)
> return;
>
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 15:44 ` Vincent Guittot
@ 2019-02-08 16:30 ` Peter Zijlstra
2019-02-08 16:47 ` Vincent Guittot
0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >
> > Argh head hurts!!
> >
> > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > > /* adjust cfs_rq_clock_task() */
> > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> > > cfs_rq->throttled_clock_task;
> > > +
> > > + /* Add cfs_rq with already running entity in the list */
> > > + if (cfs_rq->nr_running >= 1)
> > > + list_add_leaf_cfs_rq(cfs_rq);
> > > }
> > >
> > > return 0;
> >
> > Do we want the below to go with the above change?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 38d4669aa2ef..0bd80a891025 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > /* update hierarchical throttle state */
> > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> >
> > + assert_list_leaf_cfs_rq(rq);
> > +
>
> Good point but this should go after the for_each_sched_entity() loop
Indeed, but that loop does enqueue and can throttle again, should that
not also get that additional list_add_leaf_cfs_rq() loop we added to
enqueue_task_fair() to finish the add?
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
2019-02-08 15:40 ` Peter Zijlstra
@ 2019-02-08 16:30 ` Peter Zijlstra
2019-02-08 16:48 ` Vincent Guittot
2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot
2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds
On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> {
> if (cfs_rq->on_list) {
> + struct rq *rq = rq_of(cfs_rq);
> +
> + /*
> + * With cfs_rq being unthrottled/throttled during an enqueue,
> + * it can happen the tmp_alone_branch points the a leaf that
> + * we finally want to del. In this case, tmp_alone_branch moves
> + * to the prev element but it will point to rq->leaf_cfs_rq_list
> + * at the end of the enqueue.
> + */
> + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> +
> list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> cfs_rq->on_list = 0;
> }
So that is:
enqueue_task_fair()
enqueue_entity()
list_add_leaf_cfs_rq()
check_enqueue_throttle()
throttle_cfs_rq()
walk_tg_tree_from()
tg_throttle_down()
list_del_leaf_cfs_rq()
Which can try and remove a cfs_rq which we just added.
And because the list is in bottom-up order, and the deletion is a
downward operation, we must go back (prev) in the list.
So far so good I suppose.
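For illustration only, the same situation can be modelled with a stand-alone
C sketch (every name below is invented; this is not kernel code): a list
carries a separate cursor marking where an in-progress branch will be
spliced, and when the element that cursor points at is deleted, the cursor
must first step back to the previous element, which is exactly what
list_del_leaf_cfs_rq() now does with rq->tmp_alone_branch.

#include <assert.h>
#include <stdio.h>

struct entry {
        int id;
        struct entry *prev, *next;
};

struct list {
        struct entry head;      /* stand-in for rq->leaf_cfs_rq_list */
        struct entry *cursor;   /* stand-in for rq->tmp_alone_branch */
};

static void list_init(struct list *l)
{
        l->head.prev = l->head.next = &l->head;
        l->cursor = &l->head;
}

static void add_after(struct entry *pos, struct entry *e)
{
        e->prev = pos;
        e->next = pos->next;
        pos->next->prev = e;
        pos->next = e;
}

static void del_entry(struct list *l, struct entry *e)
{
        /* mirror of the fix: never leave the cursor on a dead entry */
        if (l->cursor == e)
                l->cursor = e->prev;

        e->prev->next = e->next;
        e->next->prev = e->prev;
}

int main(void)
{
        struct list l;
        struct entry a = { .id = 1 }, b = { .id = 2 };

        list_init(&l);
        add_after(&l.head, &a);
        add_after(&a, &b);

        l.cursor = &b;          /* branch being built ends at b */
        del_entry(&l, &b);      /* b gets thrown out mid-operation... */

        assert(l.cursor == &a); /* ...so the cursor backed off to a */
        printf("cursor now at id=%d\n", l.cursor->id);
        return 0;
}

In the kernel, the corresponding sanity check is assert_list_leaf_cfs_rq(),
which warns if tmp_alone_branch is not back on rq->leaf_cfs_rq_list once the
enqueue completes.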
> @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>
> /* group is entering throttled state, stop time */
> - if (!cfs_rq->throttle_count)
> + if (!cfs_rq->throttle_count) {
> cfs_rq->throttled_clock_task = rq_clock_task(rq);
> + list_del_leaf_cfs_rq(cfs_rq);
> + }
> cfs_rq->throttle_count++;
>
> return 0;
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 16:30 ` Peter Zijlstra
@ 2019-02-08 16:47 ` Vincent Guittot
2019-02-08 16:51 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > >
> > > Argh head hurts!!
> > >
> > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > > > /* adjust cfs_rq_clock_task() */
> > > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> > > > cfs_rq->throttled_clock_task;
> > > > +
> > > > + /* Add cfs_rq with already running entity in the list */
> > > > + if (cfs_rq->nr_running >= 1)
> > > > + list_add_leaf_cfs_rq(cfs_rq);
> > > > }
> > > >
> > > > return 0;
> > >
> > > Do we want the below to go with the above change?
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 38d4669aa2ef..0bd80a891025 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > > /* update hierarchical throttle state */
> > > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> > >
> > > + assert_list_leaf_cfs_rq(rq);
> > > +
> >
> > Good point but this should go after the for_each_sched_entity() loop
>
> Indeed, but that loop does enqueue and can throttle again, should that
> not also get that additional list_add_leaf_cfs_rq() loop we added to
> enqueue_task_fair() to finish the add?
Initially, I added this additional loop but finally removed it because
I didn't hit it during my tests. IIRC, we are protected by
throttle_count in such a case, which is not the case when we enqueue a
task.
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 16:30 ` Peter Zijlstra
@ 2019-02-08 16:48 ` Vincent Guittot
0 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, 8 Feb 2019 at 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > {
> > if (cfs_rq->on_list) {
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > + /*
> > + * With cfs_rq being unthrottled/throttled during an enqueue,
> > + * it can happen that tmp_alone_branch points to a leaf that
> > + * we finally want to delete. In this case, tmp_alone_branch moves
> > + * to the prev element but it will point to rq->leaf_cfs_rq_list
> > + * at the end of the enqueue.
> > + */
> > + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > +
> > list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > cfs_rq->on_list = 0;
> > }
>
> So that is:
>
> enqueue_task_fair()
> enqueue_entity()
> list_add_leaf_cfs_rq()
> check_enqueue_throttle()
> throttle_cfs_rq()
> walk_tg_tree_from()
> tg_throttle_down()
> list_del_leaf_cfs_rq()
>
> Which can try and remove a cfs_rq which we just added.
>
> And because the list is a bottom-up order, and the deletion is a
> downward operation, we must go back (prev) in the list.
yes exactly
>
> So far so good I suppose.
>
> > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> > struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> >
> > /* group is entering throttled state, stop time */
> > - if (!cfs_rq->throttle_count)
> > + if (!cfs_rq->throttle_count) {
> > cfs_rq->throttled_clock_task = rq_clock_task(rq);
> > + list_del_leaf_cfs_rq(cfs_rq);
> > + }
> > cfs_rq->throttle_count++;
> >
> > return 0;
>
>
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 16:47 ` Vincent Guittot
@ 2019-02-08 16:51 ` Peter Zijlstra
2019-02-08 16:51 ` Vincent Guittot
0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:51 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote:
> On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
> > > Good point but this should go after the for_each_sched_entity() loop
> >
> > Indeed, but that loop does enqueue and can throttle again, should that
> > not also get that additional list_add_leaf_cfs_rq() loop we added to
> > enqueue_task_fair() to finish the add?
>
> Initially, I added this additional loop but finally removed it because
> I didn't hit it during my tests. IIRC, we are protected by
> > throttle_count in such a case, which is not the case when we enqueue a
> > task.
Fair enough; and the to-be-added assert will notify us if we got that
wrong :-)
I'll add the assert, no need to re-send.
Thanks!
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
2019-02-08 16:51 ` Peter Zijlstra
@ 2019-02-08 16:51 ` Vincent Guittot
0 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
xiezhipeng1, Linus Torvalds
On Fri, 8 Feb 2019 at 17:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote:
> > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > > Good point but this should go after the for_each_sched_entity() loop
> > >
> > > Indeed, but that loop does enqueue and can throttle again, should that
> > > not also get that additional list_add_leaf_cfs_rq() loop we added to
> > > enqueue_task_fair() to finish the add?
> >
> > Initially, I added this additional loop but finally removed it because
> > I didn't hit it during my tests. IIRC, we are protected by
> > throttle_count in such case, which is not the case when we enqueue a
> > task
>
> Fair enough; and the to-be added assert will notify us if we got that
> wrong :-)
>
> I'll add the assert, no need to re-send.
Thanks
>
> Thanks!
* [tip:sched/core] sched/fair: Optimize update_blocked_averages()
2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
2019-02-08 15:40 ` Peter Zijlstra
2019-02-08 16:30 ` Peter Zijlstra
@ 2019-02-11 10:47 ` tip-bot for Vincent Guittot
2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Vincent Guittot @ 2019-02-11 10:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: vincent.guittot, linux-kernel, hpa, mingo, peterz, torvalds, tglx
Commit-ID: 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c
Gitweb: https://git.kernel.org/tip/31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c
Author: Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 6 Feb 2019 17:14:21 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 11 Feb 2019 08:02:12 +0100
sched/fair: Optimize update_blocked_averages()
Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it is added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.
With a normal decay of PELT, a parent will be empty and fully decayed
if all of its children are empty and fully decayed too. In such a case, we
just have to ensure that the whole branch will be added back when a new
task is enqueued. This is the default behavior since:
commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")
In case of throttling, the PELT of a throttled cfs_rq will not be updated
whereas the parent's will. This breaks the assumption made above unless we
also remove the children of a cfs_rq that is throttled. They will then be
added back when the cfs_rq is unthrottled and a sched_entity is enqueued.
As throttled cfs_rqs are now removed from the list, we can remove the
associated test in update_blocked_averages().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38d4669aa2ef..027f8e1b5b66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (cfs_rq->on_list) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ /*
+ * With cfs_rq being unthrottled/throttled during an enqueue,
+ * it can happen that tmp_alone_branch points to a leaf that
+ * we finally want to delete. In this case, tmp_alone_branch moves
+ * to the prev element but it will point to rq->leaf_cfs_rq_list
+ * at the end of the enqueue.
+ */
+ if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
+ rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
cfs_rq->on_list = 0;
}
@@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
/* adjust cfs_rq_clock_task() */
cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
cfs_rq->throttled_clock_task;
+
+ /* Add cfs_rq with already running entity in the list */
+ if (cfs_rq->nr_running >= 1)
+ list_add_leaf_cfs_rq(cfs_rq);
}
return 0;
@@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
/* group is entering throttled state, stop time */
- if (!cfs_rq->throttle_count)
+ if (!cfs_rq->throttle_count) {
cfs_rq->throttled_clock_task = rq_clock_task(rq);
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
cfs_rq->throttle_count++;
return 0;
@@ -4553,6 +4571,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}
+ assert_list_leaf_cfs_rq(rq);
+
if (!se)
add_nr_running(rq, task_delta);
@@ -7700,10 +7720,6 @@ static void update_blocked_averages(int cpu)
for_each_leaf_cfs_rq(rq, cfs_rq) {
struct sched_entity *se;
- /* throttled entities do not contribute to load */
- if (throttled_hierarchy(cfs_rq))
- continue;
-
if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
update_tg_load_avg(cfs_rq, 0);
* [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path
2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-11 10:48 ` tip-bot for Vincent Guittot
0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Vincent Guittot @ 2019-02-11 10:48 UTC (permalink / raw)
To: linux-tip-commits
Cc: hpa, tglx, peterz, vincent.guittot, torvalds, mingo, linux-kernel
Commit-ID: 039ae8bcf7a5f4476f4487e6bf816885fb3fb617
Gitweb: https://git.kernel.org/tip/039ae8bcf7a5f4476f4487e6bf816885fb3fb617
Author: Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 6 Feb 2019 17:14:22 +0100
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 11 Feb 2019 08:02:13 +0100
sched/fair: Fix O(nr_cgroups) in the load balancing path
This re-applies the commit reverted here:
commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply:
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++---------
1 file changed, 34 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 027f8e1b5b66..17a961522d1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
}
-/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+ list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+ leaf_cfs_rq_list)
/* Do the two (enqueued) entities belong to the same group ? */
static inline struct cfs_rq *
@@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
{
}
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+ for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
@@ -7702,10 +7703,27 @@ static inline bool others_have_blocked(struct rq *rq)
#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->load.weight)
+ return false;
+
+ if (cfs_rq->avg.load_sum)
+ return false;
+
+ if (cfs_rq->avg.util_sum)
+ return false;
+
+ if (cfs_rq->avg.runnable_load_sum)
+ return false;
+
+ return true;
+}
+
static void update_blocked_averages(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq, *pos;
const struct sched_class *curr_class;
struct rq_flags rf;
bool done = true;
@@ -7717,7 +7735,7 @@ static void update_blocked_averages(int cpu)
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
- for_each_leaf_cfs_rq(rq, cfs_rq) {
+ for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
struct sched_entity *se;
if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
@@ -7728,6 +7746,13 @@ static void update_blocked_averages(int cpu)
if (se && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, 0);
+ /*
+ * There can be a lot of idle CPU cgroups. Don't let fully
+ * decayed cfs_rqs linger on the list.
+ */
+ if (cfs_rq_is_decayed(cfs_rq))
+ list_del_leaf_cfs_rq(cfs_rq);
+
/* Don't need periodic decay once load/util_avg are null */
if (cfs_rq_has_blocked(cfs_rq))
done = false;
@@ -10609,10 +10634,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SCHED_DEBUG
void print_cfs_stats(struct seq_file *m, int cpu)
{
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq, *pos;
rcu_read_lock();
- for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+ for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
print_cfs_rq(m, cpu, cfs_rq);
rcu_read_unlock();
}