* [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path
@ 2019-02-06 16:14 Vincent Guittot
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot

This patchset adds the missing pieces in the management of leaf_cfs_rq_list
to ensure that cfs_rqs are always correctly ordered before re-enabling
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path").

Vincent Guittot (2):
  sched/fair: optimization of update_blocked_averages()
  sched/fair: Fix O(nr_cgroups) in load balance path

 kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 14 deletions(-)

-- 
2.7.4



* [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-06 16:14 ` Vincent Guittot
  2019-02-08 15:40   ` Peter Zijlstra
                     ` (2 more replies)
  2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
  2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
  2 siblings, 3 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot

Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it is added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.

With a normal decay of PELT, a parent will be empty and fully decayed
if all children are empty and fully decayed too. In such a case, we just
have to ensure that the whole branch will be added when a new task is
enqueued. This is the default behavior since:
  commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")

In case of throttling, the PELT of a throttled cfs_rq will not be updated
whereas its parent's will. This breaks the assumption made above unless we
remove the children of a cfs_rq that is throttled. They will then be
added back when the cfs_rq is unthrottled and a sched_entity is enqueued.

As throttled cfs_rqs are now removed from the list, we can remove the
associated test in update_blocked_averages().
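
To make the ordering requirement concrete, here is a minimal userspace
sketch (illustration only, not kernel code; the group names and fields
are made up): the leaf list must keep every child ahead of its parent so
that the child's blocked load has already been folded into the parent
before the parent itself is updated.

#include <stdio.h>

struct grp {
	const char *name;
	struct grp *parent;
	long blocked_load;
};

int main(void)
{
	struct grp root = { "root", NULL,  0 };
	struct grp a    = { "A",    &root, 0 };
	struct grp b    = { "A/B",  &a,    100 };

	/* bottom-up leaf order: every child appears before its parent */
	struct grp *leaf_list[] = { &b, &a, &root };

	for (unsigned int i = 0; i < 3; i++) {
		struct grp *g = leaf_list[i];

		/* propagate the already-updated child load to the parent */
		if (g->parent)
			g->parent->blocked_load += g->blocked_load;
		printf("update %-4s blocked_load=%ld\n",
		       g->name, g->blocked_load);
	}
	/* If "A" were visited before "A/B", it would miss B's contribution. */
	return 0;
}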

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffd1ae7..badf8173 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		/*
+		 * With cfs_rq being unthrottled/throttled during an enqueue,
+		 * it can happen that tmp_alone_branch points to a leaf that
+		 * we ultimately want to delete. In this case, tmp_alone_branch
+		 * moves to the prev element, but it will point to
+		 * rq->leaf_cfs_rq_list at the end of the enqueue.
+		 */
+		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
+			rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
 	}
@@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		/* adjust cfs_rq_clock_task() */
 		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
 					     cfs_rq->throttled_clock_task;
+
+		/* Add cfs_rq with already running entity in the list */
+		if (cfs_rq->nr_running >= 1)
+			list_add_leaf_cfs_rq(cfs_rq);
 	}
 
 	return 0;
@@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count)
+	if (!cfs_rq->throttle_count) {
 		cfs_rq->throttled_clock_task = rq_clock_task(rq);
+		list_del_leaf_cfs_rq(cfs_rq);
+	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -7699,10 +7717,6 @@ static void update_blocked_averages(int cpu)
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
-		/* throttled entities do not contribute to load */
-		if (throttled_hierarchy(cfs_rq))
-			continue;
-
 		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
 			update_tg_load_avg(cfs_rq, 0);
 
-- 
2.7.4



* [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path
  2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
@ 2019-02-06 16:14 ` Vincent Guittot
  2019-02-11 10:48   ` [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path tip-bot for Vincent Guittot
  2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
  2 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz
  Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot

This reverts:
  commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

Now that a cfs_rq can be safely removed from/added to the list, we can
re-apply:
  commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index badf8173..c6167bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
 }
 
-/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
+				 leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 {
 }
 
-#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
-		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
+		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7699,10 +7700,27 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
+static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->load.weight)
+		return false;
+
+	if (cfs_rq->avg.load_sum)
+		return false;
+
+	if (cfs_rq->avg.util_sum)
+		return false;
+
+	if (cfs_rq->avg.runnable_load_sum)
+		return false;
+
+	return true;
+}
+
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7714,7 +7732,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
@@ -7725,6 +7743,13 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (cfs_rq_is_decayed(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10606,10 +10631,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }
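
As an aside, here is a minimal userspace sketch (plain C, not the
kernel's <linux/list.h> implementation) of why the iterator needs the
extra 'pos' cursor once the loop body may delete the cfs_rq it is
currently standing on:

#include <stdio.h>
#include <stdlib.h>

struct node {
	int decayed;
	struct node *next;
};

int main(void)
{
	/* build a small list: decayed -> busy -> decayed */
	int vals[] = { 1, 0, 1 };
	struct node *head = NULL, **tail = &head;

	for (int i = 0; i < 3; i++) {
		struct node *n = malloc(sizeof(*n));

		n->decayed = vals[i];
		n->next = NULL;
		*tail = n;
		tail = &n->next;
	}

	/*
	 * "Safe" walk: cache the next element before the body runs, so
	 * unlinking and freeing the current element cannot break the
	 * iteration. This is what the spare cursor of
	 * for_each_leaf_cfs_rq_safe() / list_for_each_entry_safe() provides.
	 */
	struct node **linkp = &head;
	for (struct node *cur = head, *nxt; cur; cur = nxt) {
		nxt = cur->next;
		if (cur->decayed) {
			*linkp = nxt;	/* unlink, like list_del_leaf_cfs_rq() */
			free(cur);	/* cur is gone, nxt is still valid */
		} else {
			linkp = &cur->next;
		}
	}

	for (struct node *cur = head; cur; cur = cur->next)
		printf("kept node, decayed=%d\n", cur->decayed);
	return 0;
}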
-- 
2.7.4



* Re: [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path
  2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
  2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-06 16:23 ` Vincent Guittot
  2 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-06 16:23 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra
  Cc: Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds

On Wed, 6 Feb 2019 at 17:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> This patchset adds missing pieces in the management of leaf_cfs_rq_list
> to ensure that cfs_rqs are always correctly ordered before
> re-enabling ("sched/fair: Fix O(nr_cgroups) in load balance path")

I should have mentioned that this patchset applies on top of the latest
tip/sched/core branch.

>
> Vincent Guittot (2):
>   sched/fair: optimization of update_blocked_averages()
>   sched/fair: Fix O(nr_cgroups) in load balance path
>
>  kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 53 insertions(+), 14 deletions(-)
>
> --
> 2.7.4
>


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
@ 2019-02-08 15:40   ` Peter Zijlstra
  2019-02-08 15:44     ` Vincent Guittot
  2019-02-08 16:30   ` Peter Zijlstra
  2019-02-11 10:47   ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 15:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds


Argh head hurts!!

On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  		/* adjust cfs_rq_clock_task() */
>  		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
>  					     cfs_rq->throttled_clock_task;
> +
> +		/* Add cfs_rq with already running entity in the list */
> +		if (cfs_rq->nr_running >= 1)
> +			list_add_leaf_cfs_rq(cfs_rq);
>  	}
>  
>  	return 0;

Do we want the below to go with the above change?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38d4669aa2ef..0bd80a891025 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
+	assert_list_leaf_cfs_rq(rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 15:40   ` Peter Zijlstra
@ 2019-02-08 15:44     ` Vincent Guittot
  2019-02-08 16:30       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Argh head hurts!!
>
> On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> >               /* adjust cfs_rq_clock_task() */
> >               cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> >                                            cfs_rq->throttled_clock_task;
> > +
> > +             /* Add cfs_rq with already running entity in the list */
> > +             if (cfs_rq->nr_running >= 1)
> > +                     list_add_leaf_cfs_rq(cfs_rq);
> >       }
> >
> >       return 0;
>
> Do we want the below to go with the above change?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 38d4669aa2ef..0bd80a891025 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>         /* update hierarchical throttle state */
>         walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
>
> +       assert_list_leaf_cfs_rq(rq);
> +

Good point but this should go after the for_each_sched_entity() loop

>         if (!cfs_rq->load.weight)
>                 return;
>


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 15:44     ` Vincent Guittot
@ 2019-02-08 16:30       ` Peter Zijlstra
  2019-02-08 16:47         ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >
> > Argh head hurts!!
> >
> > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > >               /* adjust cfs_rq_clock_task() */
> > >               cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> > >                                            cfs_rq->throttled_clock_task;
> > > +
> > > +             /* Add cfs_rq with already running entity in the list */
> > > +             if (cfs_rq->nr_running >= 1)
> > > +                     list_add_leaf_cfs_rq(cfs_rq);
> > >       }
> > >
> > >       return 0;
> >
> > Do we want the below to go with the above change?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 38d4669aa2ef..0bd80a891025 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >         /* update hierarchical throttle state */
> >         walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> >
> > +       assert_list_leaf_cfs_rq(rq);
> > +
> 
> Good point but this should go after the for_each_sched_entity() loop

Indeed, but that loop does enqueue and can throttle again, should that
not also get that additional list_add_leaf_cfs_rq() loop we added to
enqueue_task_fair() to finish the add?


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
  2019-02-08 15:40   ` Peter Zijlstra
@ 2019-02-08 16:30   ` Peter Zijlstra
  2019-02-08 16:48     ` Vincent Guittot
  2019-02-11 10:47   ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds

On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	if (cfs_rq->on_list) {
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		/*
> +		 * With cfs_rq being unthrottled/throttled during an enqueue,
> +		 * it can happen that tmp_alone_branch points to a leaf that
> +		 * we ultimately want to delete. In this case, tmp_alone_branch
> +		 * moves to the prev element, but it will point to
> +		 * rq->leaf_cfs_rq_list at the end of the enqueue.
> +		 */
> +		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> +			rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> +
>  		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
>  		cfs_rq->on_list = 0;
>  	}

So that is:

  enqueue_task_fair()
    enqueue_entity()
      list_add_leaf_cfs_rq()
      check_enqueue_throttle()
        throttle_cfs_rq()
	  walk_tg_tree_from()
	    tg_throttle_down()
	      list_del_leaf_cfs_rq()

Which can try and remove a cfs_rq which we just added.

And because the list is a bottom-up order, and the deletion is a
downward operation, we must go back (prev) in the list.

So far so good I suppose.
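
For context, walk_tg_tree_from() in the chain above calls the "down"
callback (tg_throttle_down() here) on the starting group and on every
descendant before running the "up" callback on the way back. A rough
userspace sketch of that shape (made-up group names, simplified to one
child per group; not the kernel implementation):

#include <stdio.h>

struct tg {
	const char *name;
	struct tg *child;	/* one child is enough for the sketch */
};

static int tg_nop(struct tg *tg, void *data)
{
	return 0;
}

static int print_down(struct tg *tg, void *data)
{
	printf("down: %s\n", tg->name);
	return 0;
}

/* simplified: the real helper is iterative and callbacks can stop the walk */
static void walk_tg_tree_from(struct tg *from,
			      int (*down)(struct tg *, void *),
			      int (*up)(struct tg *, void *),
			      void *data)
{
	down(from, data);
	if (from->child)
		walk_tg_tree_from(from->child, down, up, data);
	up(from, data);
}

int main(void)
{
	struct tg c = { "A/B/C", NULL };
	struct tg b = { "A/B",   &c };
	struct tg a = { "A",     &b };

	/* "down" visits A, then A/B, then A/B/C */
	walk_tg_tree_from(&a, print_down, tg_nop, NULL);
	return 0;
}

So tg_throttle_down(), and with it list_del_leaf_cfs_rq(), runs for the
throttled cfs_rq and for the whole sub-branch below it.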

> @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
>  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>  
>  	/* group is entering throttled state, stop time */
> -	if (!cfs_rq->throttle_count)
> +	if (!cfs_rq->throttle_count) {
>  		cfs_rq->throttled_clock_task = rq_clock_task(rq);
> +		list_del_leaf_cfs_rq(cfs_rq);
> +	}
>  	cfs_rq->throttle_count++;
>  
>  	return 0;




* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 16:30       ` Peter Zijlstra
@ 2019-02-08 16:47         ` Vincent Guittot
  2019-02-08 16:51           ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > >
> > > Argh head hurts!!
> > >
> > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> > > >               /* adjust cfs_rq_clock_task() */
> > > >               cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
> > > >                                            cfs_rq->throttled_clock_task;
> > > > +
> > > > +             /* Add cfs_rq with already running entity in the list */
> > > > +             if (cfs_rq->nr_running >= 1)
> > > > +                     list_add_leaf_cfs_rq(cfs_rq);
> > > >       }
> > > >
> > > >       return 0;
> > >
> > > Do we want the below to go with the above change?
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 38d4669aa2ef..0bd80a891025 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > >         /* update hierarchical throttle state */
> > >         walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> > >
> > > +       assert_list_leaf_cfs_rq(rq);
> > > +
> >
> > Good point but this should go after the for_each_sched_entity() loop
>
> Indeed, but that loop does enqueue and can throttle again, should that
> not also get that additional list_add_leaf_cfs_rq() loop we added to
> enqueue_task_fair() to finish the add?

Initially, I added this additional loop but finally removed it because
I didn't hit it during my tests. IIRC, we are protected by
throttle_count in such a case, which is not the case when we enqueue a
task.


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 16:30   ` Peter Zijlstra
@ 2019-02-08 16:48     ` Vincent Guittot
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, 8 Feb 2019 at 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >       if (cfs_rq->on_list) {
> > +             struct rq *rq = rq_of(cfs_rq);
> > +
> > +             /*
> > +              * With cfs_rq being unthrottled/throttled during an enqueue,
> > +              * it can happen that tmp_alone_branch points to a leaf that
> > +              * we ultimately want to delete. In this case, tmp_alone_branch
> > +              * moves to the prev element, but it will point to
> > +              * rq->leaf_cfs_rq_list at the end of the enqueue.
> > +              */
> > +             if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > +                     rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > +
> >               list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> >               cfs_rq->on_list = 0;
> >       }
>
> So that is:
>
>   enqueue_task_fair()
>     enqueue_entity()
>       list_add_leaf_cfs_rq()
>       check_enqueue_throttle()
>         throttle_cfs_rq()
>           walk_tg_tree_from()
>             tg_throttle_down()
>               list_del_leaf_cfs_rq()
>
> Which can try and remove a cfs_rq which we just added.
>
> And because the list is a bottom-up order, and the deletion is a
> downward operation, we must go back (prev) in the list.

yes exactly

>
> So far so good I suppose.
>
> > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> >       struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> >
> >       /* group is entering throttled state, stop time */
> > -     if (!cfs_rq->throttle_count)
> > +     if (!cfs_rq->throttle_count) {
> >               cfs_rq->throttled_clock_task = rq_clock_task(rq);
> > +             list_del_leaf_cfs_rq(cfs_rq);
> > +     }
> >       cfs_rq->throttle_count++;
> >
> >       return 0;
>
>


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 16:47         ` Vincent Guittot
@ 2019-02-08 16:51           ` Peter Zijlstra
  2019-02-08 16:51             ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2019-02-08 16:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote:
> On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:

> > > Good point but this should go after the for_each_sched_entity() loop
> >
> > Indeed, but that loop does enqueue and can throttle again, should that
> > not also get that additional list_add_leaf_cfs_rq() loop we added to
> > enqueue_task_fair() to finish the add?
> 
> Initially, I added this additional loop but finally removed it because
> I didn't hit it during my tests. IIRC, we are protected by
> throttle_count in such a case, which is not the case when we enqueue a
> task.

Fair enough; and the to-be added assert will notify us if we got that
wrong :-)

I'll add the assert, no need to re-send.

Thanks!


* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages()
  2019-02-08 16:51           ` Peter Zijlstra
@ 2019-02-08 16:51             ` Vincent Guittot
  0 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2019-02-08 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi,
	xiezhipeng1, Linus Torvalds

On Fri, 8 Feb 2019 at 17:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote:
> > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote:
> > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > > Good point but this should go after the for_each_sched_entity() loop
> > >
> > > Indeed, but that loop does enqueue and can throttle again, should that
> > > not also get that additional list_add_leaf_cfs_rq() loop we added to
> > > enqueue_task_fair() to finish the add?
> >
> > Initially, I added this additional loop but finally removed it because
> > I didn't hit it during my tests. IIRC, we are protected by
> > throttle_count in such a case, which is not the case when we enqueue a
> > task.
>
> Fair enough; and the to-be added assert will notify us if we got that
> wrong :-)
>
> I'll add the assert, no need to re-send.

Thanks

>
> Thanks!


* [tip:sched/core] sched/fair: Optimize update_blocked_averages()
  2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot
  2019-02-08 15:40   ` Peter Zijlstra
  2019-02-08 16:30   ` Peter Zijlstra
@ 2019-02-11 10:47   ` tip-bot for Vincent Guittot
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Vincent Guittot @ 2019-02-11 10:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.guittot, linux-kernel, hpa, mingo, peterz, torvalds, tglx

Commit-ID:  31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c
Gitweb:     https://git.kernel.org/tip/31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 6 Feb 2019 17:14:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 11 Feb 2019 08:02:12 +0100

sched/fair: Optimize update_blocked_averages()

Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it will be added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.

With a normal decay of PELT, a parent will be empty and fully decayed
if all children are empty and fully decayed too. In such a case, we just
have to ensure that the whole branch will be added when a new task is
enqueued. This is the default behavior since:

  commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")

In case of throttling, the PELT of a throttled cfs_rq will not be updated
whereas its parent's will. This breaks the assumption made above unless we
remove the children of a cfs_rq that is throttled. They will then be
added back when the cfs_rq is unthrottled and a sched_entity is enqueued.

As throttled cfs_rqs are now removed from the list, we can remove the
associated test in update_blocked_averages().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38d4669aa2ef..027f8e1b5b66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		/*
+		 * With cfs_rq being unthrottled/throttled during an enqueue,
+		 * it can happen that tmp_alone_branch points to a leaf that
+		 * we ultimately want to delete. In this case, tmp_alone_branch
+		 * moves to the prev element, but it will point to
+		 * rq->leaf_cfs_rq_list at the end of the enqueue.
+		 */
+		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
+			rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
 	}
@@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		/* adjust cfs_rq_clock_task() */
 		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
 					     cfs_rq->throttled_clock_task;
+
+		/* Add cfs_rq with already running entity in the list */
+		if (cfs_rq->nr_running >= 1)
+			list_add_leaf_cfs_rq(cfs_rq);
 	}
 
 	return 0;
@@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count)
+	if (!cfs_rq->throttle_count) {
 		cfs_rq->throttled_clock_task = rq_clock_task(rq);
+		list_del_leaf_cfs_rq(cfs_rq);
+	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -4553,6 +4571,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			break;
 	}
 
+	assert_list_leaf_cfs_rq(rq);
+
 	if (!se)
 		add_nr_running(rq, task_delta);
 
@@ -7700,10 +7720,6 @@ static void update_blocked_averages(int cpu)
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
-		/* throttled entities do not contribute to load */
-		if (throttled_hierarchy(cfs_rq))
-			continue;
-
 		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
 			update_tg_load_avg(cfs_rq, 0);
 


* [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path
  2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot
@ 2019-02-11 10:48   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Vincent Guittot @ 2019-02-11 10:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, tglx, peterz, vincent.guittot, torvalds, mingo, linux-kernel

Commit-ID:  039ae8bcf7a5f4476f4487e6bf816885fb3fb617
Gitweb:     https://git.kernel.org/tip/039ae8bcf7a5f4476f4487e6bf816885fb3fb617
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 6 Feb 2019 17:14:22 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 11 Feb 2019 08:02:13 +0100

sched/fair: Fix O(nr_cgroups) in the load balancing path

This re-applies the commit reverted here:

  commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

I.e. now that a cfs_rq can be safely removed from/added to the list, we can re-apply:

 commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 027f8e1b5b66..17a961522d1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
 }
 
-/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
+				 leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 {
 }
 
-#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
-		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
+		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7702,10 +7703,27 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
+static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->load.weight)
+		return false;
+
+	if (cfs_rq->avg.load_sum)
+		return false;
+
+	if (cfs_rq->avg.util_sum)
+		return false;
+
+	if (cfs_rq->avg.runnable_load_sum)
+		return false;
+
+	return true;
+}
+
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7717,7 +7735,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
@@ -7728,6 +7746,13 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (cfs_rq_is_decayed(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10609,10 +10634,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }

