* [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path @ 2019-02-06 16:14 Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot This patchset adds the missing pieces in the management of leaf_cfs_rq_list to ensure that cfs_rqs are always correctly ordered before re-enabling commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path"). Vincent Guittot (2): sched/fair: optimization of update_blocked_averages() sched/fair: Fix O(nr_cgroups) in load balance path kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 53 insertions(+), 14 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-06 16:14 ` Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra ` (2 more replies) 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2 siblings, 3 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too so they will be added back in the right order next time. With a normal decay of pelt, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the pelt of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ffd1ae7..badf8173 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + + /* + * With cfs_rq being unthrottled/throttled during an enqueue, + * it can happen the tmp_alone_branch points the a leaf that + * we finally want to del. In this case, tmp_alone_branch moves + * to the prev element but it will point to rq->leaf_cfs_rq_list + * at the end of the enqueue. 
+ */ + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; + list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; } @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) /* adjust cfs_rq_clock_task() */ cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; + + /* Add cfs_rq with already running entity in the list */ + if (cfs_rq->nr_running >= 1) + list_add_leaf_cfs_rq(cfs_rq); } return 0; @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; /* group is entering throttled state, stop time */ - if (!cfs_rq->throttle_count) + if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); + list_del_leaf_cfs_rq(cfs_rq); + } cfs_rq->throttle_count++; return 0; @@ -7699,10 +7717,6 @@ static void update_blocked_averages(int cpu) for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; - /* throttled entities do not contribute to load */ - if (throttled_hierarchy(cfs_rq)) - continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); -- 2.7.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
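For reference, the ordering invariant the changelog above relies on — every child cfs_rq appears on rq->leaf_cfs_rq_list before its parent, so a single pass can propagate blocked load bottom-up — can be pictured with a small userspace sketch. The toy_cfs_rq structure and the check below are illustrative only and are not part of the kernel:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a cfs_rq: only the parent link matters here. */
struct toy_cfs_rq {
	const char *name;
	struct toy_cfs_rq *parent;	/* NULL for the root group */
};

/*
 * The leaf_cfs_rq_list invariant: a child is always visited before its
 * parent, so load decayed in a child has already been accounted for when
 * the parent is processed in the same pass.
 */
static bool list_is_bottom_up(struct toy_cfs_rq *list[], size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < i; j++)
			if (list[j] == list[i]->parent)
				return false;	/* parent seen before child */
	return true;
}

int main(void)
{
	struct toy_cfs_rq root = { "root", NULL };
	struct toy_cfs_rq a = { "A", &root }, b = { "B", &a };

	struct toy_cfs_rq *good[] = { &b, &a, &root };	/* child first: OK */
	struct toy_cfs_rq *bad[]  = { &a, &b, &root };	/* parent before child */

	printf("%d %d\n", list_is_bottom_up(good, 3), list_is_bottom_up(bad, 3));
	return 0;
}

In the kernel the ordering is maintained incrementally by list_add_leaf_cfs_rq() and tmp_alone_branch rather than checked after the fact; the sketch only shows the property the patch preserves when cfs_rqs are removed and later re-added.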
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot @ 2019-02-08 15:40 ` Peter Zijlstra 2019-02-08 15:44 ` Vincent Guittot 2019-02-08 16:30 ` Peter Zijlstra 2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 15:40 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds Argh head hurts!! On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > /* adjust cfs_rq_clock_task() */ > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > cfs_rq->throttled_clock_task; > + > + /* Add cfs_rq with already running entity in the list */ > + if (cfs_rq->nr_running >= 1) > + list_add_leaf_cfs_rq(cfs_rq); > } > > return 0; Do we want the below to go with the above change? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 38d4669aa2ef..0bd80a891025 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) /* update hierarchical throttle state */ walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); + assert_list_leaf_cfs_rq(rq); + if (!cfs_rq->load.weight) return; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 15:40 ` Peter Zijlstra @ 2019-02-08 15:44 ` Vincent Guittot 2019-02-08 16:30 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 15:44 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > Argh head hurts!! > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > /* adjust cfs_rq_clock_task() */ > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > cfs_rq->throttled_clock_task; > > + > > + /* Add cfs_rq with already running entity in the list */ > > + if (cfs_rq->nr_running >= 1) > > + list_add_leaf_cfs_rq(cfs_rq); > > } > > > > return 0; > > Do we want the below to go with the above change? > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 38d4669aa2ef..0bd80a891025 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > /* update hierarchical throttle state */ > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > + assert_list_leaf_cfs_rq(rq); > + Good point but this should go after the for_each_sched_entity() loop > if (!cfs_rq->load.weight) > return; > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 15:44 ` Vincent Guittot @ 2019-02-08 16:30 ` Peter Zijlstra 2019-02-08 16:47 ` Vincent Guittot 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Argh head hurts!! > > > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > > /* adjust cfs_rq_clock_task() */ > > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > > cfs_rq->throttled_clock_task; > > > + > > > + /* Add cfs_rq with already running entity in the list */ > > > + if (cfs_rq->nr_running >= 1) > > > + list_add_leaf_cfs_rq(cfs_rq); > > > } > > > > > > return 0; > > > > Do we want the below to go with the above change? > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 38d4669aa2ef..0bd80a891025 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > /* update hierarchical throttle state */ > > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > > > + assert_list_leaf_cfs_rq(rq); > > + > > Good point but this should go after the for_each_sched_entity() loop Indeed, but that loop does enqueue and can throttle again, should that not also get that additional list_add_leaf_cfs_rq() loop we added to enqueue_task_fair() to finish the add? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-08 16:47 ` Vincent Guittot 2019-02-08 16:51 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:47 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > > Argh head hurts!! > > > > > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > > > @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) > > > > /* adjust cfs_rq_clock_task() */ > > > > cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - > > > > cfs_rq->throttled_clock_task; > > > > + > > > > + /* Add cfs_rq with already running entity in the list */ > > > > + if (cfs_rq->nr_running >= 1) > > > > + list_add_leaf_cfs_rq(cfs_rq); > > > > } > > > > > > > > return 0; > > > > > > Do we want the below to go with the above change? > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 38d4669aa2ef..0bd80a891025 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4536,6 +4536,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > > /* update hierarchical throttle state */ > > > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq); > > > > > > + assert_list_leaf_cfs_rq(rq); > > > + > > > > Good point but this should go after the for_each_sched_entity() loop > > Indeed, but that loop does enqueue and can throttle again, should that > not also get that additional list_add_leaf_cfs_rq() loop we added to > enqueue_task_fair() to finish the add? Initially, I added this additional loop but finally removed it because I didn't hit it during my tests. IIRC, we are protected by throttle_count in such case, which is not the case when we enqueue a task ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:47 ` Vincent Guittot @ 2019-02-08 16:51 ` Peter Zijlstra 2019-02-08 16:51 ` Vincent Guittot 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:51 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote: > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > Good point but this should go after the for_each_sched_entity() loop > > > > Indeed, but that loop does enqueue and can throttle again, should that > > not also get that additional list_add_leaf_cfs_rq() loop we added to > > enqueue_task_fair() to finish the add? > > Initially, I added this additional loop but finally removed it because > I didn't hit it during my tests. IIRC, we are protected by > throttle_count in such case, which is not the case when we enqueue a > task Fair enough; and the to-be added assert will notify us if we got that wrong :-) I'll add the assert, no need to re-send. Thanks! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:51 ` Peter Zijlstra @ 2019-02-08 16:51 ` Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:51 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:51, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Feb 08, 2019 at 05:47:53PM +0100, Vincent Guittot wrote: > > On Fri, 8 Feb 2019 at 17:30, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Feb 08, 2019 at 04:44:53PM +0100, Vincent Guittot wrote: > > > > On Fri, 8 Feb 2019 at 16:40, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > Good point but this should go after the for_each_sched_entity() loop > > > > > > Indeed, but that loop does enqueue and can throttle again, should that > > > not also get that additional list_add_leaf_cfs_rq() loop we added to > > > enqueue_task_fair() to finish the add? > > > > Initially, I added this additional loop but finally removed it because > > I didn't hit it during my tests. IIRC, we are protected by > > throttle_count in such case, which is not the case when we enqueue a > > task > > Fair enough; and the to-be added assert will notify us if we got that > wrong :-) > > I'll add the assert, no need to re-send. Thanks > > Thanks! ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra @ 2019-02-08 16:30 ` Peter Zijlstra 2019-02-08 16:48 ` Vincent Guittot 2019-02-11 10:47 ` [tip:sched/core] sched/fair: Optimize update_blocked_averages() tip-bot for Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2019-02-08 16:30 UTC (permalink / raw) To: Vincent Guittot Cc: linux-kernel, mingo, tj, sargun, xiexiuqi, xiezhipeng1, torvalds On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > { > if (cfs_rq->on_list) { > + struct rq *rq = rq_of(cfs_rq); > + > + /* > + * With cfs_rq being unthrottled/throttled during an enqueue, > + * it can happen the tmp_alone_branch points the a leaf that > + * we finally want to del. In this case, tmp_alone_branch moves > + * to the prev element but it will point to rq->leaf_cfs_rq_list > + * at the end of the enqueue. > + */ > + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) > + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; > + > list_del_rcu(&cfs_rq->leaf_cfs_rq_list); > cfs_rq->on_list = 0; > } So that is: enqueue_task_fair() enqueue_entity() list_add_lead_cfs_rq() check_enqueue_throttle() throttle_cfs_rq() walk_tg_tree_from() tg_throttle_down() list_del_leaf_cfs_rq() Which can try and remove a cfs_rq which we just added. And because the list is a bottom-up order, and the deletion is a downward operation, we must go back (prev) in the list. So far so good I suppose. > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) > struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; > > /* group is entering throttled state, stop time */ > - if (!cfs_rq->throttle_count) > + if (!cfs_rq->throttle_count) { > cfs_rq->throttled_clock_task = rq_clock_task(rq); > + list_del_leaf_cfs_rq(cfs_rq); > + } > cfs_rq->throttle_count++; > > return 0; ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: optimization of update_blocked_averages() 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-08 16:48 ` Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-08 16:48 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Fri, 8 Feb 2019 at 17:31, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Feb 06, 2019 at 05:14:21PM +0100, Vincent Guittot wrote: > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) > > static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) > > { > > if (cfs_rq->on_list) { > > + struct rq *rq = rq_of(cfs_rq); > > + > > + /* > > + * With cfs_rq being unthrottled/throttled during an enqueue, > > + * it can happen the tmp_alone_branch points the a leaf that > > + * we finally want to del. In this case, tmp_alone_branch moves > > + * to the prev element but it will point to rq->leaf_cfs_rq_list > > + * at the end of the enqueue. > > + */ > > + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) > > + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; > > + > > list_del_rcu(&cfs_rq->leaf_cfs_rq_list); > > cfs_rq->on_list = 0; > > } > > So that is: > > enqueue_task_fair() > enqueue_entity() > list_add_lead_cfs_rq() > check_enqueue_throttle() > throttle_cfs_rq() > walk_tg_tree_from() > tg_throttle_down() > list_del_leaf_cfs_rq() > > Which can try and remove a cfs_rq which we just added. > > And because the list is a bottom-up order, and the deletion is a > downward operation, we must go back (prev) in the list. yes exactly > > So far so good I suppose. > > > @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) > > struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; > > > > /* group is entering throttled state, stop time */ > > - if (!cfs_rq->throttle_count) > > + if (!cfs_rq->throttle_count) { > > cfs_rq->throttled_clock_task = rq_clock_task(rq); > > + list_del_leaf_cfs_rq(cfs_rq); > > + } > > cfs_rq->throttle_count++; > > > > return 0; > > ^ permalink raw reply [flat|nested] 14+ messages in thread
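The tmp_alone_branch handling this exchange converges on amounts to a cursor-maintenance problem: when the element a saved list position points at is about to be unlinked, the position must first be stepped back to the previous element. A minimal sketch with a toy circular doubly linked list with a dedicated head node, in the style of the kernel's struct list_head but not using its RCU primitives (names are illustrative):

#include <stddef.h>

struct node {
	struct node *prev, *next;
};

/*
 * Unlink 'n' while keeping an external cursor valid: if the cursor points
 * at 'n', move it to the element before 'n' first, mirroring what
 * list_del_leaf_cfs_rq() does with rq->tmp_alone_branch.
 */
static void unlink_keeping_cursor(struct node *n, struct node **cursor)
{
	if (*cursor == n)
		*cursor = n->prev;	/* step back before 'n' disappears */

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

Whether stepping back one element is sufficient depends on the caller guaranteeing that the cursor is re-pointed at rq->leaf_cfs_rq_list by the end of the enqueue, which is exactly what the comment added in the patch notes.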
* [tip:sched/core] sched/fair: Optimize update_blocked_averages() 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-08 15:40 ` Peter Zijlstra 2019-02-08 16:30 ` Peter Zijlstra @ 2019-02-11 10:47 ` tip-bot for Vincent Guittot 2 siblings, 0 replies; 14+ messages in thread From: tip-bot for Vincent Guittot @ 2019-02-11 10:47 UTC (permalink / raw) To: linux-tip-commits Cc: vincent.guittot, linux-kernel, hpa, mingo, peterz, torvalds, tglx Commit-ID: 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c Gitweb: https://git.kernel.org/tip/31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c Author: Vincent Guittot <vincent.guittot@linaro.org> AuthorDate: Wed, 6 Feb 2019 17:14:21 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 11 Feb 2019 08:02:12 +0100 sched/fair: Optimize update_blocked_averages() Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 38d4669aa2ef..027f8e1b5b66 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -346,6 +346,18 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + + /* + * With cfs_rq being unthrottled/throttled during an enqueue, + * it can happen the tmp_alone_branch points the a leaf that + * we finally want to del. In this case, tmp_alone_branch moves + * to the prev element but it will point to rq->leaf_cfs_rq_list + * at the end of the enqueue. 
+ */ + if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list) + rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev; + list_del_rcu(&cfs_rq->leaf_cfs_rq_list); cfs_rq->on_list = 0; } @@ -4438,6 +4450,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) /* adjust cfs_rq_clock_task() */ cfs_rq->throttled_clock_task_time += rq_clock_task(rq) - cfs_rq->throttled_clock_task; + + /* Add cfs_rq with already running entity in the list */ + if (cfs_rq->nr_running >= 1) + list_add_leaf_cfs_rq(cfs_rq); } return 0; @@ -4449,8 +4465,10 @@ static int tg_throttle_down(struct task_group *tg, void *data) struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; /* group is entering throttled state, stop time */ - if (!cfs_rq->throttle_count) + if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_task = rq_clock_task(rq); + list_del_leaf_cfs_rq(cfs_rq); + } cfs_rq->throttle_count++; return 0; @@ -4553,6 +4571,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) break; } + assert_list_leaf_cfs_rq(rq); + if (!se) add_nr_running(rq, task_delta); @@ -7700,10 +7720,6 @@ static void update_blocked_averages(int cpu) for_each_leaf_cfs_rq(rq, cfs_rq) { struct sched_entity *se; - /* throttled entities do not contribute to load */ - if (throttled_hierarchy(cfs_rq)) - continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot @ 2019-02-06 16:14 ` Vincent Guittot 2019-02-11 10:48 ` [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path tip-bot for Vincent Guittot 2019-02-06 16:23 ` [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2 siblings, 1 reply; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:14 UTC (permalink / raw) To: linux-kernel, mingo, peterz Cc: tj, sargun, xiexiuqi, xiezhipeng1, torvalds, Vincent Guittot This reverts: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") Now that cfs_rq can be safely removed/added in the list, we can re-apply commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index badf8173..c6167bb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); } -/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) +/* Iterate thr' all leaf cfs_rq's on a runqueue */ +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ + leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? */ static inline struct cfs_rq * @@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) { } -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7699,10 +7700,27 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->load.weight) + return false; + + if (cfs_rq->avg.load_sum) + return false; + + if (cfs_rq->avg.util_sum) + return false; + + if (cfs_rq->avg.runnable_load_sum) + return false; + + return true; +} + static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7714,7 +7732,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq(rq, cfs_rq) { + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { struct sched_entity *se; if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) @@ -7725,6 +7743,13 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); + /* + * There can be a lot of idle CPU cgroups. Don't let fully + * decayed cfs_rqs linger on the list. 
+ */ + if (cfs_rq_is_decayed(cfs_rq)) + list_del_leaf_cfs_rq(cfs_rq); + /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10606,10 +10631,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; rcu_read_lock(); - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } -- 2.7.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
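One detail worth spelling out: the re-applied patch switches from for_each_leaf_cfs_rq() to a *_safe variant because update_blocked_averages() may now unlink the cfs_rq it is currently visiting. The idiom — cache the successor before the loop body runs — is shown below on a toy singly linked list; the item structure and prune_decayed() are illustrative only, not kernel code:

#include <stdlib.h>

struct item {
	struct item *next;
	int fully_decayed;	/* stand-in for cfs_rq_is_decayed() */
};

/*
 * "Safe" traversal: the successor is cached in 'pos' before the body runs,
 * so the current element can be unlinked (and here freed) without breaking
 * the walk — the same reason the patch uses for_each_leaf_cfs_rq_safe().
 */
static void prune_decayed(struct item **head)
{
	struct item **link = head;
	struct item *it, *pos;

	for (it = *head; it; it = pos) {
		pos = it->next;			/* cache before possible delete */
		if (it->fully_decayed) {
			*link = pos;		/* unlink 'it' */
			free(it);
		} else {
			link = &it->next;
		}
	}
}

The kernel variant only unlinks the cfs_rq — it stays allocated and is put back on the list by tg_unthrottle_up() or at the next enqueue; the free() here just completes the toy.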
* [tip:sched/core] sched/fair: Fix O(nr_cgroups) in the load balancing path 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-11 10:48 ` tip-bot for Vincent Guittot 0 siblings, 0 replies; 14+ messages in thread From: tip-bot for Vincent Guittot @ 2019-02-11 10:48 UTC (permalink / raw) To: linux-tip-commits Cc: hpa, tglx, peterz, vincent.guittot, torvalds, mingo, linux-kernel Commit-ID: 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 Gitweb: https://git.kernel.org/tip/039ae8bcf7a5f4476f4487e6bf816885fb3fb617 Author: Vincent Guittot <vincent.guittot@linaro.org> AuthorDate: Wed, 6 Feb 2019 17:14:22 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 11 Feb 2019 08:02:13 +0100 sched/fair: Fix O(nr_cgroups) in the load balancing path This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 027f8e1b5b66..17a961522d1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -368,9 +368,10 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); } -/* Iterate through all cfs_rq's on a runqueue in bottom-up order */ -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) +/* Iterate thr' all leaf cfs_rq's on a runqueue */ +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ + leaf_cfs_rq_list) /* Do the two (enqueued) entities belong to the same group ? 
*/ static inline struct cfs_rq * @@ -461,8 +462,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq) { } -#define for_each_leaf_cfs_rq(rq, cfs_rq) \ - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) static inline struct sched_entity *parent_entity(struct sched_entity *se) { @@ -7702,10 +7703,27 @@ static inline bool others_have_blocked(struct rq *rq) #ifdef CONFIG_FAIR_GROUP_SCHED +static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) +{ + if (cfs_rq->load.weight) + return false; + + if (cfs_rq->avg.load_sum) + return false; + + if (cfs_rq->avg.util_sum) + return false; + + if (cfs_rq->avg.runnable_load_sum) + return false; + + return true; +} + static void update_blocked_averages(int cpu) { struct rq *rq = cpu_rq(cpu); - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; const struct sched_class *curr_class; struct rq_flags rf; bool done = true; @@ -7717,7 +7735,7 @@ static void update_blocked_averages(int cpu) * Iterates the task_group tree in a bottom up fashion, see * list_add_leaf_cfs_rq() for details. */ - for_each_leaf_cfs_rq(rq, cfs_rq) { + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { struct sched_entity *se; if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) @@ -7728,6 +7746,13 @@ static void update_blocked_averages(int cpu) if (se && !skip_blocked_update(se)) update_load_avg(cfs_rq_of(se), se, 0); + /* + * There can be a lot of idle CPU cgroups. Don't let fully + * decayed cfs_rqs linger on the list. + */ + if (cfs_rq_is_decayed(cfs_rq)) + list_del_leaf_cfs_rq(cfs_rq); + /* Don't need periodic decay once load/util_avg are null */ if (cfs_rq_has_blocked(cfs_rq)) done = false; @@ -10609,10 +10634,10 @@ const struct sched_class fair_sched_class = { #ifdef CONFIG_SCHED_DEBUG void print_cfs_stats(struct seq_file *m, int cpu) { - struct cfs_rq *cfs_rq; + struct cfs_rq *cfs_rq, *pos; rcu_read_lock(); - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) print_cfs_rq(m, cpu, cfs_rq); rcu_read_unlock(); } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path 2019-02-06 16:14 [PATCH 0/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot 2019-02-06 16:14 ` [PATCH 1/2] sched/fair: optimization of update_blocked_averages() Vincent Guittot 2019-02-06 16:14 ` [PATCH 2/2] sched/fair: Fix O(nr_cgroups) in load balance path Vincent Guittot @ 2019-02-06 16:23 ` Vincent Guittot 2 siblings, 0 replies; 14+ messages in thread From: Vincent Guittot @ 2019-02-06 16:23 UTC (permalink / raw) To: linux-kernel, Ingo Molnar, Peter Zijlstra Cc: Tejun Heo, Sargun Dhillon, Xie XiuQi, xiezhipeng1, Linus Torvalds On Wed, 6 Feb 2019 at 17:14, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > This patchset adds missing pieces in the management of leaf_cfs_rq_list > to ensure that cfs_rqs are always correctly ordered before > re-enabling ("sched/fair: Fix O(nr_cgroups) in load balance path") I should have mentioned that this patchset applies on the latest tip/sched/core branch > > Vincent Guittot (2): > sched/fair: optimization of update_blocked_averages() > sched/fair: Fix O(nr_cgroups) in load balance path > > kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 53 insertions(+), 14 deletions(-) > > -- > 2.7.4 > ^ permalink raw reply [flat|nested] 14+ messages in thread