* [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
@ 2017-05-09 16:17 Tejun Heo
  2017-05-09 16:18 ` [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path Tejun Heo
  2017-05-24 23:40 ` [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tim Chen
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2017-05-09 16:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal-ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(total number of CPU cgroups),
which isn't scalable at all.

The next patch will make rq->leaf_cfs_rq_list only contain the cfs_rqs
which are currently active.  In preparation, this patch converts users
which need to traverse all cfs_rqs to use task_groups list instead.

The task_groups list is protected by its own lock and allows
RCU-protected traversal, and the order of operations guarantees that
all online cfs_rqs will be visited; however, holding rq->lock won't
protect against iterating an already unregistered cfs_rq.  The
operations of the two users that get converted -
update_runtime_enabled() and unthrottle_offline_cfs_rqs() - should be
safe to perform on already dead cfs_rqs, so adding RCU read protection
around them should be enough.
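
For reference, both converted users end up with the same traversal
pattern; a condensed sketch of the hunks below:

	rcu_read_lock();
	list_for_each_entry_rcu(tg, &task_groups, list) {
		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

		/* per-cfs_rq work; safe even on an already dead cfs_rq */
	}
	rcu_read_unlock();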

Note that print_cfs_stats() is not converted.  The next patch will
change its behavior to print out only active cfs_rqs, which is
intended as there's not much point in printing out idle cfs_rqs.

v2: Dropped strong synchronization around removal and left
    print_cfs_stats() unchanged as suggested by Peterz.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Peter,

Updated the patch - pasted in your version and updated the description
accordingly.  Please feel free to use this one or whichever you like.

This is based on mainline, and stable is cc'd.

Thanks.

 kernel/sched/fair.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4644,22 +4644,32 @@ static void destroy_cfs_bandwidth(struct
 
 static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 		raw_spin_lock(&cfs_b->lock);
 		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
 		raw_spin_unlock(&cfs_b->lock);
 	}
+	rcu_read_unlock();
 }
 
 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		if (!cfs_rq->runtime_enabled)
 			continue;
 
@@ -4677,6 +4687,7 @@ static void __maybe_unused unthrottle_of
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
 	}
+	rcu_read_unlock();
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */

* [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-09 16:17 [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tejun Heo
@ 2017-05-09 16:18 ` Tejun Heo
  2017-05-10  6:50   ` Vincent Guittot
  2017-05-10 17:36   ` [PATCH v3 " Tejun Heo
  2017-05-24 23:40 ` [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tim Chen
  1 sibling, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2017-05-09 16:18 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal-ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups), which isn't
scalable at all.

This shows up as a small increase in CPU consumption and scheduling
latency in the load balancing path on systems with the CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume a good several tens of
percent of CPU cycles running update_blocked_averages(), each run
taking multiple milliseconds.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.
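
The removal criterion is the fully-decayed check added to
update_blocked_averages(); condensed from the hunk below:

	/* empty and fully decayed: no reason to stay on the leaf list */
	if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
	    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
		list_del_leaf_cfs_rq(cfs_rq);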

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Just refreshed on top of the first patch.

 kernel/sched/fair.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
 }
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
+				 leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
 {
 }
 
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
@@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
 			update_load_avg(se, 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-09 16:18 ` [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path Tejun Heo
@ 2017-05-10  6:50   ` Vincent Guittot
  2017-05-10 14:44     ` Tejun Heo
  2017-05-10 17:36   ` [PATCH v3 " Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2017-05-10  6:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> live cfs_rqs which have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(# total cgroups) which isn't
> scalable at all.

Dietmar raised a similar optimization in the past.  The only question
was: what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on
the wake-up path?  Have you done some measurements?

>
> This shows up as a small CPU consumption and scheduling latency
> increase in the load balancing path in systems with CPU controller
> enabled across most cgroups.  In an edge case where temporary cgroups
> were leaking, this caused the kernel to consume good several tens of
> percents of CPU cycles running update_blocked_averages(), each run
> taking multiple millisecs.
>
> This patch fixes the issue by taking empty and fully decayed cfs_rqs
> off the rq->leaf_cfs_rq_list.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: stable@vger.kernel.org
> ---
> Just refreshed on top of the first patch.
>
>  kernel/sched/fair.c |   19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
>  }
>
>  /* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> -       list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)                     \
> +       list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,    \
> +                                leaf_cfs_rq_list)
>
>  /* Do the two (enqueued) entities belong to the same group ? */
>  static inline struct cfs_rq *
> @@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
>  {
>  }
>
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)     \
>                 for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
> @@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
>  static void update_blocked_averages(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq, *pos;
>         struct rq_flags rf;
>
>         rq_lock_irqsave(rq, &rf);
> @@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
>          * Iterates the task_group tree in a bottom up fashion, see
>          * list_add_leaf_cfs_rq() for details.
>          */
> -       for_each_leaf_cfs_rq(rq, cfs_rq) {
> +       for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
>                 struct sched_entity *se;
>
>                 /* throttled entities do not contribute to load */
> @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
>                 se = cfs_rq->tg->se[cpu];
>                 if (se && !skip_blocked_update(se))
>                         update_load_avg(se, 0);
> +
> +               /*
> +                * There can be a lot of idle CPU cgroups.  Don't let fully
> +                * decayed cfs_rqs linger on the list.
> +                */
> +               if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> +                   !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> +                       list_del_leaf_cfs_rq(cfs_rq);

list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
By removing a cfs_rq, can't we break this assumption in some cases?

Regards,
Vincent

>         }
>         rq_unlock_irqrestore(rq, &rf);
>  }

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-10  6:50   ` Vincent Guittot
@ 2017-05-10 14:44     ` Tejun Heo
  2017-05-10 15:55       ` Tejun Heo
  2017-05-11  7:02       ` Vincent Guittot
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2017-05-10 14:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
> On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> > live cfs_rqs which have ever been active on the CPU; unfortunately,
> > this makes update_blocked_averages() O(# total cgroups) which isn't
> > scalable at all.
> 
> Dietmar raised similar optimization in the past. The only question was
> : what is the impact of  re-adding the cfs_rq in leaf_cfs_rq_list on
> the wake up path ? Have you done some measurements ?

I didn't do a perf test yet, but it's several more branches and a
local list operation on the enqueue path, which is already pretty
expensive, vs. load balance being O(total number of cgroups on the
system).

Anyway, I'll do some hackbench tests with several levels of layering.

> > @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
> >                 se = cfs_rq->tg->se[cpu];
> >                 if (se && !skip_blocked_update(se))
> >                         update_load_avg(se, 0);
> > +
> > +               /*
> > +                * There can be a lot of idle CPU cgroups.  Don't let fully
> > +                * decayed cfs_rqs linger on the list.
> > +                */
> > +               if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> > +                   !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> > +                       list_del_leaf_cfs_rq(cfs_rq);
> 
> list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rq bottom-up.
> By removing  cfs_rq, can't we break this assumption in some cases ?

We queue a cfs_rq on the leaf list when an se is queued on that
cfs_rq for the first time, so queueing can happen in any order;
otherwise, we'd simply be doing list_add_tail().  AFAICS, removing and
re-adding shouldn't break anything if the code wasn't broken before.
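
To illustrate (simplified sketch from memory, not the exact code): the
enqueue path only attaches a cfs_rq when it becomes non-empty, and the
attach is a no-op if it's already on the list, e.g.

	/* in enqueue_entity(), roughly */
	if (cfs_rq->nr_running == 1)
		list_add_leaf_cfs_rq(cfs_rq);	/* no-op if already on_list */

so a cfs_rq that was dropped from the list simply gets re-attached the
next time it receives an entity.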

Thanks.

-- 
tejun

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-10 14:44     ` Tejun Heo
@ 2017-05-10 15:55       ` Tejun Heo
  2017-05-11  7:02       ` Vincent Guittot
  1 sibling, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2017-05-10 15:55 UTC (permalink / raw)
  To: ying.huang, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 10, 2017 at 10:44:14AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
> > On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
> > > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> > > live cfs_rqs which have ever been active on the CPU; unfortunately,
> > > this makes update_blocked_averages() O(# total cgroups) which isn't
> > > scalable at all.
> > 
> > Dietmar raised similar optimization in the past. The only question was
> > : what is the impact of  re-adding the cfs_rq in leaf_cfs_rq_list on
> > the wake up path ? Have you done some measurements ?
> 
> Didn't do a perf test yet but it's several more branches and a local
> list operation on enqueue, which is already pretty expensive vs. load
> balance being O(total number of cgroups on the system).
> 
> Anyways, I'll do some hackbench tests with several levels of layering.

Oh, BTW, do we still need bc4278987e38 ("sched/fair: Fix FTQ noise
bench regression") with this patch?  That patch looks like it's just
cutting down on the body of the O(#cgroups) loop.

Ying, can you verify whether the following patch on top of the current
linus#master 56868a460b83 also fixes the regression that you saw?

Index: work/kernel/sched/fair.c
===================================================================
--- work.orig/kernel/sched/fair.c
+++ work/kernel/sched/fair.c
@@ -372,6 +372,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+				 leaf_cfs_rq_list)
+
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
@@ -466,6 +470,9 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return NULL;
@@ -3170,36 +3177,6 @@ static inline int propagate_entity_load_
 	return 1;
 }
 
-/*
- * Check if we need to update the load and the utilization of a blocked
- * group_entity:
- */
-static inline bool skip_blocked_update(struct sched_entity *se)
-{
-	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-
-	/*
-	 * If sched_entity still have not zero load or utilization, we have to
-	 * decay it:
-	 */
-	if (se->avg.load_avg || se->avg.util_avg)
-		return false;
-
-	/*
-	 * If there is a pending propagation, we have to update the load and
-	 * the utilization of the sched_entity:
-	 */
-	if (gcfs_rq->propagate_avg)
-		return false;
-
-	/*
-	 * Otherwise, the load and the utilization of the sched_entity is
-	 * already zero and there is no pending propagation, so it will be a
-	 * waste of time to try to decay it:
-	 */
-	return true;
-}
-
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
@@ -4644,22 +4621,32 @@ static void destroy_cfs_bandwidth(struct
 
 static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 		raw_spin_lock(&cfs_b->lock);
 		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
 		raw_spin_unlock(&cfs_b->lock);
 	}
+	rcu_read_unlock();
 }
 
 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	lockdep_assert_held(&rq->lock);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		if (!cfs_rq->runtime_enabled)
 			continue;
 
@@ -4677,6 +4664,7 @@ static void __maybe_unused unthrottle_of
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
 	}
+	rcu_read_unlock();
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
@@ -6973,7 +6961,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6983,9 +6971,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct sched_entity *se;
-
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		/* throttled entities do not contribute to load */
 		if (throttled_hierarchy(cfs_rq))
 			continue;
@@ -6993,10 +6979,17 @@ static void update_blocked_averages(int
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 
-		/* Propagate pending load changes to the parent, if any: */
-		se = cfs_rq->tg->se[cpu];
-		if (se && !skip_blocked_update(se))
-			update_load_avg(se, 0);
+		/* Propagate pending load changes to the parent */
+		if (cfs_rq->tg->se[cpu])
+			update_load_avg(cfs_rq->tg->se[cpu], 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }

* [PATCH v3 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-09 16:18 ` [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path Tejun Heo
  2017-05-10  6:50   ` Vincent Guittot
@ 2017-05-10 17:36   ` Tejun Heo
  1 sibling, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2017-05-10 17:36 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Currently, rq->leaf_cfs_rq_list is a traversal-ordered list of all
live cfs_rqs which have ever been active on the CPU; unfortunately,
this makes update_blocked_averages() O(# total cgroups), which isn't
scalable at all.

This shows up as a small increase in CPU consumption and scheduling
latency in the load balancing path on systems with the CPU controller
enabled across most cgroups.  In an edge case where temporary cgroups
were leaking, this caused the kernel to consume a good several tens of
percent of CPU cycles running update_blocked_averages(), each run
taking multiple milliseconds.

This patch fixes the issue by taking empty and fully decayed cfs_rqs
off the rq->leaf_cfs_rq_list.

v3: The debug path still uses for_each_leaf_cfs_rq().  Keep it around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Hello,

The previous patch caused a build error if SCHED_DEBUG is set, as the
debug path still uses for_each_leaf_cfs_rq().  The patch is updated to
keep it around.

Thanks.

 kernel/sched/fair.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -372,6 +372,10 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
+				 leaf_cfs_rq_list)
+
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
@@ -466,6 +470,9 @@ static inline void list_del_leaf_cfs_rq(
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
+#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return NULL;
@@ -6984,7 +6991,7 @@ static void attach_tasks(struct lb_env *
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq, *pos;
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
@@ -6994,7 +7001,7 @@ static void update_blocked_averages(int
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7008,6 +7015,14 @@ static void update_blocked_averages(int
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
 			update_load_avg(se, 0);
+
+		/*
+		 * There can be a lot of idle CPU cgroups.  Don't let fully
+		 * decayed cfs_rqs linger on the list.
+		 */
+		if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
+		    !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
+			list_del_leaf_cfs_rq(cfs_rq);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 }

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-10 14:44     ` Tejun Heo
  2017-05-10 15:55       ` Tejun Heo
@ 2017-05-11  7:02       ` Vincent Guittot
  2017-05-12 13:16         ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2017-05-11  7:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 10 May 2017 at 16:44, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, May 10, 2017 at 08:50:14AM +0200, Vincent Guittot wrote:
>> On 9 May 2017 at 18:18, Tejun Heo <tj@kernel.org> wrote:
>> > Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
>> > live cfs_rqs which have ever been active on the CPU; unfortunately,
>> > this makes update_blocked_averages() O(# total cgroups) which isn't
>> > scalable at all.
>>
>> Dietmar raised similar optimization in the past. The only question was
>> : what is the impact of  re-adding the cfs_rq in leaf_cfs_rq_list on
>> the wake up path ? Have you done some measurements ?
>
> Didn't do a perf test yet but it's several more branches and a local
> list operation on enqueue, which is already pretty expensive vs. load
> balance being O(total number of cgroups on the system).
>
> Anyways, I'll do some hackbench tests with several levels of layering.
>
>> > @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
>> >                 se = cfs_rq->tg->se[cpu];
>> >                 if (se && !skip_blocked_update(se))
>> >                         update_load_avg(se, 0);
>> > +
>> > +               /*
>> > +                * There can be a lot of idle CPU cgroups.  Don't let fully
>> > +                * decayed cfs_rqs linger on the list.
>> > +                */
>> > +               if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
>> > +                   !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
>> > +                       list_del_leaf_cfs_rq(cfs_rq);
>>
>> list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rq bottom-up.
>> By removing  cfs_rq, can't we break this assumption in some cases ?
>
> We queue a cfs_rq on the leaf list when the a se is queued on that
> cfs_rq for the first time, so queueing can happen in any order;

Sorry, what I mean is:
When the group entity of a cfs_rq is enqueued, we are sure that either
the parent is already enqueued or it will be enqueued in the same
sequence.  We must be sure that no other branch will be enqueued in
the middle of the sequence and reset tmp_alone_branch.
This is true with the current implementation, but I wondered whether
it can happen if we del/add the cfs_rq out of order.

That said, I haven't found a use case that breaks the sequence.


> otherwise, we'd simply be doing list_add_tail().  AFAICS, removing and
> re-adding shouldn't break anything if the code wasn't broken before.
>
> Thanks.
>
> --
> tejun

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-11  7:02       ` Vincent Guittot
@ 2017-05-12 13:16         ` Tejun Heo
  2017-05-12 14:36           ` Vincent Guittot
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-05-12 13:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, May 11, 2017 at 09:02:22AM +0200, Vincent Guittot wrote:
> Sorry, what i mean is:
> When the group entity of a cfs_rq is enqueued, we are sure that either
> the parents is already enqueued or it will be enqueued in the same
> sequence. We must be sure that no other branch will be enqueued in the
> middle of the sequence and will reset tmp_alone_branch.
> This is true with current implementation but I  wondered it can happen
> if we del/add the cfs_rq out of order
> 
> That said i haven't find a use case that break the sequence

Hmm... a cfs_rq can be removed from the leaf list iff it's empty and
dequeued, and enqueueing is always from the top down.  If an ancestor
is already enqueued, it's guaranteed to be on the leaf list; otherwise,
it's guaranteed to be enqueued beforehand and thus put on the leaf
list too.  I think it should be fine.

Thanks.

-- 
tejun

* Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path
  2017-05-12 13:16         ` Tejun Heo
@ 2017-05-12 14:36           ` Vincent Guittot
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2017-05-12 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 12 May 2017 at 15:16, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, May 11, 2017 at 09:02:22AM +0200, Vincent Guittot wrote:
>> Sorry, what i mean is:
>> When the group entity of a cfs_rq is enqueued, we are sure that either
>> the parents is already enqueued or it will be enqueued in the same
>> sequence. We must be sure that no other branch will be enqueued in the
>> middle of the sequence and will reset tmp_alone_branch.
>> This is true with current implementation but I  wondered it can happen
>> if we del/add the cfs_rq out of order
>>
>> That said i haven't find a use case that break the sequence
>
> Hmm... a cfs_rq can be removed from leaf list iff it's empty and
> dequeued, and enqueueing is always from top down.  If an ancestor is
> already enqueued, it's guaranteed to be on the leaf list; otherwise,
> it's guaranteed to be enqueued beforehand and thus put on the leaf
> list too.  I think it should be fine.

I agree

FWIW
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Thanks.
>
> --
> tejun

* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  2017-05-09 16:17 [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tejun Heo
  2017-05-09 16:18 ` [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups) in load balance path Tejun Heo
@ 2017-05-24 23:40 ` Tim Chen
  2017-05-25 14:39   ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Tim Chen @ 2017-05-24 23:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team, mohini.narkhede



On 05/09/2017 09:17 AM, Tejun Heo wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all
> live cfs_rqs which have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(total number of CPU cgroups)
> which isn't scalable at all.
> 
> The next patch will make rq->leaf_cfs_rq_list only contain the cfs_rqs
> which are currently active.  In preparation, this patch converts users
> which need to traverse all cfs_rqs to use task_groups list instead.
> 
> task_groups list is protected by its own lock and allows RCU protected
> traversal and the order of operations guarantees that all online
> cfs_rqs will be visited, but holding rq->lock won't protect against
> iterating an already unregistered cfs_rq.  However, the operations of
> the two users that get converted - update_runtime_enabled() and
> unthrottle_offline_cfs_rqs() - should be safe to perform on already
> dead cfs_rqs, so adding rcu read protection around them should be
> enough.
> 
> Note that print_cfs_stats() is not converted.  The next patch will
> change its behavior to print out only active cfs_rqs, which is
> intended as there's not much point in printing out idle cfs_rqs.
> 
> v2: Dropped strong synchronization around removal and left
>      print_cfs_stats() unchanged as suggested by Peterz.
> 
> 

Tejun,

We did some preliminary testing of this patchset for a well-known
database benchmark on a 4-socket Skylake server system.  It provides a
3.7% throughput boost, which is significant for this benchmark.

Thanks.

Tim

* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  2017-05-24 23:40 ` [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs Tim Chen
@ 2017-05-25 14:39   ` Tejun Heo
  2017-05-26 23:04     ` Tim Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-05-25 14:39 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team, mohini.narkhede

On Wed, May 24, 2017 at 04:40:34PM -0700, Tim Chen wrote:
> We did some preliminary testing of this patchset for a well
> known database benchmark on a 4 socket Skylake server system.
> It provides a 3.7% throughput boost which is significant for
> this benchmark.

That's great to hear.  Yeah, the walk can be noticeably expensive even
with a moderate number of cgroups.  Thanks for sharing the result.

-- 
tejun

* Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs
  2017-05-25 14:39   ` Tejun Heo
@ 2017-05-26 23:04     ` Tim Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Tim Chen @ 2017-05-26 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team, mohini.narkhede



On 05/25/2017 07:39 AM, Tejun Heo wrote:
> On Wed, May 24, 2017 at 04:40:34PM -0700, Tim Chen wrote:
>> We did some preliminary testing of this patchset for a well
>> known database benchmark on a 4 socket Skylake server system.
>> It provides a 3.7% throughput boost which is significant for
>> this benchmark.
> 
> That's great to hear.  Yeah, the walk can be noticeably expensive even
> with moderate number of cgroups.  Thanks for sharing the result.
> 

Yes, the walk in update_blocked_averages has a bad scaling property as
it iterates over *all* cfs_rq's leaf tasks, making it very expensive.
It consumes 11.7% of our cpu cycles for this benchmark when CGROUP
is on.  Your patchset skips unused cgroups and reduces the overhead to
10.4%.  The CPU cycles profile is attached below for your reference.

The scheduler's frequent update of cgroup load averages, and having to
iterate all the leaf tasks for each load balance, causes
update_blocked_averages to be one of the most expensive functions in
the system, making CGROUP costly.  Without CGROUP, schedule only costs
3.3% of cpu cycles vs. 16.4% with CGROUP turned on.  Your patchset
reduces it to 14.9%.

This benchmark has thousands of running tasks, so it puts a good deal
of stress on the scheduler.

Tim


CPU cycles profile:

4.11 Before your patchset with CGROUP:
---------------------------------------

     16.42%     0.03%           280  [kernel.vmlinux]                                [k] schedule
             |
              --16.39%--schedule
                        |
                         --16.31%--__sched_text_start
                                   |
                                   |--12.85%--pick_next_task_fair
                                   |          |
                                   |           --11.71%--update_blocked_averages
                                   |                     |
                                   |                      --5.00%--update_load_avg
                                   |
                                   |--2.04%--finish_task_switch
                                   |          |
                                   |          |--0.85%--ret_from_intr
                                   |          |          |
                                   |          |           --0.85%--do_IRQ
                                   |          |
                                   |           --0.75%--apic_timer_interrupt
                                   |                     |
                                   |                      --0.75%--smp_apic_timer_interrupt
                                   |                                |
                                   |                                 --0.55%--irq_exit
                                   |                                           |
                                   |                                            --0.55%--__do_softirq
                                   |
                                    --0.51%--deactivate_task


4.11 After your patchset with CGROUP:
-------------------------------------

     14.90%     0.04%           337  [kernel.vmlinux]                                [k] schedule
             |
              --14.86%--schedule
                        |
                         --14.78%--__sched_text_start
                                   |
                                   |--11.51%--pick_next_task_fair
                                   |          |
                                   |           --10.37%--update_blocked_averages
                                   |                     |
                                   |                      --4.55%--update_load_avg
                                   |
                                   |--1.79%--finish_task_switch
                                   |          |
                                   |          |--0.77%--ret_from_intr
                                   |          |          |
                                   |          |           --0.77%--do_IRQ
                                   |          |
                                   |           --0.65%--apic_timer_interrupt
                                   |                     |
                                   |                      --0.65%--smp_apic_timer_interrupt
                                   |
                                    --0.53%--deactivate_task

4.11 with No CGROUP:
--------------------

      3.33%     0.04%           336  [kernel.vmlinux]                                [k] schedule
             |
              --3.29%--schedule
                        |
                         --3.19%--__sched_text_start
                                   |
                                    --1.45%--pick_next_task_fair
                                              |
                                               --1.15%--load_balance
                                                         |
                                                          --0.87%--find_busiest_group
                                                                    |
                                                                     --0.82%--update_sd_lb_stats
