linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg
@ 2015-11-25 19:09 Waiman Long
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

This patch series tries to reduce contention on the task_group's load_avg
to improve system performance. It also optimizes the use of the
idle_cpu() call in update_sg_lb_stats().

Waiman Long (3):
  sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  sched/fair: Move hot load_avg into its own cacheline
  sched/fair: Use different cachelines for readers and writers of
    load_avg

 kernel/sched/core.c  |    9 +++++++++
 kernel/sched/fair.c  |   40 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |   15 ++++++++++++++-
 3 files changed, 58 insertions(+), 6 deletions(-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-12-04 11:57   ` [tip:sched/core] " tip-bot for Waiman Long
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
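
For reference, idle_cpu() at this point looks roughly like the following
(simplified sketch of the 4.4-era code); the rq->curr and rq->nr_running
loads are the two cacheline accesses in question:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)		/* first cacheline load */
		return 0;

	if (rq->nr_running)			/* second cacheline load */
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}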

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/fair.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			bool *overload)
 {
 	unsigned long load;
-	int i;
+	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		if (rq->nr_running > 1)
+		nr_running = rq->nr_running;
+		if (nr_running > 1)
 			*overload = true;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
-		if (idle_cpu(i))
+		/*
+		 * No need to call idle_cpu() if nr_running is not 0
+		 */
+		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-11-30 10:23   ` Peter Zijlstra
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

When a system with a large number of sockets was driven to full
utilization, it was found that clock tick handling occupied a rather
significant proportion of CPU time when fair group scheduling was
enabled, which is the default in most distributions.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se which are accessed rather
frequently during clock tick processing.
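
For reference, a simplified sketch of the relevant part of struct
task_group before this patch (other fields elided; layout as in the
4.4-era kernel/sched/sched.h):

struct task_group {
	struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
	/* read on every tick by update_cfs_shares()/calc_tg_weight() */
	struct sched_entity **se;
	struct cfs_rq **cfs_rq;
	unsigned long shares;
#ifdef CONFIG_SMP
	/* written frequently; shares a cacheline with the fields above */
	atomic_long_t load_avg;
#endif
#endif
	/* ... */
};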

This patch moves the load_avg variable into another cacheline separated
from the other frequently accessed variables. By doing so, the perf
profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %CPU time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/sched.h |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;
 
 #ifdef	CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-11-30 10:22   ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system,
the cacheline contention can cause significant slowdown and impact
performance.
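
For context, the read side is calc_tg_weight(), called from
update_cfs_shares() on every tick. A simplified sketch (the exact local
term added back is elided) of how it touches load_avg:

static long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
{
	long tg_weight;

	/* the contended read of the shared tg->load_avg cacheline */
	tg_weight  = atomic_long_read(&tg->load_avg);
	/* replace this cfs_rq's stale contribution with its current one */
	tg_weight -= cfs_rq->tg_load_avg_contrib;
	/* ... plus this CPU's own up-to-date load (elided) */

	return tg_weight;
}

Meanwhile the write side, update_tg_load_avg(), does an atomic_long_add()
on the same cacheline from every CPU in the group.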

This patch attempts to separate those load_avg readers
(update_cfs_shares) and writers (task_tick_fair) to use different
cachelines instead. Writers of load_avg will now accumulate the
load delta into load_avg_delta which sits in a different cacheline.
If load_avg_delta is sufficiently large (> load_avg/64), it will then
be added back to load_avg.

Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:

   9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares

After the patch, it became:

   2.94%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   2.52%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
   2.25%   0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
   2.21%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
   1.70%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
   0.96%   0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
   0.61%   0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares

The benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 916011    Critical-jOps: 142366
  After patch  - Max-jOPs: 939130    Critical-jOps: 211937

There was significant improvement in Critical-jOps which was latency
sensitive.

This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated when running on 4-socket or larger systems
which can get the most benefit from it.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/core.c  |    9 +++++++++
 kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    8 ++++++++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
 		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
 
+#ifdef CONFIG_SMP
+		/*
+		 * Use load_avg_delta if not 2P or less
+		 */
+		root_task_group.use_la_delta = (num_possible_nodes() > 2);
+#endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	tg->use_la_delta = root_task_group.use_la_delta;
+#endif
 	return tg;
 
 err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
  * and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		struct task_group *tg = cfs_rq->tg;
+		long load_avg, tot_delta;
+
+		if (!tg->use_la_delta) {
+			/*
+			 * If the use_la_delta isn't set, just add the
+			 * delta directly into load_avg.
+			 */
+			atomic_long_add(delta, &tg->load_avg);
+			goto set_contrib;
+		}
+
+		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
+		load_avg = atomic_long_read(&tg->load_avg);
+		if (abs(tot_delta) > load_avg / 64) {
+			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+			if (tot_delta)
+				atomic_long_add(tot_delta, &tg->load_avg);
+		}
+set_contrib:
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
 	 * load_avg can be heavily contended at clock tick time, so put
 	 * it in its own cacheline separated from the fields above which
 	 * will also be accessed at each tick.
+	 *
+	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
+	 * to accumulate the delta and only change load_avg when the delta
+	 * is big enough. This reduces the cacheline contention on load_avg.
+	 * This flag will be set at allocation time depending on the system
+	 * configuration.
 	 */
+	int use_la_delta;
 	atomic_long_t load_avg ____cacheline_aligned;
+	atomic_long_t load_avg_delta ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
@ 2015-11-30 10:22   ` Peter Zijlstra
  2015-11-30 19:13     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 10:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du


Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system,
> the cacheline contention can cause significant slowdown and impact
> performance.
> 
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulate the
> load delta into load_avg_delta which sits in a different cacheline.
> If load_avg_delta is sufficiently large (> load_avg/64), it will then
> be added back to load_avg.
> 
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
> 
>    9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
> 
> After the patch, it became:
> 
>    2.94%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    2.52%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>    2.25%   0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>    2.21%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>    1.70%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>    0.96%   0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>    0.61%   0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares

This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
> 
>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>   After patch  - Max-jOPs: 939130    Critical-jOps: 211937
> 
> There was significant improvement in Critical-jOps which was latency
> sensitive.
> 
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/sched/core.c  |    9 +++++++++
>  kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    8 ++++++++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  		ptr += nr_cpu_ids * sizeof(void **);
>  
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Use load_avg_delta if not 2P or less
> +		 */
> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	if (!alloc_rt_sched_group(tg, parent))
>  		goto err;
>  
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> +	tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
>  	return tg;
>  
>  err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>   * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		struct task_group *tg = cfs_rq->tg;
> +		long load_avg, tot_delta;
> +
> +		if (!tg->use_la_delta) {
> +			/*
> +			 * If the use_la_delta isn't set, just add the
> +			 * delta directly into load_avg.
> +			 */
> +			atomic_long_add(delta, &tg->load_avg);
> +			goto set_contrib;
> +		}
> +
> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> +		load_avg = atomic_long_read(&tg->load_avg);
> +		if (abs(tot_delta) > load_avg / 64) {
> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> +			if (tot_delta)
> +				atomic_long_add(tot_delta, &tg->load_avg);
> +		}
> +set_contrib:
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>  	}
>  }

I'm thinking that it's now far too big to retain the inline qualifier.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
>  	 * load_avg can be heavily contended at clock tick time, so put
>  	 * it in its own cacheline separated from the fields above which
>  	 * will also be accessed at each tick.
> +	 *
> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
> +	 * to accumulate the delta and only change load_avg when the delta
> +	 * is big enough. This reduces the cacheline contention on load_avg.
> +	 * This flag will be set at allocation time depending on the system
> +	 * configuration.
>  	 */
> +	int use_la_delta;
>  	atomic_long_t load_avg ____cacheline_aligned;
> +	atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.

>  #endif
>  #endif
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-11-30 10:23   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 10:23 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch

On Wed, Nov 25, 2015 at 02:09:39PM -0500, Waiman Long wrote:
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
>  
>  #ifdef	CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;

Same as with the other patch; this only works if the structure itself is
cacheline aligned, which I don't think it is.

>  #endif
>  #endif
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 10:22   ` Peter Zijlstra
@ 2015-11-30 19:13     ` Waiman Long
  2015-11-30 22:09       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2015-11-30 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the
>> load delta into load_avg_delta which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (>  load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>>     9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>     7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>     7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>     7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>     5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>>     4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>>     2.94%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     2.52%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>     2.25%   0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>     2.21%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>     1.70%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>     0.96%   0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>>     0.61%   0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though. 
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56 
times.

>> The benchmark results before and after the patch were as follows:
>>
>>    Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>>    After patch  - Max-jOPs: 939130    Critical-jOps: 211937
>>
>> There was significant improvement in Critical-jOps which was latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems
>> which can get the most benefit from it.
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need 
pointers to those tests, though.

>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/sched/core.c  |    9 +++++++++
>>   kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h |    8 ++++++++
>>   3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>>   		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>   		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> +		/*
>> +		 * Use load_avg_delta if not 2P or less
>> +		 */
>> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>>   #ifdef CONFIG_RT_GROUP_SCHED
>>   		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>>   	if (!alloc_rt_sched_group(tg, parent))
>>   		goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> +	tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>>   	return tg;
>>
>>   err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>>   /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>>    * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>>    */
>>   static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>>   {
>>   	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>>   	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> +		struct task_group *tg = cfs_rq->tg;
>> +		long load_avg, tot_delta;
>> +
>> +		if (!tg->use_la_delta) {
>> +			/*
>> +			 * If the use_la_delta isn't set, just add the
>> +			 * delta directly into load_avg.
>> +			 */
>> +			atomic_long_add(delta, &tg->load_avg);
>> +			goto set_contrib;
>> +		}
>> +
>> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> +		load_avg = atomic_long_read(&tg->load_avg);
>> +		if (abs(tot_delta) > load_avg / 64) {
>> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> +			if (tot_delta)
>> +				atomic_long_add(tot_delta, &tg->load_avg);
>> +		}
>> +set_contrib:
>>   		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>   	}
>>   }
> I'm thinking that it's now far too big to retain the inline qualifier.

I can take the inline keyword out.

>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>>   	 * load_avg can be heavily contended at clock tick time, so put
>>   	 * it in its own cacheline separated from the fields above which
>>   	 * will also be accessed at each tick.
>> +	 *
>> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> +	 * to accumulate the delta and only change load_avg when the delta
>> +	 * is big enough. This reduces the cacheline contention on load_avg.
>> +	 * This flag will be set at allocation time depending on the system
>> +	 * configuration.
>>   	 */
>> +	int use_la_delta;
>>   	atomic_long_t load_avg ____cacheline_aligned;
>> +	atomic_long_t load_avg_delta ____cacheline_aligned;
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure 
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag 
increased the size to 512 bytes. So it did make the structure a multiple 
of the cacheline size. With both slub and slab, the allocated task group 
pointers from kzalloc() in sched_create_group() were all multiples of 
0x200. So they were properly aligned for the ____cacheline_aligned tag 
to work.

> Also, you unconditionally grow the structure by a whole cacheline.

I know it is a drawback of using the ____cacheline_aligned tag. However, we 
probably won't create too many task groups in normal use. So the 
increase in memory consumption shouldn't be noticeable.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13     ` Waiman Long
@ 2015-11-30 22:09       ` Peter Zijlstra
  2015-12-01  3:55         ` Waiman Long
  2015-11-30 22:29       ` Peter Zijlstra
  2015-11-30 22:32       ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This begs the question though; why are you running a global load in a
> >cgroup; and do we really need to update this for the root cgroup? It
> >seems to me we don't need calc_tg_weight() for the root cgroup, it
> >doesn't need to normalize its weight numbers.
> >
> >That is; isn't this simply a problem we should avoid?
> 
> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> times.

Yeah, can you kill autogroup and see if that helps? If not, we probably
should add some code to avoid calculating things for the root group.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13     ` Waiman Long
  2015-11-30 22:09       ` Peter Zijlstra
@ 2015-11-30 22:29       ` Peter Zijlstra
  2015-12-01  4:00         ` Waiman Long
  2015-11-30 22:32       ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This would only work if the structure itself is allocated with cacheline
> >alignment, and looking at sched_create_group(), we use a plain kzalloc()
> >for this, which doesn't guarantee any sort of alignment beyond machine
> >word size IIRC.
> 
> With a RHEL 6 derived .config file, the size of the task_group structure was
> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
> increased the size to 512 bytes. So it did make the structure a multiple of
> the cacheline size. With both slub and slab, the allocated task group
> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
> So they were properly aligned for the ____cacheline_aligned tag to work.

Not sure we should rely on sl*b doing the right thing here.
KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
explicit alignment, one should use KMEM_CACHE().

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13     ` Waiman Long
  2015-11-30 22:09       ` Peter Zijlstra
  2015-11-30 22:29       ` Peter Zijlstra
@ 2015-11-30 22:32       ` Peter Zijlstra
  2 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> >Please always Cc the people who wrote the code.
> >
> >+CC pjt, ben, morten, yuyang
> 
> Sorry for that. Their names didn't show up when I did get_maintainer.pl.

Ah, I never much use get_maintainers.pl, but it might be that --git-blame
would yield some of the people who poked at this code. But I suspect it's
all yuyang now, seeing how he recently reworked it all.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 22:09       ` Peter Zijlstra
@ 2015-12-01  3:55         ` Waiman Long
  2015-12-01  8:49           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-12-01  3:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This begs the question though; why are you running a global load in a
>>> cgroup; and do we really need to update this for the root cgroup? It
>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>> doesn't need to normalize its weight numbers.
>>>
>>> That is; isn't this simply a problem we should avoid?
>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>> times.
> Yeah, can you kill autogroup and see if that helps? If not, we probably
> should add some code to avoid calculating things for the root group.

I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in 
the distribution kernels. So we still need to look at that with 
autogroup enabled.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 22:29       ` Peter Zijlstra
@ 2015-12-01  4:00         ` Waiman Long
  2015-12-01  8:47           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-12-01  4:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:29 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This would only work if the structure itself is allocated with cacheline
>>> alignment, and looking at sched_create_group(), we use a plain kzalloc()
>>> for this, which doesn't guarantee any sort of alignment beyond machine
>>> word size IIRC.
>> With a RHEL 6 derived .config file, the size of the task_group structure was
>> 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
>> increased the size to 512 bytes. So it did make the structure a multiple of
>> the cacheline size. With both slub and slab, the allocated task group
>> pointers from kzalloc() in sched_create_group() were all multiples of 0x200.
>> So they were properly aligned for the ____cacheline_aligned tag to work.
> Not sure we should rely on sl*b doing the right thing here.
> KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
> explicit alignment, one should use KMEM_CACHE().

I think the current kernel uses power-of-2 kmemcaches to satisfy kmalloc() 
requests except when the size is less than or equal to 192 where there 
are some non-power-of-2 kmemcaches available. Given that the task_group 
structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't 
hit the case that the allocated buffer is not cacheline aligned.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  4:00         ` Waiman Long
@ 2015-12-01  8:47           ` Peter Zijlstra
  2015-12-02 18:44             ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-12-01  8:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:

> I think the current kernel uses power-of-2 kmemcaches to satisfy kmalloc()
> requests except when the size is less than or equal to 192 where there are
> some non-power-of-2 kmemcaches available. Given that the task_group
> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
> the case that the allocated buffer is not cacheline aligned.

Using out-of-object storage is allowed (none of the existing sl*b
allocators do so iirc).

That is, its perfectly valid for a sl*b allocator for 64 byte objects to
allocate 72 bytes for each object and use the 'spare' 8 bytes for object
tracking or whatnot.

That would respect the minimum alignment guarantee of 8 bytes but not
provide the 'expected' object size alignment you're assuming.

Also, we have the proper interfaces to request the explicit alignment
for a reason. So if you need the alignment for correctness, use those.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  3:55         ` Waiman Long
@ 2015-12-01  8:49           ` Peter Zijlstra
  2015-12-01 10:44             ` Mike Galbraith
  2015-12-02 18:48             ` Waiman Long
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-12-01  8:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >>>This begs the question though; why are you running a global load in a
> >>>cgroup; and do we really need to update this for the root cgroup? It
> >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> >>>doesn't need to normalize its weight numbers.
> >>>
> >>>That is; isn't this simply a problem we should avoid?
> >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> >>times.
> >Yeah, can you kill autogroup and see if that helps? If not, we probably
> >should add some code to avoid calculating things for the root group.
> 
> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> distribution kernels. So we still need to look at that with autogroup
> enabled.

Meh, or just tell the people that have stupid large machines to use
noautogroup on boot (it's of questionable benefit in the first place imo,
esp. on servers).

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:49           ` Peter Zijlstra
@ 2015-12-01 10:44             ` Mike Galbraith
  2015-12-02 18:48             ` Waiman Long
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-12-01 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Scott J Norton,
	Douglas Hatch, Paul Turner, Ben Segall, Morten Rasmussen,
	Yuyang Du

On Tue, 2015-12-01 at 09:49 +0100, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> > On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> > >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> > >>>This begs the question though; why are you running a global load in a
> > >>>cgroup; and do we really need to update this for the root cgroup? It
> > >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> > >>>doesn't need to normalize its weight numbers.
> > >>>
> > >>>That is; isn't this simply a problem we should avoid?
> > >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> > >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> > >>times.
> > >Yeah, can you kill autogroup and see if that helps? If not, we probably
> > >should add some code to avoid calculating things for the root group.
> > 
> > I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
> > distribution kernels. So we still need to look at that with autogroup
> > enabled.
> 
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place imo,
> esp. on servers).

Yup (and yup).  Someone should also suggest to the systemd(isease) folks
that they try actually measuring before turning everything in the world
on.  "Oo cgroups are cool" is NOT a good reason to turn it all on :)

	-Mike


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:47           ` Peter Zijlstra
@ 2015-12-02 18:44             ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2015-12-02 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 12/01/2015 03:47 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
>
>> I think the current kernel use power-of-2 kmemcaches to satisfy kalloc()
>> requests except when the size is less than or equal to 192 where there are
>> some non-power-of-2 kmemcaches available. Given that the task_group
>> structure is large enough with FAIR_GROUP_SCHED enabled, we shouldn't hit
>> the case that the allocated buffer is not cacheline aligned.
> Using out-of-object storage is allowed (none of the existing sl*b
> allocators do so iirc).
>
> That is, its perfectly valid for a sl*b allocator for 64 byte objects to
> allocate 72 bytes for each object and use the 'spare' 8 bytes for object
> tracking or whatnot.
>
> That would respect the minimum alignment guarantee of 8 bytes but not
> provide the 'expected' object size alignment you're assuming.
>
> Also, we have the proper interfaces to request the explicit alignment
> for a reason. So if you need the alignment for correctness, use those.

Thanks for the tip. I have just sent out an updated patch set which 
creates a cacheline-aligned kmem cache for the task group structure. That 
should work under all kernel config settings.
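
For illustration only (this is not the actual follow-up patch, and the
name and placement below are assumptions), such a cache could look
roughly like:

/* a dedicated, properly aligned cache for struct task_group */
static struct kmem_cache *task_group_cache __read_mostly;

void __init sched_init(void)
{
	/* ... */
	/*
	 * KMEM_CACHE() uses __alignof__(struct task_group), so the
	 * ____cacheline_aligned members keep their alignment regardless
	 * of what kmalloc() would have provided.
	 */
	task_group_cache = KMEM_CACHE(task_group, 0);
	/* ... */
}

struct task_group *sched_create_group(struct task_group *parent)
{
	struct task_group *tg;

	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
	if (!tg)
		return ERR_PTR(-ENOMEM);
	/* ... */
	return tg;
}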

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:49           ` Peter Zijlstra
  2015-12-01 10:44             ` Mike Galbraith
@ 2015-12-02 18:48             ` Waiman Long
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2015-12-02 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 12/01/2015 03:49 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
>> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
>>> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>>>> This begs the question though; why are you running a global load in a
>>>>> cgroup; and do we really need to update this for the root cgroup? It
>>>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>>>> doesn't need to normalize its weight numbers.
>>>>>
>>>>> That is; isn't this simply a problem we should avoid?
>>>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>>>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>>>> times.
>>> Yeah, can you kill autogroup and see if that helps? If not, we probably
>>> should add some code to avoid calculating things for the root group.
>> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in the
>> distribution kernels. So we still need to look at that with autogroup
>> enabled.
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place imo,
> esp. on servers).

Yes, I was able to recover most of the lost performance by disabling 
autogroup. I did send out a new patch to disable the load_avg update for 
the root_task_group. I need that for backporting to earlier kernels, which 
were forced to update load_avg on every clock tick even for the root_task_group.
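
A sketch of that kind of change (the actual patch is not in this thread;
this assumes the pre-series update_tg_load_avg() and the global
root_task_group):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/*
	 * The root_task_group never has its shares recomputed, so there
	 * is no need to maintain its global load_avg at all.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}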

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tip:sched/core] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
@ 2015-12-04 11:57   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Waiman Long @ 2015-12-04 11:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Waiman.Long, scott.norton, hpa, efault, mingo, peterz, torvalds,
	tglx, linux-kernel, doug.hatch

Commit-ID:  a426f99c91d1036767a7819aaaba6bd3191b7f06
Gitweb:     http://git.kernel.org/tip/a426f99c91d1036767a7819aaaba6bd3191b7f06
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Wed, 25 Nov 2015 14:09:38 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:47 +0100

sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efd664c..4b0e8b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6398,7 +6398,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			bool *overload)
 {
 	unsigned long load;
-	int i;
+	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -6415,7 +6415,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		if (rq->nr_running > 1)
+		nr_running = rq->nr_running;
+		if (nr_running > 1)
 			*overload = true;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -6423,7 +6424,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
-		if (idle_cpu(i))
+		/*
+		 * No need to call idle_cpu() if nr_running is not 0
+		 */
+		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
 	}
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

Thread overview: 18+ messages
2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-04 11:57   ` [tip:sched/core] " tip-bot for Waiman Long
2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-11-30 10:23   ` Peter Zijlstra
2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
2015-11-30 10:22   ` Peter Zijlstra
2015-11-30 19:13     ` Waiman Long
2015-11-30 22:09       ` Peter Zijlstra
2015-12-01  3:55         ` Waiman Long
2015-12-01  8:49           ` Peter Zijlstra
2015-12-01 10:44             ` Mike Galbraith
2015-12-02 18:48             ` Waiman Long
2015-11-30 22:29       ` Peter Zijlstra
2015-12-01  4:00         ` Waiman Long
2015-12-01  8:47           ` Peter Zijlstra
2015-12-02 18:44             ` Waiman Long
2015-11-30 22:32       ` Peter Zijlstra
