* [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg
@ 2015-12-02 18:41 Waiman Long
  2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-02 18:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch, Waiman Long

v1->v2:
 - Add a cacheline-aligned kmem_cache for task_group to make sure that
   allocated task_group objects always start on a cacheline boundary
   even if slab debugging is turned on.
 - Scrap the original patch 3 and replace it with another one that
   disables the load_avg update for the root_task_group.

This patch series tries to reduce contention on the task_group's
load_avg to improve system performance. It also optimizes the use of
the idle_cpu() call in update_sg_lb_stats().

Waiman Long (3):
  sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  sched/fair: Move hot load_avg into its own cacheline
  sched/fair: Disable tg load_avg update for root_task_group

 kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c  |   16 +++++++++++++---
 kernel/sched/sched.h |    7 ++++++-
 3 files changed, 53 insertions(+), 6 deletions(-)



* [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
@ 2015-12-02 18:41 ` Waiman Long
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
  2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
  2 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-02 18:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch, Waiman Long

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
is done by calling idle_cpu(). The idle_cpu() function, in turn,
checks a number of fields within the run queue structure, such as
rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first, followed by nr_running. Since nr_running has already
been read by update_sg_lb_stats() a few lines earlier, it makes no
sense to pull in another cacheline when nr_running is not 0, as
idle_cpu() will always return false in that case.
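
For reference, idle_cpu() in kernels of this vintage looks roughly
like the following (sketch only; minor details may differ between
versions):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)	/* pulls in the rq->curr cacheline  */
		return 0;

	if (rq->nr_running)		/* cacheline already read by caller */
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}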

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/fair.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			bool *overload)
 {
 	unsigned long load;
-	int i;
+	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		if (rq->nr_running > 1)
+		nr_running = rq->nr_running;
+		if (nr_running > 1)
 			*overload = true;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
-		if (idle_cpu(i))
+		/*
+		 * No need to call idle_cpu() if nr_running is not 0
+		 */
+		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
 	}
 
-- 
1.7.1



* [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
@ 2015-12-02 18:41 ` Waiman Long
  2015-12-02 20:02   ` bsegall
                     ` (4 more replies)
  2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
  2 siblings, 5 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-02 18:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch, Waiman Long

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all allocated task_group structures are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    7 ++++++-
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..e39204f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
+#endif
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7356,6 +7361,7 @@ void __init sched_init(void)
 		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
 
+		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
+/*
+ * Make sure that the task_group structure is cacheline aligned when
+ * fair group scheduling is enabled.
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_group *alloc_task_group(void)
+{
+	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+	kmem_cache_free(task_group_cache, tg);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline struct task_group *alloc_task_group(void)
+{
+	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+	kfree(tg);
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 static void free_sched_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	kfree(tg);
+	free_task_group(tg);
 }
 
 /* allocate runqueue etc for a new task group */
@@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 
-	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+	tg = alloc_task_group();
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;
 
 #ifdef	CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1



* [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group
  2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-12-02 18:41 ` Waiman Long
  2015-12-02 19:55   ` bsegall
  2015-12-04 11:58   ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long
  2 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-02 18:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch, Waiman Long

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes, even for the
root_task_group, where the load_avg value is never used. This patch
disables the load_avg update when the given task group is the
root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair(), which includes
update_tg_load_avg(), decreased from 0.71% to 0.22%, a more than 3X
reduction. The Max-jOPs result also increased slightly, from 983015
to 986449.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/fair.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..4607cb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2670,6 +2670,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
+	/*
+	 * No need to update load_avg for root_task_group as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
-- 
1.7.1



* Re: [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group
  2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
@ 2015-12-02 19:55   ` bsegall
  2015-12-04 11:58   ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long
  1 sibling, 0 replies; 22+ messages in thread
From: bsegall @ 2015-12-02 19:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Morten Rasmussen, Scott J Norton, Douglas Hatch

Waiman Long <Waiman.Long@hpe.com> writes:

> Currently, the update_tg_load_avg() function attempts to update the
> tg's load_avg value whenever the load changes even for root_task_group
> where the load_avg value will never be used. This patch will disable
> the load_avg update when the given task group is the root_task_group.
>
> Running a Java benchmark with noautogroup and a 4.3 kernel on a
> 16-socket IvyBridge-EX system, the amount of CPU time (as reported by
> perf) consumed by task_tick_fair() which includes update_tg_load_avg()
> decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
> results also increased slightly from 983015 to 986449.
>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>

Reviewed-by: Ben Segall <bsegall@google.com>
> ---
>  kernel/sched/fair.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..4607cb7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2670,6 +2670,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  
> +	/*
> +	 * No need to update load_avg for root_task_group as it is not used.
> +	 */
> +	if (cfs_rq->tg == &root_task_group)
> +		return;
> +
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-12-02 20:02   ` bsegall
  2015-12-03 19:26     ` Waiman Long
  2015-12-03  4:32   ` Mike Galbraith
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: bsegall @ 2015-12-02 20:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Morten Rasmussen, Scott J Norton, Douglas Hatch

Waiman Long <Waiman.Long@hpe.com> writes:

> If a system with large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contains variables like shares, cfs_rq & se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline aligned kmemcache for task_group to make sure
> that all the allocated task_group's are cacheline aligned.
>
> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch was as follows:
>
>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    7 ++++++-
>  2 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..e39204f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
> +#endif
>  #endif
>  
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  		ptr += nr_cpu_ids * sizeof(void **);
>  
> +		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);

The KMEM_CACHE macro suggests adding ____cacheline_aligned_in_smp to
the struct definition instead.

>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>  
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  static void free_sched_group(struct task_group *tg)
>  {
>  	free_fair_sched_group(tg);
>  	free_rt_sched_group(tg);
>  	autogroup_free(tg);
> -	kfree(tg);
> +	free_task_group(tg);
>  }
>  
>  /* allocate runqueue etc for a new task group */
> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  {
>  	struct task_group *tg;
>  
> -	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> +	tg = alloc_task_group();
>  	if (!tg)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..e679895 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
>  
>  #ifdef	CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;
>  #endif
>  #endif

I suppose the question is whether it would be better to just move this
so it winds up on a separate cacheline without the extra empty space,
though that would likely be more fragile and unclear.


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
  2015-12-02 20:02   ` bsegall
@ 2015-12-03  4:32   ` Mike Galbraith
  2015-12-03 19:34     ` Waiman Long
  2015-12-03 10:56   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2015-12-03  4:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Ben Segall, Morten Rasmussen, Scott J Norton,
	Douglas Hatch

On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:

> By doing so, the perf profile became:
> 
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
> 
> The %cpu time is still pretty high, but it is better than before.

Is that with the box booted skew_tick=1?

	-Mike



* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
  2015-12-02 20:02   ` bsegall
  2015-12-03  4:32   ` Mike Galbraith
@ 2015-12-03 10:56   ` Peter Zijlstra
  2015-12-03 19:38     ` Waiman Long
  2015-12-03 11:12   ` Peter Zijlstra
  2015-12-04 11:57   ` [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable " tip-bot for Waiman Long
  4 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-03 10:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */

I think we can simply always use the kmem_cache; both SLAB and SLUB
merge slab caches where appropriate.


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
                     ` (2 preceding siblings ...)
  2015-12-03 10:56   ` Peter Zijlstra
@ 2015-12-03 11:12   ` Peter Zijlstra
  2015-12-03 17:56     ` bsegall
  2015-12-03 19:56     ` Waiman Long
  2015-12-04 11:57   ` [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable " tip-bot for Waiman Long
  4 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-03 11:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch



I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long <Waiman.Long@hpe.com>
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all allocated task_group structures are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
---
 kernel/sched/core.c  |   10 +++++++---
 kernel/sched/sched.h |    7 ++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+	task_group_cache = KMEM_CACHE(task_group, 0);
+
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
-
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -7687,7 +7691,7 @@ static void free_sched_group(struct task
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	kfree(tg);
+	kmem_cache_free(task_group_cache, tg);
 }
 
 /* allocate runqueue etc for a new task group */
@@ -7695,7 +7699,7 @@ struct task_group *sched_create_group(st
 {
 	struct task_group *tg;
 
-	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;
 
 #ifdef	CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 11:12   ` Peter Zijlstra
@ 2015-12-03 17:56     ` bsegall
  2015-12-03 18:17       ` Peter Zijlstra
  2015-12-03 19:56     ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: bsegall @ 2015-12-03 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

Peter Zijlstra <peterz@infradead.org> writes:

> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long <Waiman.Long@hpe.com>
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
[...]
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +
>  	list_add(&root_task_group.list, &task_groups);
>  	INIT_LIST_HEAD(&root_task_group.children);
>  	INIT_LIST_HEAD(&root_task_group.siblings);
>  	autogroup_init(&init_task);
> -
>  #endif /* CONFIG_CGROUP_SCHED */
>  
>  	for_each_possible_cpu(i) {
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
>  
>  #ifdef	CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;
>  #endif
>  #endif
>  

This loses the cacheline-alignment for task_group, is that ok?


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 17:56     ` bsegall
@ 2015-12-03 18:17       ` Peter Zijlstra
  2015-12-03 18:23         ` bsegall
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-03 18:17 UTC (permalink / raw)
  To: bsegall
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

On Thu, Dec 03, 2015 at 09:56:02AM -0800, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >  
> >  #ifdef CONFIG_CGROUP_SCHED
> > +	task_group_cache = KMEM_CACHE(task_group, 0);
> > +
> >  	list_add(&root_task_group.list, &task_groups);
> >  	INIT_LIST_HEAD(&root_task_group.children);
> >  	INIT_LIST_HEAD(&root_task_group.siblings);
> >  	autogroup_init(&init_task);
> > -
> >  #endif /* CONFIG_CGROUP_SCHED */
> >  
> >  	for_each_possible_cpu(i) {
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -248,7 +248,12 @@ struct task_group {
> >  	unsigned long shares;
> >  
> >  #ifdef	CONFIG_SMP
> > -	atomic_long_t load_avg;
> > +	/*
> > +	 * load_avg can be heavily contended at clock tick time, so put
> > +	 * it in its own cacheline separated from the fields above which
> > +	 * will also be accessed at each tick.
> > +	 */
> > +	atomic_long_t load_avg ____cacheline_aligned;
> >  #endif
> >  #endif
> >  
> 
> This loses the cacheline-alignment for task_group, is that ok?

I'm a bit dense (it's late); can you spell that out? Did you mean my
killing SLAB_HWCACHE_ALIGN? That should not matter, because:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
		sizeof(struct __struct), __alignof__(struct __struct),\
		(__flags), NULL)

picks up the alignment explicitly.

And struct task_group having one cacheline-aligned member means that
the alignment of the composite object (the struct proper) must be an
integer multiple of this member's alignment (the multiple typically
being 1).
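
A quick userspace sketch of the same point (purely illustrative: the
struct, the field names and the 64-byte line size are assumptions, not
the real task_group):

#include <stdio.h>
#include <stddef.h>

/* stand-in for the kernel's ____cacheline_aligned, 64-byte lines assumed */
#define toy_cacheline_aligned __attribute__((__aligned__(64)))

struct toy_tg {
	unsigned long shares;			/* read-mostly field */
	long load_avg toy_cacheline_aligned;	/* contended counter */
};

int main(void)
{
	/* the single aligned member raises the alignment of the composite */
	printf("alignof : %zu\n", (size_t)__alignof__(struct toy_tg));
	/* ...and pushes load_avg onto the start of its own 64-byte line   */
	printf("offsetof: %zu\n", offsetof(struct toy_tg, load_avg));
	printf("sizeof  : %zu\n", sizeof(struct toy_tg));
	return 0;
}

On a typical x86-64 build this prints 64 / 64 / 128, and that 64 is
exactly the __alignof__ that KMEM_CACHE() passes on to
kmem_cache_create(), which is why SLAB_HWCACHE_ALIGN is redundant here.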




* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 18:17       ` Peter Zijlstra
@ 2015-12-03 18:23         ` bsegall
  0 siblings, 0 replies; 22+ messages in thread
From: bsegall @ 2015-12-03 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Dec 03, 2015 at 09:56:02AM -0800, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>
>> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>> >  #endif /* CONFIG_RT_GROUP_SCHED */
>> >  
>> >  #ifdef CONFIG_CGROUP_SCHED
>> > +	task_group_cache = KMEM_CACHE(task_group, 0);
>> > +
>> >  	list_add(&root_task_group.list, &task_groups);
>> >  	INIT_LIST_HEAD(&root_task_group.children);
>> >  	INIT_LIST_HEAD(&root_task_group.siblings);
>> >  	autogroup_init(&init_task);
>> > -
>> >  #endif /* CONFIG_CGROUP_SCHED */
>> >  
>> >  	for_each_possible_cpu(i) {
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -248,7 +248,12 @@ struct task_group {
>> >  	unsigned long shares;
>> >  
>> >  #ifdef	CONFIG_SMP
>> > -	atomic_long_t load_avg;
>> > +	/*
>> > +	 * load_avg can be heavily contended at clock tick time, so put
>> > +	 * it in its own cacheline separated from the fields above which
>> > +	 * will also be accessed at each tick.
>> > +	 */
>> > +	atomic_long_t load_avg ____cacheline_aligned;
>> >  #endif
>> >  #endif
>> >  
>> 
>> This loses the cacheline-alignment for task_group, is that ok?
>
> I'm a bit dense (its late) can you spell that out? Did you mean me
> killing SLAB_HWCACHE_ALIGN? That should not matter because:
>
> #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
> 		sizeof(struct __struct), __alignof__(struct __struct),\
> 		(__flags), NULL)
>
> picks up the alignment explicitly.
>
> And struct task_group having one cacheline aligned member, means that
> the alignment of the composite object (the struct proper) must be an
> integer multiple of this (typically 1).

Ah, yeah, I forgot about this, my fault.


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-02 20:02   ` bsegall
@ 2015-12-03 19:26     ` Waiman Long
  2015-12-03 19:41       ` bsegall
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-12-03 19:26 UTC (permalink / raw)
  To: bsegall
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Morten Rasmussen, Scott J Norton, Douglas Hatch

On 12/02/2015 03:02 PM, bsegall@google.com wrote:
> Waiman Long<Waiman.Long@hpe.com>  writes:
>
>> If a system with large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>>    10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>     8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>     8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>     8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>     6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>>     5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contains variables like shares, cfs_rq&  se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline aligned kmemcache for task_group to make sure
>> that all the allocated task_group's are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>>     9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>     7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>     7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>     7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>     5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>     4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch was as follows:
>>
>>    Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>>    After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h |    7 ++++++-
>>   2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>>    */
>>   struct task_group root_task_group;
>>   LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>>   #endif
>>
>>   DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>>   		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>   		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests instead adding
> ____cacheline_aligned_in_smp to the struct definition instead.

The main goal is to have load_avg placed in a new cacheline separated
from the read-only fields above. That is why I placed
____cacheline_aligned after load_avg. I omitted the in_smp part
because it is in the SMP block already. Putting
____cacheline_aligned_in_smp on the struct definition won't guarantee
the alignment of any particular field within the structure.

I have done some tests, and having ____cacheline_aligned inside the
structure has the same effect of forcing the whole structure onto a
cacheline-aligned boundary.
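
If one wanted to enforce both properties at build time, a hypothetical
check (not part of this patch, and assuming the CONFIG_FAIR_GROUP_SCHED
and CONFIG_SMP guards are satisfied) could look like:

	/* hypothetical sanity checks, not in the patch: break the build if
	 * load_avg stops starting its own cacheline or task_group loses
	 * its cacheline alignment */
	BUILD_BUG_ON(offsetof(struct task_group, load_avg) % L1_CACHE_BYTES);
	BUILD_BUG_ON(__alignof__(struct task_group) < L1_CACHE_BYTES);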

>>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>>   #ifdef CONFIG_RT_GROUP_SCHED
>>   		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>>   /* task_group_lock serializes the addition/removal of task groups */
>>   static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>   static void free_sched_group(struct task_group *tg)
>>   {
>>   	free_fair_sched_group(tg);
>>   	free_rt_sched_group(tg);
>>   	autogroup_free(tg);
>> -	kfree(tg);
>> +	free_task_group(tg);
>>   }
>>
>>   /* allocate runqueue etc for a new task group */
>> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>>   {
>>   	struct task_group *tg;
>>
>> -	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>> +	tg = alloc_task_group();
>>   	if (!tg)
>>   		return ERR_PTR(-ENOMEM);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index efd3bfc..e679895 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -248,7 +248,12 @@ struct task_group {
>>   	unsigned long shares;
>>
>>   #ifdef	CONFIG_SMP
>> -	atomic_long_t load_avg;
>> +	/*
>> +	 * load_avg can be heavily contended at clock tick time, so put
>> +	 * it in its own cacheline separated from the fields above which
>> +	 * will also be accessed at each tick.
>> +	 */
>> +	atomic_long_t load_avg ____cacheline_aligned;
>>   #endif
>>   #endif
> I suppose the question is if it would be better to just move this to
> wind up on a separate cacheline without the extra empty space, though it
> would likely be more fragile and unclear.

I have been thinking about that too. The problem is that anything that
shares a cacheline with load_avg and has to be accessed at clock tick
time will cause the same contention problem. In the current layout,
the fields after load_avg are the rt stuff as well as some list head
structures and pointers. The rt stuff should be kind of mutually
exclusive of the CFS load_avg in terms of usage. The list head
structures and pointers don't seem to be accessed that frequently. So
this is the right place to start a new cacheline boundary.

Cheers,
Longman


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03  4:32   ` Mike Galbraith
@ 2015-12-03 19:34     ` Waiman Long
  2015-12-04  2:07       ` Mike Galbraith
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-12-03 19:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Ben Segall, Morten Rasmussen, Scott J Norton,
	Douglas Hatch

On 12/02/2015 11:32 PM, Mike Galbraith wrote:
> On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:
>
>> By doing so, the perf profile became:
>>
>>     9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>     7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>     7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>     7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>     5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>     4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before.
> Is that with the box booted skew_tick=1?
>
> 	-Mike
>

I haven't tried that kernel parameter. I will try it to see if it can
improve the situation. BTW, are there any other undesirable side
effects of using it besides the increased power consumption mentioned
in the kernel-parameters.txt file?

Cheers,
Longman


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 10:56   ` Peter Zijlstra
@ 2015-12-03 19:38     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-03 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

On 12/03/2015 05:56 AM, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> I think we can simply always use the kmem_cache, both slab and slub
> merge slabcaches where appropriate.

I did that because I was not sure how much overhead the introduction
of a new kmem_cache would bring. It seems that it is not really an
issue, so I am fine with making the kmem_cache change permanent.

Cheers,
Longman


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 19:26     ` Waiman Long
@ 2015-12-03 19:41       ` bsegall
  0 siblings, 0 replies; 22+ messages in thread
From: bsegall @ 2015-12-03 19:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Morten Rasmussen, Scott J Norton, Douglas Hatch

Waiman Long <waiman.long@hpe.com> writes:

> On 12/02/2015 03:02 PM, bsegall@google.com wrote:
>> Waiman Long<Waiman.Long@hpe.com>  writes:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index efd3bfc..e679895 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -248,7 +248,12 @@ struct task_group {
>>>   	unsigned long shares;
>>>
>>>   #ifdef	CONFIG_SMP
>>> -	atomic_long_t load_avg;
>>> +	/*
>>> +	 * load_avg can be heavily contended at clock tick time, so put
>>> +	 * it in its own cacheline separated from the fields above which
>>> +	 * will also be accessed at each tick.
>>> +	 */
>>> +	atomic_long_t load_avg ____cacheline_aligned;
>>>   #endif
>>>   #endif
>> I suppose the question is if it would be better to just move this to
>> wind up on a separate cacheline without the extra empty space, though it
>> would likely be more fragile and unclear.
>
> I have been thinking about that too. The problem is anything that will be in the
> same cacheline as load_avg and have to be accessed at clock click time will
> cause the same contention problem. In the current layout, the fields after
> load_avg are the rt stuff as well some list head structure and pointers. The rt
> stuff should be kind of mutually exclusive of the CFS load_avg in term of usage.
> The list head structure and pointers don't seem to be that frequently accessed.
> So it is the right place to start a new cacheline boundary.
>
> Cheers,
> Longman

Yeah, this is a good place to start a new boundary; I was just saying
that you could probably remove the empty space by reordering fields,
but that would be a less logical ordering in terms of programmer
clarity.
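
A toy contrast of the two options (field names and sizes made up;
64-bit build and 64-byte cachelines assumed):

/* (a) explicit alignment, as in the patch: ~40 bytes of padding sit
 *     before hot, but the layout stays robust when fields change */
struct tg_padded {
	long a, b, c;					/* 24 bytes of cold data */
	long hot __attribute__((__aligned__(64)));	/* starts at offset 64   */
};							/* sizeof == 128         */

/* (b) reordering: fill the first line with exactly 64 bytes of cold
 *     data so hot happens to start line two -- no padding, but nothing
 *     enforces it once a field is added or removed, and the struct's
 *     own alignment no longer forces allocations onto a line boundary */
struct tg_reordered {
	long a, b, c, d, e, f, g, h;			/* 64 bytes of cold data */
	long hot;					/* offset 64 by accident */
};							/* sizeof == 72          */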


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 11:12   ` Peter Zijlstra
  2015-12-03 17:56     ` bsegall
@ 2015-12-03 19:56     ` Waiman Long
  2015-12-03 20:03       ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-12-03 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

On 12/03/2015 06:12 AM, Peter Zijlstra wrote:
>
> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long<Waiman.Long@hpe.com>
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
> If a system with large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>    10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>     9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>     8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>     8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>     8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>     6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>     5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contains variables like shares, cfs_rq&  se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline aligned kmemcache for task_group to make sure
> that all the allocated task_group's are cacheline aligned.
>
> By doing so, the perf profile became:
>
>     9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>     8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>     7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>     7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>     7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>     5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>     4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch was as follows:
>
>    Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>    After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Cc: Scott J Norton<scott.norton@hpe.com>
> Cc: Douglas Hatch<doug.hatch@hpe.com>
> Cc: Ingo Molnar<mingo@redhat.com>
> Cc: Yuyang Du<yuyang.du@intel.com>
> Cc: Paul Turner<pjt@google.com>
> Cc: Ben Segall<bsegall@google.com>
> Cc: Morten Rasmussen<morten.rasmussen@arm.com>
> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
> ---
>   kernel/sched/core.c  |   10 +++++++---
>   kernel/sched/sched.h |    7 ++++++-
>   2 files changed, 13 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
>    */
>   struct task_group root_task_group;
>   LIST_HEAD(task_groups);
> +
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
>   #endif
>
>   DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>   #endif /* CONFIG_RT_GROUP_SCHED */
>
>   #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +
Thanks for making that change.

Do we need to add the SLAB_HWCACHE_ALIGN flag? Or we could make a
helper flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED
is defined. Other than that, I am fine with the change.

Cheers,
Longman


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 19:56     ` Waiman Long
@ 2015-12-03 20:03       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-12-03 20:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Yuyang Du, Paul Turner, Ben Segall,
	Morten Rasmussen, Scott J Norton, Douglas Hatch

On Thu, Dec 03, 2015 at 02:56:37PM -0500, Waiman Long wrote:
> >  #ifdef CONFIG_CGROUP_SCHED
> >+	task_group_cache = KMEM_CACHE(task_group, 0);
> >+
> Thanks for making that change.
> 
> Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
> flag that define SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
> Other than that, I am fine with the change.

I don't think we need that, see my reply earlier to Ben.


* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-03 19:34     ` Waiman Long
@ 2015-12-04  2:07       ` Mike Galbraith
  2015-12-04 20:19         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Galbraith @ 2015-12-04  2:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Ben Segall, Morten Rasmussen, Scott J Norton,
	Douglas Hatch

On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
> On 12/02/2015 11:32 PM, Mike Galbraith wrote:

> > Is that with the box booted skew_tick=1?

> I haven't tried that kernel parameter. I will try it to see if it can 
> improve the situation. BTW, will there be other undesirable side effects 
> of using this other than the increased power consumption as said in the 
> kernel-parameters.txt file?

None that are known.  I kinda doubt you'd notice the power, but you
should see a notable performance boost.  Who knows, with a big enough
farm of busy big boxen, it may save power by needing fewer of them.

	-Mike 



* [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline
  2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
                     ` (3 preceding siblings ...)
  2015-12-03 11:12   ` Peter Zijlstra
@ 2015-12-04 11:57   ` tip-bot for Waiman Long
  4 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Waiman Long @ 2015-12-04 11:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, mingo, pjt, peterz, Waiman.Long, efault, bsegall,
	linux-kernel, doug.hatch, tglx, yuyang.du, morten.rasmussen,
	scott.norton

Commit-ID:  b0367629acf62a78404c467cd09df447c2fea804
Gitweb:     http://git.kernel.org/tip/b0367629acf62a78404c467cd09df447c2fea804
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Wed, 2 Dec 2015 13:41:49 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:48 +0100

sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all allocated task_group structures are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 10 +++++++---
 kernel/sched/sched.h |  7 ++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d591db1..aa3f978 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7370,6 +7370,9 @@ int in_sched_functions(unsigned long addr)
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7427,11 +7430,12 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+	task_group_cache = KMEM_CACHE(task_group, 0);
+
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
-
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -7712,7 +7716,7 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	kfree(tg);
+	kmem_cache_free(task_group_cache, tg);
 }
 
 /* allocate runqueue etc for a new task group */
@@ -7720,7 +7724,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 
-	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 472cd14..a5a6b3e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;
 
 #ifdef	CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread
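
For readers outside kernel/sched, below is a minimal, hypothetical
sketch of the layout trick the patch above relies on; the struct,
field and cache names are made up for illustration and are not the
scheduler's own:

  #include <linux/atomic.h>
  #include <linux/cache.h>
  #include <linux/errno.h>
  #include <linux/init.h>
  #include <linux/list.h>
  #include <linux/slab.h>

  struct stats_group {
          /* read-mostly fields, touched on every tick */
          unsigned long weight;
          struct list_head children;

          /*
           * Written by many CPUs; ____cacheline_aligned starts it on a
           * new cacheline so the writers do not keep invalidating the
           * read-mostly fields above.
           */
          atomic_long_t total ____cacheline_aligned;
  };

  static struct kmem_cache *stats_group_cache __read_mostly;

  static int __init stats_group_init(void)
  {
          /*
           * KMEM_CACHE() passes __alignof__(struct stats_group) to the
           * slab allocator; the annotation above raises that alignment
           * to a full cacheline, so every object handed out by this
           * cache starts on a cacheline boundary.
           */
          stats_group_cache = KMEM_CACHE(stats_group, 0);
          return stats_group_cache ? 0 : -ENOMEM;
  }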

* [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group
  2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
  2015-12-02 19:55   ` bsegall
@ 2015-12-04 11:58   ` tip-bot for Waiman Long
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Waiman Long @ 2015-12-04 11:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, hpa, efault, torvalds, yuyang.du, pjt,
	linux-kernel, morten.rasmussen, doug.hatch, bsegall, tglx,
	scott.norton, Waiman.Long

Commit-ID:  aa0b7ae06387d40a988ce16a189082dee6e570bc
Gitweb:     http://git.kernel.org/tip/aa0b7ae06387d40a988ce16a189082dee6e570bc
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Wed, 2 Dec 2015 13:41:50 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:49 +0100

sched/fair: Disable the task group load_avg update for the root_task_group

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes, even for the
root_task_group, whose load_avg value is never used. This patch
disables the load_avg update when the given task group is the
root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair(), which includes update_tg_load_avg(),
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
result also increased slightly, from 983015 to 986449.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b0e8b8..1093873 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2709,6 +2709,12 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
+	/*
+	 * No need to update load_avg for root_task_group as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;

^ permalink raw reply related	[flat|nested] 22+ messages in thread
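
For context, a rough, hypothetical sketch of the resulting shape of
the update path: a per-CPU contributor folds its delta into the shared
total only when that total is actually consumed for the group and the
delta is large enough to matter. The names below are made up for
illustration and are not the scheduler's own:

  #include <linux/atomic.h>
  #include <linux/kernel.h>       /* abs() */
  #include <linux/types.h>        /* bool */

  struct agg_group {
          atomic_long_t total;    /* shared, written by many CPUs */
  };

  /* the aggregate of the root group is never read by anyone */
  extern struct agg_group root_group;

  struct agg_contrib {
          struct agg_group *grp;
          long value;             /* current local value */
          long last_contrib;      /* value last folded into grp->total */
  };

  static void agg_update(struct agg_contrib *c, bool force)
  {
          long delta = c->value - c->last_contrib;

          /* No need to touch the shared cacheline for the root group. */
          if (c->grp == &root_group)
                  return;

          /*
           * Only update when forced or when the change exceeds roughly
           * 1/64 of the last contribution, to limit cacheline traffic.
           */
          if (force || abs(delta) > c->last_contrib / 64) {
                  atomic_long_add(delta, &c->grp->total);
                  c->last_contrib = c->value;
          }
  }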

* Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-12-04  2:07       ` Mike Galbraith
@ 2015-12-04 20:19         ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-12-04 20:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Yuyang Du,
	Paul Turner, Ben Segall, Morten Rasmussen, Scott J Norton,
	Douglas Hatch

On 12/03/2015 09:07 PM, Mike Galbraith wrote:
> On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
>> On 12/02/2015 11:32 PM, Mike Galbraith wrote:
>>> Is that with the box booted skew_tick=1?
>> I haven't tried that kernel parameter. I will try it to see if it can
>> improve the situation. BTW, will there be other undesirable side effects
>> of using this other than the increased power consumption as said in the
>> kernel-parameters.txt file?
> Not that are known.  I kinda doubt you'd notice the power, but you
> should see a notable performance boost.  Who knows, with a big enough
> farm of busy big boxen, it may save power by needing fewer of them.
>
> 	-Mike
>

You are right. Using skew_tick=1 did improve performance and reduced
the overhead of clock tick processing rather significantly. I think it
should be the default for large SMP boxes. Thanks for the tip.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-12-04 20:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-12-02 20:02   ` bsegall
2015-12-03 19:26     ` Waiman Long
2015-12-03 19:41       ` bsegall
2015-12-03  4:32   ` Mike Galbraith
2015-12-03 19:34     ` Waiman Long
2015-12-04  2:07       ` Mike Galbraith
2015-12-04 20:19         ` Waiman Long
2015-12-03 10:56   ` Peter Zijlstra
2015-12-03 19:38     ` Waiman Long
2015-12-03 11:12   ` Peter Zijlstra
2015-12-03 17:56     ` bsegall
2015-12-03 18:17       ` Peter Zijlstra
2015-12-03 18:23         ` bsegall
2015-12-03 19:56     ` Waiman Long
2015-12-03 20:03       ` Peter Zijlstra
2015-12-04 11:57   ` [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable " tip-bot for Waiman Long
2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
2015-12-02 19:55   ` bsegall
2015-12-04 11:58   ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).