* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:28 [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
@ 2014-08-04 19:15 ` Yuyang Du
  2014-08-04 21:42   ` Yuyang Du
                     ` (2 more replies)
  2014-08-04 20:52 ` bsegall
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 16+ messages in thread
From: Yuyang Du @ 2014-08-04 19:15 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

Hi Jason,

I am not sure whether you noticed my latest work, a rewrite of the per-entity load average tracking:

http://article.gmane.org/gmane.linux.kernel/1760754
http://article.gmane.org/gmane.linux.kernel/1760755
http://article.gmane.org/gmane.linux.kernel/1760757
http://article.gmane.org/gmane.linux.kernel/1760756

which simply does not track blocked load average at all. Are you interested in
testing the patchset with the workload you have? The comparison can also help
us understand the rewrite. Overall, per our tests, the overhead should be lower,
and performance should be better.

Thanks,
Yuyang

On Mon, Aug 04, 2014 at 01:28:38PM -0700, Jason Low wrote:
> When running workloads on 2+ socket systems, based on perf profiles, the
> update_cfs_rq_blocked_load function constantly shows up as taking up a
> noticeable % of run time. This is especially apparent on an 8 socket
> machine. For example, when running the AIM7 custom workload, we see:
> 
>    4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> 
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> update the tg load contribution stats.  However, it turns out that in many
> cases, they don't need to be updated and "tg_contrib" is 0.
> 
> This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> tg load contribution stats when nothing needs to be updated. This avoids
> unnecessary cacheline contention. In the above case, with the patch, perf
> reports that the total time spent in this function went down by more than
> a factor of 3:
> 
>    1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> 
> Signed-off-by: Jason Low <jason.low2@hp.com>
> ---
>  kernel/sched/fair.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..8d4cc72 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2377,6 +2377,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>  	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>  	tg_contrib -= cfs_rq->tg_load_contrib;
>  
> +	if (!tg_contrib)
> +		return;
> +
>  	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>  		atomic_long_add(tg_contrib, &tg->load_avg);
>  		cfs_rq->tg_load_contrib += tg_contrib;
> -- 
> 1.7.1
> 


* [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
@ 2014-08-04 20:28 Jason Low
  2014-08-04 19:15 ` Yuyang Du
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Jason Low @ 2014-08-04 20:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jason Low
  Cc: linux-kernel, Ben Segall, Waiman Long, Mel Gorman,
	Mike Galbraith, Rik van Riel, Aswin Chandramouleeswaran,
	Chegu Vinod, Scott J Norton

When running workloads on 2+ socket systems, based on perf profiles, the
update_cfs_rq_blocked_load function constantly shows up as taking up a
noticeable % of run time. This is especially apparent on an 8 socket
machine. For example, when running the AIM7 custom workload, we see:

   4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load

Much of the contention is in __update_cfs_rq_tg_load_contrib when we
update the tg load contribution stats.  However, it turns out that in many
cases, they don't need to be updated and "tg_contrib" is 0.

This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
tg load contribution stats when nothing needs to be updated. This avoids
unnecessary cacheline contention. In the above case, with the patch, perf
reports that the total time spent in this function went down by more than
a factor of 3:

   1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load

Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..8d4cc72 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2377,6 +2377,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
+	if (!tg_contrib)
+		return;
+
 	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
-- 
1.7.1





* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:28 [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
  2014-08-04 19:15 ` Yuyang Du
@ 2014-08-04 20:52 ` bsegall
  2014-08-04 21:27   ` Jason Low
  2014-08-11 17:31   ` Jason Low
  2014-08-04 21:04 ` Peter Zijlstra
  2014-08-05 17:53 ` Waiman Long
  3 siblings, 2 replies; 16+ messages in thread
From: bsegall @ 2014-08-04 20:52 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton, pjt

Jason Low <jason.low2@hp.com> writes:

> When running workloads on 2+ socket systems, based on perf profiles, the
> update_cfs_rq_blocked_load function constantly shows up as taking up a
> noticeable % of run time. This is especially apparent on an 8 socket
> machine. For example, when running the AIM7 custom workload, we see:
>
>    4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
>
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> update the tg load contribution stats.  However, it turns out that in many
> cases, they don't need to be updated and "tg_contrib" is 0.
>
> This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> tg load contribution stats when nothing needs to be updated. This avoids
> unnecessary cacheline contention. In the above case, with the patch, perf
> reports that the total time spent in this function went down by more than
> a factor of 3:
>
>    1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
>
> Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Ben Segall <bsegall@google.com>

That said, it might be better to remove force_update for this function,
or make it just reduce the minimum to /64 or something. If the test is
easy to run, it would be good to see what it's like with the force_update
param removed for this function, to see whether it's worth worrying about
or whether the zero case catches ~all of the perf gain. Paul, your thoughts?
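
Roughly, the two options I mean (a sketch only; the /64 divisor is purely
illustrative, not a tuned value):

	/* Option 1 (sketch): drop force_update for this function */
	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}

	/* Option 2 (sketch): force_update only tightens the threshold */
	long div = force_update ? 64 : 8;

	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / div) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}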

> ---
>  kernel/sched/fair.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..8d4cc72 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2377,6 +2377,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>  	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>  	tg_contrib -= cfs_rq->tg_load_contrib;
>  
> +	if (!tg_contrib)
> +		return;
> +
>  	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>  		atomic_long_add(tg_contrib, &tg->load_avg);
>  		cfs_rq->tg_load_contrib += tg_contrib;


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:28 [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
  2014-08-04 19:15 ` Yuyang Du
  2014-08-04 20:52 ` bsegall
@ 2014-08-04 21:04 ` Peter Zijlstra
  2014-08-05 17:53 ` Waiman Long
  3 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2014-08-04 21:04 UTC (permalink / raw)
  To: Jason Low
  Cc: Ingo Molnar, linux-kernel, Ben Segall, Waiman Long, Mel Gorman,
	Mike Galbraith, Rik van Riel, Aswin Chandramouleeswaran,
	Chegu Vinod, Scott J Norton

On Mon, Aug 04, 2014 at 01:28:38PM -0700, Jason Low wrote:
> When running workloads on 2+ socket systems, based on perf profiles, the
> update_cfs_rq_blocked_load function constantly shows up as taking up a
> noticeable % of run time. This is especially apparent on an 8 socket
> machine. For example, when running the AIM7 custom workload, we see:
> 
>    4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> 
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> update the tg load contribution stats.  However, it turns out that in many
> cases, they don't need to be updated and "tg_contrib" is 0.
> 
> This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> tg load contribution stats when nothing needs to be updated. This avoids
> unnecessary cacheline contention. In the above case, with the patch, perf
> reports that the total time spent in this function went down by more than
> a factor of 3:
> 
>    1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> 

Supposedly adding something like __cacheline_aligned to
task_group::load_avg is also rumoured to help.
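
For illustration, roughly this (a sketch; the surrounding task_group fields
are elided, and the in-struct form of the macro is ____cacheline_aligned):

	struct task_group {
		...
		/* hot cross-CPU atomic: give it its own cacheline so
		 * updates don't false-share with neighbouring fields */
		atomic_long_t load_avg ____cacheline_aligned;
		...
	};

The cost is a little padding per task_group, in exchange for less false
sharing on the update path.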


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:52 ` bsegall
@ 2014-08-04 21:27   ` Jason Low
  2014-08-11 17:31   ` Jason Low
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Low @ 2014-08-04 21:27 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton, pjt

On Mon, 2014-08-04 at 13:52 -0700, bsegall@google.com wrote:
> Jason Low <jason.low2@hp.com> writes:
> 
> > When running workloads on 2+ socket systems, based on perf profiles, the
> > update_cfs_rq_blocked_load function constantly shows up as taking up a
> > noticeable % of run time. This is especially apparent on an 8 socket
> > machine. For example, when running the AIM7 custom workload, we see:
> >
> >    4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> >
> > Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> > update the tg load contribution stats.  However, it turns out that in many
> > cases, they don't need to be updated and "tg_contrib" is 0.
> >
> > This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> > tg load contribution stats when nothing needs to be updated. This avoids
> > unnecessary cacheline contention. In the above case, with the patch, perf
> > reports that the total time spent in this function went down by more than
> > a factor of 3:
> >
> >    1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> >
> > Signed-off-by: Jason Low <jason.low2@hp.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> 
> That said, it might be better to remove force_update for this function,
> or make it just reduce the minimum to /64 or something. If the test is
> easy to run, it would be good to see what it's like with the force_update
> param removed for this function, to see whether it's worth worrying about
> or whether the zero case catches ~all of the perf gain.

Sure, I can test that out too. I did notice when running another AIM7
workload that !zero was the more common case, so this has the potential
to further reduce contention.

>  Paul, your thoughts?




* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 19:15 ` Yuyang Du
@ 2014-08-04 21:42   ` Yuyang Du
  2014-08-05 15:42   ` Jason Low
  2014-08-06 18:21   ` Jason Low
  2 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2014-08-04 21:42 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

On Tue, Aug 05, 2014 at 03:15:26AM +0800, Yuyang Du wrote:
> Hi Jason,
> 
> I am not sure whether you noticed my latest work, a rewrite of the per-entity load average tracking:
> 
> http://article.gmane.org/gmane.linux.kernel/1760754
> http://article.gmane.org/gmane.linux.kernel/1760755
> http://article.gmane.org/gmane.linux.kernel/1760757
> http://article.gmane.org/gmane.linux.kernel/1760756
> 
> which simply does not track blocked load average at all.

Actually, it does track blocked load, but along with runnable load, so there
is no extra overhead.
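
Conceptually (a much-simplified sketch, not the actual patch code), each
entity keeps a single decayed average that is updated whether the entity
is runnable or blocked, so no separate blocked sum has to be maintained:

	/*
	 * Simplified sketch: decay the one per-entity sum for the
	 * elapsed periods regardless of state; only runnable time
	 * adds new contribution, blocked time just decays.
	 */
	sa->load_sum = decay_load(sa->load_sum, periods);
	if (runnable)
		sa->load_sum += contrib * se->load.weight;
	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);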

> Are you interested in
> testing the patchset with the workload you have? The comparison can also help
> us understand the rewrite. Overall, per our tests, the overhead should be lower,
> and performance should be better.
> 
> 
> On Mon, Aug 04, 2014 at 01:28:38PM -0700, Jason Low wrote:
> > When running workloads on 2+ socket systems, based on perf profiles, the
> > update_cfs_rq_blocked_load function constantly shows up as taking up a
> > noticeable % of run time. This is especially apparent on an 8 socket
> > machine. For example, when running the AIM7 custom workload, we see:
> > 
> >    4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> > 
> > Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> > update the tg load contribution stats.  However, it turns out that in many
> > cases, they don't need to be updated and "tg_contrib" is 0.
> > 
> > This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> > tg load contribution stats when nothing needs to be updated. This avoids
> > unnecessary cacheline contention. In the above case, with the patch, perf
> > reports that the total time spent in this function went down by more than
> > a factor of 3:
> > 
> >    1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
> > 
> > Signed-off-by: Jason Low <jason.low2@hp.com>
> > ---
> >  kernel/sched/fair.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bfa3c86..8d4cc72 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2377,6 +2377,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> >  	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> >  	tg_contrib -= cfs_rq->tg_load_contrib;
> >  
> > +	if (!tg_contrib)
> > +		return;
> > +
> >  	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> >  		atomic_long_add(tg_contrib, &tg->load_avg);
> >  		cfs_rq->tg_load_contrib += tg_contrib;
> > -- 
> > 1.7.1


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 19:15 ` Yuyang Du
  2014-08-04 21:42   ` Yuyang Du
@ 2014-08-05 15:42   ` Jason Low
  2014-08-06 18:21   ` Jason Low
  2 siblings, 0 replies; 16+ messages in thread
From: Jason Low @ 2014-08-05 15:42 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

On Tue, 2014-08-05 at 03:15 +0800, Yuyang Du wrote:
> Hi Jason,
> 
> I am not sure whether you noticed my latest work, a rewrite of the per-entity load average tracking:
> 
> http://article.gmane.org/gmane.linux.kernel/1760754
> http://article.gmane.org/gmane.linux.kernel/1760755
> http://article.gmane.org/gmane.linux.kernel/1760757
> http://article.gmane.org/gmane.linux.kernel/1760756
> 
> which simply does not track blocked load average at all. Are you interested in
> testing the patchset with the workload you have?

Hi Yuyang, yes, I can also test your latest patchset with some of the
AIM7 workloads. Not needing the extra overhead for tracking blocked load
should also address this contention in update_cfs_rq_blocked_load().

>  The comparison can also help
> us understand the rewrite. Overall, per our tests, the overhead should be lower,
> and performance should be better.




* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:28 [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
                   ` (2 preceding siblings ...)
  2014-08-04 21:04 ` Peter Zijlstra
@ 2014-08-05 17:53 ` Waiman Long
  3 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2014-08-05 17:53 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

On 08/04/2014 04:28 PM, Jason Low wrote:
> When running workloads on 2+ socket systems, based on perf profiles, the
> update_cfs_rq_blocked_load function constantly shows up as taking up a
> noticeable % of run time. This is especially apparent on an 8 socket
> machine. For example, when running the AIM7 custom workload, we see:
>
>     4.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
>
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we
> update the tg load contribution stats.  However, it turns out that in many
> cases, they don't need to be updated and "tg_contrib" is 0.
>
> This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
> tg load contribution stats when nothing needs to be updated. This avoids
> unnecessary cacheline contention. In the above case, with the patch, perf
> reports that the total time spent in this function went down by more than
> a factor of 3:
>
>     1.18%        reaim  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> ---
>   kernel/sched/fair.c |    3 +++
>   1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..8d4cc72 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2377,6 +2377,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>   	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>   	tg_contrib -= cfs_rq->tg_load_contrib;
>
> +	if (!tg_contrib)
> +		return;
> +
>   	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>   		atomic_long_add(tg_contrib, &tg->load_avg);
>   		cfs_rq->tg_load_contrib += tg_contrib;
Reviewed-by: Waiman Long <Waiman.Long@hp.com>



* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 19:15 ` Yuyang Du
  2014-08-04 21:42   ` Yuyang Du
  2014-08-05 15:42   ` Jason Low
@ 2014-08-06 18:21   ` Jason Low
  2014-08-07 18:02     ` Yuyang Du
  2 siblings, 1 reply; 16+ messages in thread
From: Jason Low @ 2014-08-06 18:21 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

On Tue, 2014-08-05 at 03:15 +0800, Yuyang Du wrote:
> Hi Jason,
> 
> I am not sure whether you noticed my latest work, a rewrite of the per-entity load average tracking:
> 
> http://article.gmane.org/gmane.linux.kernel/1760754
> http://article.gmane.org/gmane.linux.kernel/1760755
> http://article.gmane.org/gmane.linux.kernel/1760757
> http://article.gmane.org/gmane.linux.kernel/1760756
> 
> which simply does not track blocked load average at all. Are you interested in
> testing the patchset with the workload you have?

Hi Yuyang,

I ran these tests with most of the AIM7 workloads to compare
performance between a 3.16 kernel and the kernel with these patches
applied.

The table below contains the percent difference between the baseline
kernel and the kernel with the patches at various user counts. A
positive percent means the kernel with the patches performed better,
while a negative percent means the baseline performed better.

Based on these numbers, for many of the workloads, the change was
beneficial in the highly contended cases, while it had a negative impact
in many of the lightly/moderately contended cases (10 to 90 users).

-----------------------------------------------------
              |   10-90   |  100-1000   |  1100-2000
              |   users   |   users     |   users
-----------------------------------------------------
alltests      |   -3.37%  |  -10.64%    |   -2.25%
-----------------------------------------------------
all_utime     |   +0.33%  |   +3.73%    |   +3.33%
-----------------------------------------------------
compute       |   -5.97%  |   +2.34%    |   +3.22%
-----------------------------------------------------
custom        |  -31.61%  |  -10.29%    |  +15.23%
-----------------------------------------------------
disk          |  +24.64%  |  +28.96%    |  +21.28%
-----------------------------------------------------
fserver       |   -1.35%  |   +4.82%    |   +9.35%
-----------------------------------------------------
high_systime  |   -6.73%  |   -6.28%    |  +12.36%
-----------------------------------------------------
shared        |  -28.31%  |  -19.99%    |   -7.10%
-----------------------------------------------------
short         |  -44.63%  |  -37.48%    |  -33.62%
-----------------------------------------------------




* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-06 18:21   ` Jason Low
@ 2014-08-07 18:02     ` Yuyang Du
  2014-08-08  4:18       ` Jason Low
  0 siblings, 1 reply; 16+ messages in thread
From: Yuyang Du @ 2014-08-07 18:02 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton

On Wed, Aug 06, 2014 at 11:21:35AM -0700, Jason Low wrote:
> I ran these tests with most of the AIM7 workloads to compare
> performance between a 3.16 kernel and the kernel with these patches
> applied.
> 
> The table below contains the percent difference between the baseline
> kernel and the kernel with the patches at various user counts. A
> positive percent means the kernel with the patches performed better,
> while a negative percent means the baseline performed better.
> 
> Based on these numbers, for many of the workloads, the change was
> beneficial in the highly contended cases, while it had a negative impact
> in many of the lightly/moderately contended cases (10 to 90 users).
> 
> -----------------------------------------------------
>               |   10-90   |  100-1000   |  1100-2000
>               |   users   |   users     |   users
> -----------------------------------------------------
> alltests      |   -3.37%  |  -10.64%    |   -2.25%
> -----------------------------------------------------
> all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> -----------------------------------------------------
> compute       |   -5.97%  |   +2.34%    |   +3.22%
> -----------------------------------------------------
> custom        |  -31.61%  |  -10.29%    |  +15.23%
> -----------------------------------------------------
> disk          |  +24.64%  |  +28.96%    |  +21.28%
> -----------------------------------------------------
> fserver       |   -1.35%  |   +4.82%    |   +9.35%
> -----------------------------------------------------
> high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> -----------------------------------------------------
> shared        |  -28.31%  |  -19.99%    |   -7.10%
> -----------------------------------------------------
> short         |  -44.63%  |  -37.48%    |  -33.62%
> -----------------------------------------------------
> 
Thanks, Jason. Sorry for the late response.

What was the variation across the test runs? And which machine did you test on?

Yuyang


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-08  4:18       ` Jason Low
@ 2014-08-07 22:30         ` Yuyang Du
  2014-08-08  7:11           ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Yuyang Du @ 2014-08-07 22:30 UTC (permalink / raw)
  To: Jason Low
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran

On Thu, Aug 07, 2014 at 09:18:52PM -0700, Jason Low wrote:
> On Fri, 2014-08-08 at 02:02 +0800, Yuyang Du wrote:
> > On Wed, Aug 06, 2014 at 11:21:35AM -0700, Jason Low wrote:
> > > I ran these tests with most of the AIM7 workloads to compare
> > > performance between a 3.16 kernel and the kernel with these patches
> > > applied.
> > > 
> > > The table below contains the percent difference between the baseline
> > > kernel and the kernel with the patches at various user counts. A
> > > positive percent means the kernel with the patches performed better,
> > > while a negative percent means the baseline performed better.
> > > 
> > > Based on these numbers, for many of the workloads, the change was
> > > beneficial in the highly contended cases, while it had a negative impact
> > > in many of the lightly/moderately contended cases (10 to 90 users).
> > > 
> > > -----------------------------------------------------
> > >               |   10-90   |  100-1000   |  1100-2000
> > >               |   users   |   users     |   users
> > > -----------------------------------------------------
> > > alltests      |   -3.37%  |  -10.64%    |   -2.25%
> > > -----------------------------------------------------
> > > all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> > > -----------------------------------------------------
> > > compute       |   -5.97%  |   +2.34%    |   +3.22%
> > > -----------------------------------------------------
> > > custom        |  -31.61%  |  -10.29%    |  +15.23%
> > > -----------------------------------------------------
> > > disk          |  +24.64%  |  +28.96%    |  +21.28%
> > > -----------------------------------------------------
> > > fserver       |   -1.35%  |   +4.82%    |   +9.35%
> > > -----------------------------------------------------
> > > high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> > > -----------------------------------------------------
> > > shared        |  -28.31%  |  -19.99%    |   -7.10%
> > > -----------------------------------------------------
> > > short         |  -44.63%  |  -37.48%    |  -33.62%
> > > -----------------------------------------------------
> > > 
> > Thanks, Jason. Sorry for the late response.
> > 
> > What was the variation across the test runs? And which machine did you test on?
> 
> Hi Yuyang,
> 
> These tests were also done on an 8 socket machine (80 cores). In terms
> of variation between the average throughputs, typically the noise range
> is about 2% in many of the workloads.
> 

Thanks a lot, Jason.

So for this particular set of workloads on a big machine, I think the
result is mixed and overall "neutral", though I had expected the variation
could be bigger, especially for light workloads.

Any comments from the maintainers and others? Ping Peter and Ben; I haven't
heard from you on the 5th version.

Yuyang


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-08  7:11           ` Peter Zijlstra
@ 2014-08-07 23:15             ` Yuyang Du
  2014-08-08  0:02               ` Yuyang Du
  0 siblings, 1 reply; 16+ messages in thread
From: Yuyang Du @ 2014-08-07 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, linux-kernel, Ben Segall, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran

On Fri, Aug 08, 2014 at 09:11:03AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 08, 2014 at 06:30:08AM +0800, Yuyang Du wrote:
> > > > > -----------------------------------------------------
> > > > >               |   10-90   |  100-1000   |  1100-2000
> > > > >               |   users   |   users     |   users
> > > > > -----------------------------------------------------
> > > > > alltests      |   -3.37%  |  -10.64%    |   -2.25%
> > > > > -----------------------------------------------------
> > > > > all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> > > > > -----------------------------------------------------
> > > > > compute       |   -5.97%  |   +2.34%    |   +3.22%
> > > > > -----------------------------------------------------
> > > > > custom        |  -31.61%  |  -10.29%    |  +15.23%
> > > > > -----------------------------------------------------
> > > > > disk          |  +24.64%  |  +28.96%    |  +21.28%
> > > > > -----------------------------------------------------
> > > > > fserver       |   -1.35%  |   +4.82%    |   +9.35%
> > > > > -----------------------------------------------------
> > > > > high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> > > > > -----------------------------------------------------
> > > > > shared        |  -28.31%  |  -19.99%    |   -7.10%
> > > > > -----------------------------------------------------
> > > > > short         |  -44.63%  |  -37.48%    |  -33.62%
> > > > > -----------------------------------------------------
> 
> > Thanks a lot, Jason.
> > 
> > So for this particular set of workloads on a big machine, I think the
> > result is mixed and overall "neutral", though I had expected the variation
> > could be bigger, especially for light workloads.
> > 
> > Any comments from the maintainers and others? Ping Peter and Ben; I haven't
> > heard from you on the 5th version.
> 
> Been a bit busy.. but in general I worry about the performance decrease
> on the lighter loads. I should probably run some workloads on my 2
> socket and 4 socket machines, but the coming few weeks will be very busy
> and I'm afraid I'll not get around to it in a timely manner.

OK, I understand. For our part, Fengguang's LKP does not include light loads;
we also need some such tests to confirm this and see what comes next.

Since typical benchmarks tend to be heavy ones, what do you suggest for light loads?

Jason, could you share some of your workloads?

Thanks,
Yuyang


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-07 23:15             ` Yuyang Du
@ 2014-08-08  0:02               ` Yuyang Du
  0 siblings, 0 replies; 16+ messages in thread
From: Yuyang Du @ 2014-08-08  0:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Low, Ingo Molnar, linux-kernel, Ben Segall, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, fengguang.wu

On Fri, Aug 08, 2014 at 07:15:35AM +0800, Yuyang Du wrote:
> On Fri, Aug 08, 2014 at 09:11:03AM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 08, 2014 at 06:30:08AM +0800, Yuyang Du wrote:
> > > > > > -----------------------------------------------------
> > > > > >               |   10-90   |  100-1000   |  1100-2000
> > > > > >               |   users   |   users     |   users
> > > > > > -----------------------------------------------------
> > > > > > alltests      |   -3.37%  |  -10.64%    |   -2.25%
> > > > > > -----------------------------------------------------
> > > > > > all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> > > > > > -----------------------------------------------------
> > > > > > compute       |   -5.97%  |   +2.34%    |   +3.22%
> > > > > > -----------------------------------------------------
> > > > > > custom        |  -31.61%  |  -10.29%    |  +15.23%
> > > > > > -----------------------------------------------------
> > > > > > disk          |  +24.64%  |  +28.96%    |  +21.28%
> > > > > > -----------------------------------------------------
> > > > > > fserver       |   -1.35%  |   +4.82%    |   +9.35%
> > > > > > -----------------------------------------------------
> > > > > > high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> > > > > > -----------------------------------------------------
> > > > > > shared        |  -28.31%  |  -19.99%    |   -7.10%
> > > > > > -----------------------------------------------------
> > > > > > short         |  -44.63%  |  -37.48%    |  -33.62%
> > > > > > -----------------------------------------------------
> > 
> > > Thanks a lot, Jason.
> > > 
> > > So for this particular set of workloads on a big machine, I think the
> > > result is mixed and overall "neutral", though I had expected the variation
> > > could be bigger, especially for light workloads.
> > > 
> > > Any comments from the maintainers and others? Ping Peter and Ben; I haven't
> > > heard from you on the 5th version.
> > 
> > Been a bit busy.. but in general I worry about the performance decrease
> > on the lighter loads. I should probably run some workloads on my 2
> > socket and 4 socket machines, but the coming few weeks will be very busy
> > and I'm afraid I'll not get around to it in a timely manner.
> 
> OK, I understand. For our part, Fengguang's LKP does not include light loads;
> we also need some such tests to confirm this and see what comes next.
> 
> Since typical benchmarks tend to be heavy ones, what do you suggest for light loads?
> 
> Jason, could you share some of your workloads?
> 
Just heard from Fengguang: LKP has AIM7, which is not a private workload. We will
run it. Thanks, Jason.

+Fengguang


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-07 18:02     ` Yuyang Du
@ 2014-08-08  4:18       ` Jason Low
  2014-08-07 22:30         ` Yuyang Du
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Low @ 2014-08-08  4:18 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Ben Segall,
	Waiman Long, Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran

On Fri, 2014-08-08 at 02:02 +0800, Yuyang Du wrote:
> On Wed, Aug 06, 2014 at 11:21:35AM -0700, Jason Low wrote:
> > I ran these tests with most of the AIM7 workloads to compare
> > performance between a 3.16 kernel and the kernel with these patches
> > applied.
> > 
> > The table below contains the percent difference between the baseline
> > kernel and the kernel with the patches at various user counts. A
> > positive percent means the kernel with the patches performed better,
> > while a negative percent means the baseline performed better.
> > 
> > Based on these numbers, for many of the workloads, the change was
> > beneficial in the highly contended cases, while it had a negative impact
> > in many of the lightly/moderately contended cases (10 to 90 users).
> > 
> > -----------------------------------------------------
> >               |   10-90   |  100-1000   |  1100-2000
> >               |   users   |   users     |   users
> > -----------------------------------------------------
> > alltests      |   -3.37%  |  -10.64%    |   -2.25%
> > -----------------------------------------------------
> > all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> > -----------------------------------------------------
> > compute       |   -5.97%  |   +2.34%    |   +3.22%
> > -----------------------------------------------------
> > custom        |  -31.61%  |  -10.29%    |  +15.23%
> > -----------------------------------------------------
> > disk          |  +24.64%  |  +28.96%    |  +21.28%
> > -----------------------------------------------------
> > fserver       |   -1.35%  |   +4.82%    |   +9.35%
> > -----------------------------------------------------
> > high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> > -----------------------------------------------------
> > shared        |  -28.31%  |  -19.99%    |   -7.10%
> > -----------------------------------------------------
> > short         |  -44.63%  |  -37.48%    |  -33.62%
> > -----------------------------------------------------
> > 
> Thanks, Jason. Sorry for the late response.
> 
> What was the variation across the test runs? And which machine did you test on?

Hi Yuyang,

These tests were also done on an 8 socket machine (80 cores). In terms
of variation between the average throughputs, typically the noise range
is about 2% in many of the workloads.

Jason



* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-07 22:30         ` Yuyang Du
@ 2014-08-08  7:11           ` Peter Zijlstra
  2014-08-07 23:15             ` Yuyang Du
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2014-08-08  7:11 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Jason Low, Ingo Molnar, linux-kernel, Ben Segall, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran

On Fri, Aug 08, 2014 at 06:30:08AM +0800, Yuyang Du wrote:
> > > > -----------------------------------------------------
> > > >               |   10-90   |  100-1000   |  1100-2000
> > > >               |   users   |   users     |   users
> > > > -----------------------------------------------------
> > > > alltests      |   -3.37%  |  -10.64%    |   -2.25%
> > > > -----------------------------------------------------
> > > > all_utime     |   +0.33%  |   +3.73%    |   +3.33%
> > > > -----------------------------------------------------
> > > > compute       |   -5.97%  |   +2.34%    |   +3.22%
> > > > -----------------------------------------------------
> > > > custom        |  -31.61%  |  -10.29%    |  +15.23%
> > > > -----------------------------------------------------
> > > > disk          |  +24.64%  |  +28.96%    |  +21.28%
> > > > -----------------------------------------------------
> > > > fserver       |   -1.35%  |   +4.82%    |   +9.35%
> > > > -----------------------------------------------------
> > > > high_systime  |   -6.73%  |   -6.28%    |  +12.36%
> > > > -----------------------------------------------------
> > > > shared        |  -28.31%  |  -19.99%    |   -7.10%
> > > > -----------------------------------------------------
> > > > short         |  -44.63%  |  -37.48%    |  -33.62%
> > > > -----------------------------------------------------

> Thanks a lot, Jason.
> 
> So for this particular set of workloads on a big machine, I think the
> result is mixed and overall "neutral", though I had expected the variation
> could be bigger, especially for light workloads.
> 
> Any comments from the maintainers and others? Ping Peter and Ben; I haven't
> heard from you on the 5th version.

Been a bit busy.. but in general I worry about the performance decrease
on the lighter loads. I should probably run some workloads on my 2
socket and 4 socket machines, but the coming few weeks will be very busy
and I'm afraid I'll not get around to it in a timely manner.


* Re: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-04 20:52 ` bsegall
  2014-08-04 21:27   ` Jason Low
@ 2014-08-11 17:31   ` Jason Low
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Low @ 2014-08-11 17:31 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Waiman Long,
	Mel Gorman, Mike Galbraith, Rik van Riel,
	Aswin Chandramouleeswaran, Chegu Vinod, Scott J Norton, pjt,
	jason.low2

On Mon, 2014-08-04 at 13:52 -0700, bsegall@google.com wrote:
> 
> That said, it might be better to remove force_update for this function,
> or make it just reduce the minimum to /64 or something. If the test is
> easy to run, it would be good to see what it's like with the force_update
> param removed for this function, to see whether it's worth worrying about
> or whether the zero case catches ~all of the perf gain.

Hi Ben,

I removed the force update in __update_cfs_rq_tg_load_contrib and it
helped reduce overhead a lot more. I saw up to a 20x reduction in system
overhead from update_cfs_rq_blocked_load when running some of the AIM7
workloads with this change.

-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..7a6e18b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update)
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	long tg_contrib;
@@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
-	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
 	}
@@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update) {}
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
@@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		cfs_rq->last_decay = now;
 	}
 
-	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	__update_cfs_rq_tg_load_contrib(cfs_rq);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */




Thread overview: 16+ messages
2014-08-04 20:28 [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
2014-08-04 19:15 ` Yuyang Du
2014-08-04 21:42   ` Yuyang Du
2014-08-05 15:42   ` Jason Low
2014-08-06 18:21   ` Jason Low
2014-08-07 18:02     ` Yuyang Du
2014-08-08  4:18       ` Jason Low
2014-08-07 22:30         ` Yuyang Du
2014-08-08  7:11           ` Peter Zijlstra
2014-08-07 23:15             ` Yuyang Du
2014-08-08  0:02               ` Yuyang Du
2014-08-04 20:52 ` bsegall
2014-08-04 21:27   ` Jason Low
2014-08-11 17:31   ` Jason Low
2014-08-04 21:04 ` Peter Zijlstra
2014-08-05 17:53 ` Waiman Long
