* [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
@ 2014-08-26 23:11 Jason Low
  2014-08-26 23:24 ` Paul Turner
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Low @ 2014-08-26 23:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Ben Segall; +Cc: linux-kernel, Jason Low

Based on perf profiles, the update_cfs_rq_blocked_load function constantly
shows up as taking up a noticeable % of system run time. This is especially
apparent on larger numa systems.

Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
updating the tg load contribution stats. However, it was noticed that the
values often don't get modified by much. In fact, much of the time, they
don't get modified at all. However, the update can always get attempted due
to force_update.

In this patch, we remove the force_update in only the
__update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
modified only if the delta is large enough (in the current code, they get
updated when the delta is larger than 12.5%). This is a way to rate-limit
the updates while largely keeping the values accurate.

When testing this change with AIM7 workloads, we found that it was able to
reduce the overhead of the function by up to a factor of 20x.

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/sched/fair.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..7a6e18b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update)
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	long tg_contrib;
@@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
-	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
 	}
@@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update) {}
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
@@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		cfs_rq->last_decay = now;
 	}
 
-	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	__update_cfs_rq_tg_load_contrib(cfs_rq);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
-- 
1.7.1
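
To make the 12.5% rate limit concrete, here is a small standalone sketch
(plain user-space C with made-up load values, not kernel code) showing
when the check fires and when the shared-counter update is skipped:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Last value folded into tg->load_avg (made-up starting point). */
	long tg_load_contrib = 1024;
	long samples[] = { 1030, 1040, 1100, 1200, 900, 1152 };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		long delta = samples[i] - tg_load_contrib;

		/*
		 * Same check as the patch above: only publish when the
		 * delta exceeds 1/8 (12.5%) of the last published value.
		 */
		if (labs(delta) > tg_load_contrib / 8) {
			tg_load_contrib += delta;
			printf("load %ld: update, contrib now %ld\n",
			       samples[i], tg_load_contrib);
		} else {
			printf("load %ld: skipped (delta %ld)\n",
			       samples[i], delta);
		}
	}
	return 0;
}

With these numbers only the last three samples touch the shared counter;
the first three stay within the 12.5% slack.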





* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-26 23:11 [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load Jason Low
@ 2014-08-26 23:24 ` Paul Turner
  2014-08-27 17:34   ` Jason Low
  2014-08-27 19:18   ` [PATCH RESULT] sched: Rewrite per entity runnable load average tracking v5 Yuyang Du
  0 siblings, 2 replies; 10+ messages in thread
From: Paul Turner @ 2014-08-26 23:24 UTC (permalink / raw)
  To: Jason Low; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On Tue, Aug 26, 2014 at 4:11 PM, Jason Low <jason.low2@hp.com> wrote:
> Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> shows up as taking up a noticeable % of system run time. This is especially
> apparent on larger numa systems.
>
> Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> updating the tg load contribution stats. However, it was noticed that the
> values often don't get modified by much. In fact, much of the time, they
> don't get modified at all. However, the update can always get attempted due
> to force_update.
>
> In this patch, we remove the force_update in only the
> __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> modified only if the delta is large enough (in the current code, they get
> updated when the delta is larger than 12.5%). This is a way to rate-limit
> the updates while largely keeping the values accurate.
>
> When testing this change with AIM7 workloads, we found that it was able to
> reduce the overhead of the function by up to a factor of 20x.

Looks reasonable.

>
> Cc: Yuyang Du <yuyang.du@intel.com>
> Cc: Waiman Long <Waiman.Long@hp.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> Cc: Chegu Vinod <chegu_vinod@hp.com>
> Cc: Scott J Norton <scott.norton@hp.com>
> Signed-off-by: Jason Low <jason.low2@hp.com>
> ---
>  kernel/sched/fair.c |   10 ++++------
>  1 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..7a6e18b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                int force_update)
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
>  {
>         struct task_group *tg = cfs_rq->tg;
>         long tg_contrib;
> @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>         tg_contrib -= cfs_rq->tg_load_contrib;
>
> -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {

Another option with slightly higher accuracy would be to increase the
sensitivity here when force_update == 1.

E.g.:
    abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...

Alternatively we could bound total inaccuracy:
   int divisor = force_update ? NR_CPUS : 8;
   if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...


[ And probably rename force_update to want_update ]
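
For concreteness, a sketch of how the first option could look once folded
back into the function (the want_update rename is included; illustrative
only, not a posted patch):

static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
						    int want_update)
{
	struct task_group *tg = cfs_rq->tg;
	long tg_contrib;

	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
	tg_contrib -= cfs_rq->tg_load_contrib;

	/*
	 * Halve the threshold (12.5% -> 6.25%) when the caller asks for
	 * an update, instead of bypassing the check entirely.
	 */
	if (abs(tg_contrib) >
	    cfs_rq->tg_load_contrib / (8 * (1 + want_update))) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}
}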


> +       if (abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>                 atomic_long_add(tg_contrib, &tg->load_avg);
>                 cfs_rq->tg_load_contrib += tg_contrib;
>         }
> @@ -2436,8 +2435,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>         __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>  }
>  #else /* CONFIG_FAIR_GROUP_SCHED */
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                int force_update) {}
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq) {}
>  static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>                                                   struct cfs_rq *cfs_rq) {}
>  static inline void __update_group_entity_contrib(struct sched_entity *se) {}
> @@ -2537,7 +2535,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>                 cfs_rq->last_decay = now;
>         }
>
> -       __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
> +       __update_cfs_rq_tg_load_contrib(cfs_rq);
>  }
>
>  /* Add the load generated by se into cfs_rq's child load-average */
> --
> 1.7.1
>
>
>


* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-26 23:24 ` Paul Turner
@ 2014-08-27 17:34   ` Jason Low
  2014-08-27 23:32     ` Tim Chen
  2014-08-27 19:18   ` [PATCH RESULT] sched: Rewrite per entity runnable load average tracking v5 Yuyang Du
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Low @ 2014-08-27 17:34 UTC (permalink / raw)
  To: Paul Turner; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> On Tue, Aug 26, 2014 at 4:11 PM, Jason Low <jason.low2@hp.com> wrote:
> > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > shows up as taking up a noticeable % of system run time. This is especially
> > apparent on larger numa systems.
> >
> > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > updating the tg load contribution stats. However, it was noticed that the
> > values often don't get modified by much. In fact, much of the time, they
> > don't get modified at all. However, the update can always get attempted due
> > to force_update.
> >
> > In this patch, we remove the force_update in only the
> > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > modified only if the delta is large enough (in the current code, they get
> > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > the updates while largely keeping the values accurate.
> >
> > When testing this change with AIM7 workloads, we found that it was able to
> > reduce the overhead of the function by up to a factor of 20x.
> 
> Looks reasonable.
> 
> >
> > Cc: Yuyang Du <yuyang.du@intel.com>
> > Cc: Waiman Long <Waiman.Long@hp.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > Cc: Chegu Vinod <chegu_vinod@hp.com>
> > Cc: Scott J Norton <scott.norton@hp.com>
> > Signed-off-by: Jason Low <jason.low2@hp.com>
> > ---
> >  kernel/sched/fair.c |   10 ++++------
> >  1 files changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fea7d33..7a6e18b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> >  }
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > -                                                int force_update)
> > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> >  {
> >         struct task_group *tg = cfs_rq->tg;
> >         long tg_contrib;
> > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> >         tg_contrib -= cfs_rq->tg_load_contrib;
> >
> > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> 
> Another option with slightly higher accuracy would be to increase the
> sensitivity here when force_update == 1.
> 
> E.g.:
>     abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
> 
> Alternatively we could bound total inaccuracy:
>    int divisor = force_update ? NR_CPUS : 8;
>    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
> 
> 
> [ And probably rename force_update to want_update ]

Of the two additional options, I think the first one is better. The
other option of using NR_CPUS looks like we're increasing the update
rate as the system gets larger, and it's the larger systems that are
typically more affected by the contention.

Do you prefer either of the two other options over this v2 patch? If so, I
can test and send out a new patch; otherwise, I'll keep the current v2
patch.

Thanks.



* [PATCH RESULT] sched: Rewrite per entity runnable load average tracking v5
  2014-08-26 23:24 ` Paul Turner
  2014-08-27 17:34   ` Jason Low
@ 2014-08-27 19:18   ` Yuyang Du
  1 sibling, 0 replies; 10+ messages in thread
From: Yuyang Du @ 2014-08-27 19:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Paul Turner, Ben Segall, Jason Low
  Cc: LKML, fengguang.wu, arjan.van.de.ven, len.brown, rafael.j.wysocki

Hi all,

We finished a round of tests for AIM7 workload for the patch v5 by LKP, the
result is as follows:

Hostname: brickland3
Model: Brickland Ivy Bridge-E
CPU: 120
Memory: 512G

      v3.16-rc7         rewrite-v5          testbox/testcase/testparams
---------------  -------------------------  ---------------------------
     %stddev        %change               %stddev
            \          |                 /
     26317 ± 3%     +34.7%      35437 ± 1%  brickland3/aim7/2000-fork_test
     43481 ± 0%      +0.5%      43712 ± 0%  brickland3/aim7/3000-brk_test
     80810 ± 0%      -0.2%      80625 ± 1%  brickland3/aim7/3000-sieve
    489830 ± 1%     +14.7%     561691 ± 1%  brickland3/aim7/8000-disk_src
   1672643 ±12%     -13.0%    1455960 ± 6%  brickland3/aim7/8000-disk_wrt
   1193986 ± 0%      -1.5%    1175841 ± 0%  brickland3/aim7/8000-mem_rtns_1
    875113 ±13%     +21.9%    1067191 ±13%  brickland3/aim7/8000-misc_rtns_1
   1188420 ± 1%      -3.2%    1150624 ± 0%  brickland3/aim7/8000-sort_rtns_1
    807986 ± 0%      -0.7%     802275 ± 0%  brickland3/aim7/8000-string_rtns
   6378589 ± 5%      -0.1%    6373359 ± 4%  TOTAL aim7.jobs-per-min

---------------  -------------------------  
       339 ± 7%     -22.9%        262 ± 0%  brickland3/aim7/2000-fork_test
       463 ± 0%      +0.4%        465 ± 0%  brickland3/aim7/3000-brk_test
       337 ± 9%      -3.1%        327 ± 6%  brickland3/aim7/3000-disk_rd
       534 ± 0%      +0.9%        539 ± 0%  brickland3/aim7/3000-sieve
       391 ± 0%     -10.7%        349 ± 1%  brickland3/aim7/8000-disk_src
       301 ± 4%     +10.0%        331 ± 5%  brickland3/aim7/8000-disk_wrt
       405 ± 1%      +6.0%        429 ± 0%  brickland3/aim7/8000-mem_rtns_1
       227 ±25%      -0.8%        225 ±12%  brickland3/aim7/8000-misc_rtns_1
       395 ± 1%      +5.1%        415 ± 0%  brickland3/aim7/8000-sort_rtns_1
       419 ± 0%      +4.6%        438 ± 0%  brickland3/aim7/8000-string_rtns
      3815 ± 3%      -0.8%       3784 ± 2%  TOTAL turbostat.Pkg_W

---------------  -------------------------  
     90.79 ± 1%     -15.1%      77.04 ± 0%  brickland3/aim7/2000-fork_test
     62.42 ± 0%      -0.0%      62.41 ± 0%  brickland3/aim7/3000-brk_test
     61.69 ± 2%      -1.3%      60.92 ± 2%  brickland3/aim7/3000-disk_rd
    137.13 ± 0%      +0.2%     137.39 ± 0%  brickland3/aim7/3000-sieve
     66.03 ± 1%      +0.6%      66.41 ± 0%  brickland3/aim7/8000-disk_src
     53.94 ± 2%      +6.1%      57.21 ± 0%  brickland3/aim7/8000-disk_wrt
     54.43 ± 0%      +4.8%      57.02 ± 0%  brickland3/aim7/8000-mem_rtns_1
     58.71 ± 2%      +0.0%      58.72 ± 1%  brickland3/aim7/8000-misc_rtns_1
     53.78 ± 1%      +4.8%      56.36 ± 0%  brickland3/aim7/8000-sort_rtns_1
     56.47 ± 2%      +2.5%      57.89 ± 0%  brickland3/aim7/8000-string_rtns
    695.38 ± 1%      -0.6%     691.37 ± 1%  TOTAL turbostat.RAM_W

---------------  -------------------------  
     23.40 ±22%     -16.2%      19.60 ± 1%  brickland3/aim7/2000-fork_test
     97.28 ± 0%      -0.0%      97.25 ± 0%  brickland3/aim7/3000-brk_test
     35.09 ±24%      -8.7%      32.05 ±21%  brickland3/aim7/3000-disk_rd
     94.92 ± 0%      -0.0%      94.88 ± 0%  brickland3/aim7/3000-sieve
     41.04 ± 2%     -26.8%      30.06 ± 3%  brickland3/aim7/8000-disk_src
     36.54 ±11%     -10.2%      32.82 ±12%  brickland3/aim7/8000-disk_wrt
     61.39 ± 2%      -7.4%      56.87 ± 1%  brickland3/aim7/8000-mem_rtns_1
     62.57 ± 4%     -11.1%      55.64 ± 3%  brickland3/aim7/8000-sort_rtns_1
     76.06 ± 2%      -3.4%      73.51 ± 0%  brickland3/aim7/8000-string_rtns
    528.30 ± 4%      -6.7%     492.67 ± 3%  TOTAL turbostat.%c0

For performance, I think the overall result is flat. For power, we can see some
benefits. In particular, the stddev indicates the variation is smaller than
with v3.16-rc7 (all subtests were run at least 5 times).

To Jason: the format and specific metrics are not directly comparable with your
tests. If you can specify how to reproduce the same view as your tests, we are
happy to try.

Thanks,
Yuyang


* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-27 17:34   ` Jason Low
@ 2014-08-27 23:32     ` Tim Chen
  2014-08-28 19:46       ` Jason Low
  0 siblings, 1 reply; 10+ messages in thread
From: Tim Chen @ 2014-08-27 23:32 UTC (permalink / raw)
  To: Jason Low; +Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On Wed, 2014-08-27 at 10:34 -0700, Jason Low wrote:
> On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> > On Tue, Aug 26, 2014 at 4:11 PM, Jason Low <jason.low2@hp.com> wrote:
> > > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > > shows up as taking up a noticeable % of system run time. This is especially
> > > apparent on larger numa systems.
> > >
> > > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > > updating the tg load contribution stats. However, it was noticed that the
> > > values often don't get modified by much. In fact, much of the time, they
> > > don't get modified at all. However, the update can always get attempted due
> > > to force_update.
> > >
> > > In this patch, we remove the force_update in only the
> > > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > > modified only if the delta is large enough (in the current code, they get
> > > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > > the updates while largely keeping the values accurate.
> > >
> > > When testing this change with AIM7 workloads, we found that it was able to
> > > reduce the overhead of the function by up to a factor of 20x.
> > 
> > Looks reasonable.
> > 
> > >
> > > Cc: Yuyang Du <yuyang.du@intel.com>
> > > Cc: Waiman Long <Waiman.Long@hp.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > > Cc: Chegu Vinod <chegu_vinod@hp.com>
> > > Cc: Scott J Norton <scott.norton@hp.com>
> > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > ---
> > >  kernel/sched/fair.c |   10 ++++------
> > >  1 files changed, 4 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index fea7d33..7a6e18b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> > >  }
> > >
> > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > -                                                int force_update)
> > > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> > >  {
> > >         struct task_group *tg = cfs_rq->tg;
> > >         long tg_contrib;
> > > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> > >         tg_contrib -= cfs_rq->tg_load_contrib;
> > >
> > > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> > 
> > Another option with slightly higher accuracy would be to increase the
> > sensitivity here when force_update == 1.
> > 
> > E.g.:
> >     abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
> > 
> > Alternatively we could bound total inaccuracy:
> >    int divisor = force_update ? NR_CPUS : 8;
> >    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
> > 
> > 
> > [ And probably rename force_update to want_update ]
> 
> Out of the 2 additional options, I think the first one is better. The
> other option of using NR_CPUS looks like we're increasing the update
> rate as the system gets larger, and its the larger systems that are
> typically more affected by the contention.

Probably num_present_cpus() will be better than NR_CPUS, which can
be much larger than the actual cpus present.
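
A sketch of that variant, reusing the divisor form from the second option
(illustrative only):

	/*
	 * Bound the forced-update threshold by the CPUs that can
	 * actually contribute outstanding delta, rather than NR_CPUS.
	 */
	int divisor = force_update ? num_present_cpus() : 8;

	if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}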

> 
> Do you prefer either of the 2 other options over this v2 patch? If so, I
> can test and send out a new patch, otherwise, I'll keep this current v2
> patch.

If there are multiple non-forced updates, option 1's error seems to
accumulate and be unbounded, since we do not actually update.
Is this a concern?

Thanks.

Tim



* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-27 23:32     ` Tim Chen
@ 2014-08-28 19:46       ` Jason Low
  2014-09-01 12:55         ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Low @ 2014-08-28 19:46 UTC (permalink / raw)
  To: Tim Chen; +Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Ben Segall, LKML

On Wed, 2014-08-27 at 16:32 -0700, Tim Chen wrote:
> On Wed, 2014-08-27 at 10:34 -0700, Jason Low wrote:
> > On Tue, 2014-08-26 at 16:24 -0700, Paul Turner wrote:
> > > On Tue, Aug 26, 2014 at 4:11 PM, Jason Low <jason.low2@hp.com> wrote:
> > > > Based on perf profiles, the update_cfs_rq_blocked_load function constantly
> > > > shows up as taking up a noticeable % of system run time. This is especially
> > > > apparent on larger numa systems.
> > > >
> > > > Much of the contention is in __update_cfs_rq_tg_load_contrib when we're
> > > > updating the tg load contribution stats. However, it was noticed that the
> > > > values often don't get modified by much. In fact, much of the time, they
> > > > don't get modified at all. However, the update can always get attempted due
> > > > to force_update.
> > > >
> > > > In this patch, we remove the force_update in only the
> > > > __update_cfs_rq_tg_load_contrib. Thus the tg load contrib stats now get
> > > > modified only if the delta is large enough (in the current code, they get
> > > > updated when the delta is larger than 12.5%). This is a way to rate-limit
> > > > the updates while largely keeping the values accurate.
> > > >
> > > > When testing this change with AIM7 workloads, we found that it was able to
> > > > reduce the overhead of the function by up to a factor of 20x.
> > > 
> > > Looks reasonable.
> > > 
> > > >
> > > > Cc: Yuyang Du <yuyang.du@intel.com>
> > > > Cc: Waiman Long <Waiman.Long@hp.com>
> > > > Cc: Mel Gorman <mgorman@suse.de>
> > > > Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> > > > Cc: Rik van Riel <riel@redhat.com>
> > > > Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> > > > Cc: Chegu Vinod <chegu_vinod@hp.com>
> > > > Cc: Scott J Norton <scott.norton@hp.com>
> > > > Signed-off-by: Jason Low <jason.low2@hp.com>
> > > > ---
> > > >  kernel/sched/fair.c |   10 ++++------
> > > >  1 files changed, 4 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index fea7d33..7a6e18b 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -2352,8 +2352,7 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> > > >  }
> > > >
> > > >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > > -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > > -                                                int force_update)
> > > > +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq)
> > > >  {
> > > >         struct task_group *tg = cfs_rq->tg;
> > > >         long tg_contrib;
> > > > @@ -2361,7 +2360,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> > > >         tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> > > >         tg_contrib -= cfs_rq->tg_load_contrib;
> > > >
> > > > -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> > > 
> > > Another option with slightly higher accuracy would be to increase the
> > > sensitivity here when force_update == 1.
> > > 
> > > E.g.:
> > >     abs(tg_contrib) > cfs_rq->tg_load_contrib / (8 * (1 + force_update))) { ...
> > > 
> > > Alternatively we could bound total inaccuracy:
> > >    int divisor = force_update ? NR_CPUS : 8;
> > >    if (abs(tg_contrib) > cfs_rq->tg_load_contrib / divisor) { ...
> > > 
> > > 
> > > [ And probably rename force_update to want_update ]
> > 
> > Out of the 2 additional options, I think the first one is better. The
> > other option of using NR_CPUS looks like we're increasing the update
> > rate as the system gets larger, and its the larger systems that are
> > typically more affected by the contention.
> 
> Probably num_present_cpus() will be better than NR_CPUS, which can
> be much larger than the actual cpus present.

Yeah, num_present_cpus(), though the same issue would still be there.

> > 
> > Do you prefer either of the 2 other options over this v2 patch? If so, I
> > can test and send out a new patch, otherwise, I'll keep this current v2
> > patch.
> 
> If there are multiple non-forced updates, option 1's error seems to
> accumulate and non-bounded as we do not actually update?  
> Is this a concern?

It should be fine. Once the delta is large enough, we will end up doing
the update anyway.

Thanks.



* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-08-28 19:46       ` Jason Low
@ 2014-09-01 12:55         ` Peter Zijlstra
  2014-09-02  7:41           ` Jason Low
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2014-09-01 12:55 UTC (permalink / raw)
  To: Jason Low; +Cc: Tim Chen, Paul Turner, Ingo Molnar, Ben Segall, LKML

On Thu, Aug 28, 2014 at 12:46:36PM -0700, Jason Low wrote:
> On Wed, 2014-08-27 at 16:32 -0700, Tim Chen wrote:

> > If there are multiple non-forced updates, option 1's error seems to
> > accumulate and non-bounded as we do not actually update?  
> > Is this a concern?
> 
> It should be fine. Once the delta is large enough, we will end up doing
> the update anyway.

Well, the thing is you can have nr_cpus * 12.5% of outstanding delta;
that might be a lot, esp on the large machines.

Now there are two problems with all this. The first is the relative
threshold; typically such per-cpu things have a fixed update threshold,
which makes it much easier to qualify the actual error.

Secondly, there is indeed the nr_cpus factor in the error bound. Some things,
like the proportion code, scale the threshold by log2(nr_cpus) in an attempt
to do something sensible there.

But yes, unbounded errors here are a problem, sure relaxing the updates
makes things go fast, they also make things go skew.
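
For illustration, a fixed threshold would look roughly like this (the slack
constant is invented here, loosely tied to NICE_0_LOAD; a sketch, not a
proposal):

#define TG_CONTRIB_SLACK	(NICE_0_LOAD / 8)

	/*
	 * With a fixed per-cfs_rq slack, the worst-case skew on
	 * tg->load_avg is bounded by nr_cpus * TG_CONTRIB_SLACK,
	 * independent of the group's current load, which is much
	 * easier to qualify than a relative 12.5% bound.
	 */
	if (force_update || abs(tg_contrib) > TG_CONTRIB_SLACK) {
		atomic_long_add(tg_contrib, &tg->load_avg);
		cfs_rq->tg_load_contrib += tg_contrib;
	}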


* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-09-01 12:55         ` Peter Zijlstra
@ 2014-09-02  7:41           ` Jason Low
  2014-09-03 11:32             ` Peter Zijlstra
  2014-09-09 14:52             ` [tip:sched/core] sched: Reduce contention in update_cfs_rq_blocked_load() tip-bot for Jason Low
  0 siblings, 2 replies; 10+ messages in thread
From: Jason Low @ 2014-09-02  7:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Paul Turner, Ingo Molnar, Ben Segall, LKML, jason.low2

On Mon, 2014-09-01 at 14:55 +0200, Peter Zijlstra wrote:
> But yes, unbounded errors here are a problem, sure relaxing the updates
> makes things go fast, they also make things go skew.

Okay. In that case, would you like to take our original patch which
avoids unnecessary updates?

-----
Subject: [PATCH] sched: Reduce contention in update_cfs_rq_blocked_load

When running workloads on 2+ socket systems, based on perf profiles, the
update_cfs_rq_blocked_load function often shows up as taking up a
noticeable % of run time.

Much of the contention is in __update_cfs_rq_tg_load_contrib when we
update the tg load contribution stats.  However, it turns out that in many
cases, they don't need to be updated and "tg_contrib" is 0.

This patch adds a check in __update_cfs_rq_tg_load_contrib to skip updating
tg load contribution stats when nothing needs to be updated. This reduces the
cacheline contention that would be unnecessary.

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Scott J Norton <scott.norton@hp.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3427a8..45e346c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2382,6 +2382,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
+	if (!tg_contrib)
+		return;
+
 	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
-- 
1.7.1





* Re: [PATCH v2] sched: Reduce contention in update_cfs_rq_blocked_load
  2014-09-02  7:41           ` Jason Low
@ 2014-09-03 11:32             ` Peter Zijlstra
  2014-09-09 14:52             ` [tip:sched/core] sched: Reduce contention in update_cfs_rq_blocked_load() tip-bot for Jason Low
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2014-09-03 11:32 UTC (permalink / raw)
  To: Jason Low; +Cc: Tim Chen, Paul Turner, Ingo Molnar, Ben Segall, LKML

On Tue, Sep 02, 2014 at 12:41:24AM -0700, Jason Low wrote:
> On Mon, 2014-09-01 at 14:55 +0200, Peter Zijlstra wrote:
> > But yes, unbounded errors here are a problem, sure relaxing the updates
> > makes things go fast, they also make things go skew.
> 
> Okay. In that case, would you like to take our original patch which
> avoids unnecessary updates?

This seems a safe patch; yes I'll queue it while we think of smarter
things :-)


* [tip:sched/core] sched: Reduce contention in update_cfs_rq_blocked_load()
  2014-09-02  7:41           ` Jason Low
  2014-09-03 11:32             ` Peter Zijlstra
@ 2014-09-09 14:52             ` tip-bot for Jason Low
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot for Jason Low @ 2014-09-09 14:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, torvalds, pjt, peterz,
	tim.c.chen, yuyang.du, jason.low2, Waiman.Long, chegu_vinod,
	tglx, scott.norton, aswin

Commit-ID:  8236d907ab3411ad452280faa8b26c1347327380
Gitweb:     http://git.kernel.org/tip/8236d907ab3411ad452280faa8b26c1347327380
Author:     Jason Low <jason.low2@hp.com>
AuthorDate: Tue, 2 Sep 2014 00:41:24 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 9 Sep 2014 06:47:29 +0200

sched: Reduce contention in update_cfs_rq_blocked_load()

When running workloads on 2+ socket systems, based on perf profiles, the
update_cfs_rq_blocked_load() function often shows up as taking up a
noticeable % of run time.

Much of the contention is in __update_cfs_rq_tg_load_contrib() when we
update the tg load contribution stats.  However, it turns out that in many
cases, they don't need to be updated and "tg_contrib" is 0.

This patch adds a check in __update_cfs_rq_tg_load_contrib() to skip updating
tg load contribution stats when nothing needs to be updated. This reduces the
cacheline contention that would be unnecessary.

Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: jason.low2@hp.com
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409643684.19197.15.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50d2025..be9e97b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2382,6 +2382,9 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
+	if (!tg_contrib)
+		return;
+
 	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
 		atomic_long_add(tg_contrib, &tg->load_avg);
 		cfs_rq->tg_load_contrib += tg_contrib;
