Re: [PATCH v2 1/2] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

From: Wanpeng Li <kernellwp@gmail.com>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	stable@vger.kernel.org,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: Re: [PATCH v2 1/2] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
Date: Wed, 8 Mar 2017 16:47:40 +0800
Message-ID: <CANRm+CxXov_A+ufuQNOmTkfm8vkw2Gogb=J7iN07O=ukokGWug@mail.gmail.com>
In-Reply-To: <20170217120731.11868-2-matt@codeblueprint.co.uk>

2017-02-17 20:07 GMT+08:00 Matt Fleming <matt@codeblueprint.co.uk>:
> If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
> the pending sample window time on exit, setting the next update not
> one window into the future, but two.
>
> This situation on exiting NO_HZ is described by:
>
>   this_rq->calc_load_update < jiffies < calc_load_update
>
> In this scenario, what we should be doing is:
>
>   this_rq->calc_load_update = calc_load_update               [ next window ]
>
> But what we actually do is:
>
>   this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
>
> This has the effect of delaying load average updates for potentially
> up to ~9 seconds.
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
>
> This issue is easy to reproduce before,
>
>   commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
>
> just by forking short-lived process pipelines built from ps(1) and
> grep(1) in a loop. I'm unable to reproduce the spikes after that
> commit, but the bug still seems to be present from code review.
>
> Fixes: 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: <stable@vger.kernel.org> # v3.5+
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>

Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>

> ---
>  kernel/sched/loadavg.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Changes in v2:
>
>  - Folded in Peter's suggestion for how to fix this.
>
>  - Tried to clarify the changelog based on feedback from Peter and
>    Frederic
>
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index a2d6eb71f06b..ec91fcc09bfe 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -201,8 +201,9 @@ void calc_load_exit_idle(void)
>         struct rq *this_rq = this_rq();
>
>         /*
> -        * If we're still before the sample window, we're done.
> +        * If we're still before the pending sample window, we're done.
>          */
> +       this_rq->calc_load_update = calc_load_update;
>         if (time_before(jiffies, this_rq->calc_load_update))
>                 return;
>
> @@ -211,7 +212,6 @@ void calc_load_exit_idle(void)
>          * accounted through the nohz accounting, so skip the entire deal and
>          * sync up for the next window.
>          */
> -       this_rq->calc_load_update = calc_load_update;
>         if (time_before(jiffies, this_rq->calc_load_update + 10))
>                 this_rq->calc_load_update += LOAD_FREQ;
>  }
> --
> 2.10.0
>
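
For illustration, the window arithmetic described in the changelog can be
sketched in userspace. This is not kernel code: the function names, the
plain-integer stand-ins for jiffies, this_rq->calc_load_update and
calc_load_update, and the choice of HZ = 1000 are all assumptions made for
the example. Only the ordering of the assignment relative to the first
time_before() check mirrors the diff above.

```c
#include <assert.h>

/* Assumption for this sketch: HZ = 1000. LOAD_FREQ is roughly 5 seconds. */
#define SKETCH_HZ        1000UL
#define SKETCH_LOAD_FREQ (5UL * SKETCH_HZ + 1UL)

/* Wraparound-safe "a is before b", in the spirit of the kernel's time_before(). */
#define sketch_time_before(a, b) ((long)((a) - (b)) < 0)

/*
 * Old ordering: check against the stale per-rq value first, then sync
 * with the global window. Returns the resulting per-rq next update time.
 */
static unsigned long exit_idle_old(unsigned long now,
                                   unsigned long rq_update,
                                   unsigned long global_update)
{
	if (sketch_time_before(now, rq_update))
		return rq_update;

	rq_update = global_update;
	if (sketch_time_before(now, rq_update + 10))
		rq_update += SKETCH_LOAD_FREQ;
	return rq_update;
}

/*
 * New ordering: sync with the global window first, so the early return
 * compares against the pending window rather than the stale one.
 */
static unsigned long exit_idle_new(unsigned long now,
                                   unsigned long rq_update,
                                   unsigned long global_update)
{
	rq_update = global_update;
	if (sketch_time_before(now, rq_update))
		return rq_update;

	if (sketch_time_before(now, rq_update + 10))
		rq_update += SKETCH_LOAD_FREQ;
	return rq_update;
}

/* The changelog's case: rq_update < now < global_update. */
static void demo_missed_window(void)
{
	unsigned long global_update = 10000UL;
	unsigned long rq_update = global_update - SKETCH_LOAD_FREQ; /* stale */
	unsigned long now = 6000UL;

	/* Old code skips a window; new code lands on the pending one. */
	assert(exit_idle_old(now, rq_update, global_update) ==
	       global_update + SKETCH_LOAD_FREQ);
	assert(exit_idle_new(now, rq_update, global_update) == global_update);
}
```

Calling demo_missed_window() exercises the "crossed a sample window while
in NO_HZ" case. When the CPU wakes before its own window, or inside the
10-jiffy fold window, both orderings in this sketch return the same value,
so the change only affects the missed-window case.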