* [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
@ 2017-02-08 13:29 Matt Fleming
2017-02-08 16:04 ` kbuild test robot
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Matt Fleming @ 2017-02-08 13:29 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, Matt Fleming, Mike Galbraith, Morten Rasmussen,
stable, Vincent Guittot
The calculation for the next sample window when exiting NO_HZ idle
does not handle the fact that we may not have reached the next sample
window yet, i.e. that we came out of idle between sample windows.
If we wake from NO_HZ idle after the pending this_rq->calc_load_update
window time when we went idle, but before the next sample window, we
will add an unnecessary LOAD_FREQ delay to the load average
accounting, delaying any update for potentially ~9 seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zeroed when the idle load was folded in calc_global_load().
This issue is easy to reproduce before
commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Fixes: 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v3.5+
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
kernel/sched/loadavg.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a2d6eb71f06b..a7a6f3646970 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -199,6 +199,7 @@ void calc_load_enter_idle(void)
 void calc_load_exit_idle(void)
 {
 	struct rq *this_rq = this_rq();
+	unsigned long next_window;
 
 	/*
 	 * If we're still before the sample window, we're done.
@@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
 	 * We woke inside or after the sample window, this means we're already
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
+	 *
+	 * The next window is 'calc_load_update' if we haven't reached it yet,
+	 * and 'calc_load_update + 10' if we're inside the current window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
-	if (time_before(jiffies, this_rq->calc_load_update + 10))
-		this_rq->calc_load_update += LOAD_FREQ;
+	next_window = calc_load_update;
+
+	if (time_in_range_open(jiffies, next_window, next_window + 10)
+		next_window += LOAD_FREQ;
+
+	this_rq->calc_load_update = next_window;
 }
 
 static long calc_load_fold_idle(void)
--
2.10.0
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
@ 2017-02-08 16:04 ` kbuild test robot
2017-02-08 16:07 ` kbuild test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: kbuild test robot @ 2017-02-08 16:04 UTC (permalink / raw)
To: Matt Fleming
Cc: kbuild-all, Peter Zijlstra, Ingo Molnar, linux-kernel,
Matt Fleming, Mike Galbraith, Morten Rasmussen, stable,
Vincent Guittot
[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]
Hi Matt,
[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.10-rc7 next-20170208]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Matt-Fleming/sched-loadavg-Avoid-loadavg-spikes-caused-by-delayed-NO_HZ-accounting/20170208-224135
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 6.2.0
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=c6x
All errors (new ones prefixed by >>):
kernel/sched/loadavg.c: In function 'calc_load_exit_idle':
>> kernel/sched/loadavg.c:221:3: error: expected ')' before 'next_window'
next_window += LOAD_FREQ;
^~~~~~~~~~~
>> kernel/sched/loadavg.c:224:1: error: expected expression before '}' token
}
^
vim +221 kernel/sched/loadavg.c
215 * The next window is 'calc_load_update' if we haven't reached it yet,
216 * and 'calc_load_update + 10' if we're inside the current window.
217 */
218 next_window = calc_load_update;
219
220 if (time_in_range_open(jiffies, next_window, next_window + 10)
> 221 next_window += LOAD_FREQ;
222
223 this_rq->calc_load_update = next_window;
> 224 }
225
226 static long calc_load_fold_idle(void)
227 {
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 5211 bytes --]
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
2017-02-08 16:04 ` kbuild test robot
@ 2017-02-08 16:07 ` kbuild test robot
2017-02-09 15:07 ` Peter Zijlstra
2017-02-15 15:12 ` Frederic Weisbecker
3 siblings, 0 replies; 8+ messages in thread
From: kbuild test robot @ 2017-02-08 16:07 UTC (permalink / raw)
To: Matt Fleming
Cc: kbuild-all, Peter Zijlstra, Ingo Molnar, linux-kernel,
Matt Fleming, Mike Galbraith, Morten Rasmussen, stable,
Vincent Guittot
[-- Attachment #1: Type: text/plain, Size: 19371 bytes --]
Hi Matt,
[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.10-rc7 next-20170208]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Matt-Fleming/sched-loadavg-Avoid-loadavg-spikes-caused-by-delayed-NO_HZ-accounting/20170208-224135
config: i386-randconfig-x003-201706 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All error/warnings (new ones prefixed by >>):
kernel/sched/loadavg.c: In function 'calc_load_exit_idle':
>> kernel/sched/loadavg.c:404:0: error: unterminated argument list invoking macro "if"
}
>> kernel/sched/loadavg.c:220:2: error: expected '(' at end of input
if (time_in_range_open(jiffies, next_window, next_window + 10)
^~
>> kernel/sched/loadavg.c:220:2: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
kernel/sched/loadavg.c:404:0: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
}
>> kernel/sched/loadavg.c:220:2: error: expected declaration or statement at end of input
if (time_in_range_open(jiffies, next_window, next_window + 10)
^~
At top level:
kernel/sched/loadavg.c:100:1: warning: 'calc_load' defined but not used [-Wunused-function]
calc_load(unsigned long load, unsigned long exp, unsigned long active)
^~~~~~~~~
vim +/if +404 kernel/sched/loadavg.c
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 214 *
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 215 * The next window is 'calc_load_update' if we haven't reached it yet,
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 216 * and 'calc_load_update + 10' if we're inside the current window.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 217 */
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 218 next_window = calc_load_update;
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 219
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 @220 if (time_in_range_open(jiffies, next_window, next_window + 10)
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 221 next_window += LOAD_FREQ;
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 222
5ee9e5df kernel/sched/loadavg.c Matt Fleming 2017-02-08 223 this_rq->calc_load_update = next_window;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 224 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 225
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 226 static long calc_load_fold_idle(void)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 227 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 228 int idx = calc_load_read_idx();
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 229 long delta = 0;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 230
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 231 if (atomic_long_read(&calc_load_idle[idx]))
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 232 delta = atomic_long_xchg(&calc_load_idle[idx], 0);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 233
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 234 return delta;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 235 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 236
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 237 /**
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 238 * fixed_power_int - compute: x^n, in O(log n) time
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 239 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 240 * @x: base of the power
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 241 * @frac_bits: fractional bits of @x
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 242 * @n: power to raise @x to.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 243 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 244 * By exploiting the relation between the definition of the natural power
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 245 * function: x^n := x*x*...*x (x multiplied by itself for n times), and
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 246 * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 247 * (where: n_i \elem {0, 1}, the binary vector representing n),
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 248 * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 249 * of course trivially computable in O(log_2 n), the length of our binary
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 250 * vector.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 251 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 252 static unsigned long
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 253 fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 254 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 255 unsigned long result = 1UL << frac_bits;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 256
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 257 if (n) {
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 258 for (;;) {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 259 if (n & 1) {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 260 result *= x;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 261 result += 1UL << (frac_bits - 1);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 262 result >>= frac_bits;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 263 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 264 n >>= 1;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 265 if (!n)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 266 break;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 267 x *= x;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 268 x += 1UL << (frac_bits - 1);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 269 x >>= frac_bits;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 270 }
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 271 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 272
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 273 return result;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 274 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 275
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 276 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 277 * a1 = a0 * e + a * (1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 278 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 279 * a2 = a1 * e + a * (1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 280 * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 281 * = a0 * e^2 + a * (1 - e) * (1 + e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 282 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 283 * a3 = a2 * e + a * (1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 284 * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 285 * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 286 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 287 * ...
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 288 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 289 * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 290 * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 291 * = a0 * e^n + a * (1 - e^n)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 292 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 293 * [1] application of the geometric series:
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 294 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 295 * n 1 - x^(n+1)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 296 * S_n := \Sum x^i = -------------
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 297 * i=0 1 - x
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 298 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 299 static unsigned long
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 300 calc_load_n(unsigned long load, unsigned long exp,
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 301 unsigned long active, unsigned int n)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 302 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 303 return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 304 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 305
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 306 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 307 * NO_HZ can leave us missing all per-cpu ticks calling
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 308 * calc_load_account_active(), but since an idle CPU folds its delta into
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 309 * calc_load_tasks_idle per calc_load_account_idle(), all we need to do is fold
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 310 * in the pending idle delta if our idle period crossed a load cycle boundary.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 311 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 312 * Once we've updated the global active value, we need to apply the exponential
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 313 * weights adjusted to the number of cycles missed.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 314 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 315 static void calc_global_nohz(void)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 316 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 317 long delta, active, n;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 318
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 319 if (!time_before(jiffies, calc_load_update + 10)) {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 320 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 321 * Catch-up, fold however many we are behind still
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 322 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 323 delta = jiffies - calc_load_update - 10;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 324 n = 1 + (delta / LOAD_FREQ);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 325
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 326 active = atomic_long_read(&calc_load_tasks);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 327 active = active > 0 ? active * FIXED_1 : 0;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 328
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 329 avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 330 avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 331 avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 332
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 333 calc_load_update += n * LOAD_FREQ;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 334 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 335
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 336 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 337 * Flip the idle index...
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 338 *
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 339 * Make sure we first write the new time then flip the index, so that
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 340 * calc_load_write_idx() will see the new time when it reads the new
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 341 * index, this avoids a double flip messing things up.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 342 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 343 smp_wmb();
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 344 calc_load_idx++;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 345 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 346 #else /* !CONFIG_NO_HZ_COMMON */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 347
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 348 static inline long calc_load_fold_idle(void) { return 0; }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 349 static inline void calc_global_nohz(void) { }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 350
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 351 #endif /* CONFIG_NO_HZ_COMMON */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 352
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 353 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 354 * calc_load - update the avenrun load estimates 10 ticks after the
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 355 * CPUs have updated calc_load_tasks.
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 356 *
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 357 * Called from the global timer code.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 358 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 359 void calc_global_load(unsigned long ticks)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 360 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 361 long active, delta;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 362
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 363 if (time_before(jiffies, calc_load_update + 10))
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 364 return;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 365
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 366 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 367 * Fold the 'old' idle-delta to include all NO_HZ cpus.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 368 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 369 delta = calc_load_fold_idle();
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 370 if (delta)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 371 atomic_long_add(delta, &calc_load_tasks);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 372
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 373 active = atomic_long_read(&calc_load_tasks);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 374 active = active > 0 ? active * FIXED_1 : 0;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 375
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 376 avenrun[0] = calc_load(avenrun[0], EXP_1, active);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 377 avenrun[1] = calc_load(avenrun[1], EXP_5, active);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 378 avenrun[2] = calc_load(avenrun[2], EXP_15, active);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 379
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 380 calc_load_update += LOAD_FREQ;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 381
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 382 /*
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 383 * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 384 */
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 385 calc_global_nohz();
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 386 }
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 387
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 388 /*
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 389 * Called from scheduler_tick() to periodically update this CPU's
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 390 * active count.
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 391 */
3289bdb4 kernel/sched/loadavg.c Peter Zijlstra 2015-04-14 392 void calc_global_load_tick(struct rq *this_rq)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 393 {
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 394 long delta;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 395
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 396 if (time_before(jiffies, this_rq->calc_load_update))
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 397 return;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 398
d60585c5 kernel/sched/loadavg.c Thomas Gleixner 2016-07-12 399 delta = calc_load_fold_active(this_rq, 0);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 400 if (delta)
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 401 atomic_long_add(delta, &calc_load_tasks);
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 402
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 403 this_rq->calc_load_update += LOAD_FREQ;
45ceebf7 kernel/sched/proc.c Paul Gortmaker 2013-04-19 @404 }
:::::: The code at line 404 was first introduced by commit
:::::: 45ceebf77653975815d82fcf7cec0a164215ae11 sched: Factor out load calculation code from sched/core.c --> sched/proc.c
:::::: TO: Paul Gortmaker <paul.gortmaker@windriver.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>
---
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27628 bytes --]
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
2017-02-08 16:04 ` kbuild test robot
2017-02-08 16:07 ` kbuild test robot
@ 2017-02-09 15:07 ` Peter Zijlstra
2017-02-15 15:30 ` Frederic Weisbecker
2017-02-15 15:12 ` Frederic Weisbecker
3 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2017-02-09 15:07 UTC (permalink / raw)
To: Matt Fleming
Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Morten Rasmussen,
stable, Vincent Guittot
On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> The calculation for the next sample window when exiting NO_HZ idle
> does not handle the fact that we may not have reached the next sample
> window yet, i.e. that we came out of idle between sample windows.
>
> If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> window time when we went idle, but before the next sample window, we
> will add an unnecessary LOAD_FREQ delay to the load average
> accounting, delaying any update for potentially ~9 seconds.
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
Right, so differently put; the problem is that we check against the
'stale' rq->calc_load_update, while the current and effective period
boundary is 'calc_load_update'.
So, when rq->calc_load_update < jiffies < calc_load_update, we end up
setting the next-update to calc_load_update+LOAD_FREQ, where it should
have been calc_load_update.
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index a2d6eb71f06b..a7a6f3646970 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
>  	 * We woke inside or after the sample window, this means we're already
>  	 * accounted through the nohz accounting, so skip the entire deal and
>  	 * sync up for the next window.
> +	 *
> +	 * The next window is 'calc_load_update' if we haven't reached it yet,
> +	 * and 'calc_load_update + 10' if we're inside the current window.
>  	 */
> +	next_window = calc_load_update;
> +
> +	if (time_in_range_open(jiffies, next_window, next_window + 10)
> +		next_window += LOAD_FREQ;
> +
> +	this_rq->calc_load_update = next_window;
>  }
So I don't much like the time_in_range_open() thing. The simpler patch
which you tested to also work was:
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..cfb47bd0ee50 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -201,6 +201,8 @@ void calc_load_exit_idle(void)
 {
 	struct rq *this_rq = this_rq();
+	this_rq->calc_load_update = calc_load_update;
+
 	/*
 	 * If we're still before the sample window, we're done.
 	 */
@@ -212,7 +214,6 @@ void calc_load_exit_idle(void)
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update + 10))
 		this_rq->calc_load_update += LOAD_FREQ;
 }
But the problem there is that we unconditionally issue that store. Now
I've no idea how much of a problem that is, and it certainly is the
simplest form (+- comments that need updating), so maybe that makes
sense.
Alternatively, something like:
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..3dd4ce6fe151 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -207,12 +207,15 @@ void calc_load_exit_idle(void)
 	if (time_before(jiffies, this_rq->calc_load_update))
 		return;
+	this_rq->calc_load_update = calc_load_update;
+	if (time_before(jiffies, this_rq->calc_load_update))
+		return;
+
 	/*
 	 * We woke inside or after the sample window, this means we're already
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update + 10))
 		this_rq->calc_load_update += LOAD_FREQ;
 }
might be another solution.
Irrespective the above though; should we not make this:
+ this_rq->calc_load_update = READ_ONCE(calc_load_update);
because if for some reason we do a double load of calc_load_update and
see two different values, weird stuff could happen.
And because, on general principle, a READ_ONCE() should be paired with a
WRITE_ONCE(), that should be done too I suppose.
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
` (2 preceding siblings ...)
2017-02-09 15:07 ` Peter Zijlstra
@ 2017-02-15 15:12 ` Frederic Weisbecker
2017-02-15 16:16 ` Matt Fleming
3 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2017-02-15 15:12 UTC (permalink / raw)
To: Matt Fleming
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
Morten Rasmussen, stable, Vincent Guittot
On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> The calculation for the next sample window when exiting NO_HZ idle
> does not handle the fact that we may not have reached the next sample
> window yet
That sentence is hard to parse; it took me some time to figure out that
those two "next sample window" may not refer to the same thing.
Maybe it would be clearer with something along the lines of:
"The calculation for the next sample window when exiting NO_HZ
does not handle the fact that we may not have crossed any sample
window during the NO_HZ period."
> If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> window time when we went idle but before the next sample window
That too was hard to understand. How about:
"If we enter NO_HZ mode after a pending this_rq->calc_load_update
and we exit from NO_HZ mode before the forthcoming sample window, ..."
> we will add an unnecessary LOAD_FREQ delay to the load average
> accounting, delaying any update for potentially ~9 seconds.
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
>
> This issue is easy to reproduce before,
>
> commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
>
> just by forking short-lived process pipelines built from ps(1) and
> grep(1) in a loop. I'm unable to reproduce the spikes after that
> commit, but the bug still seems to be present from code review.
>
> Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: <stable@vger.kernel.org> # v3.5+
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
I'll comment the change on Peter's proposition.
Thanks!
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-09 15:07 ` Peter Zijlstra
@ 2017-02-15 15:30 ` Frederic Weisbecker
0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2017-02-15 15:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Matt Fleming, Ingo Molnar, linux-kernel, Mike Galbraith,
Morten Rasmussen, stable, Vincent Guittot
On Thu, Feb 09, 2017 at 04:07:53PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> > The calculation for the next sample window when exiting NO_HZ idle
> > does not handle the fact that we may not have reached the next sample
> > window yet, i.e. that we came out of idle between sample windows.
> >
> > If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> > window time when we want idle but before the next sample window, we
> > will add an unnecessary LOAD_FREQ delay to the load average
> > accounting, delaying any update for potentially ~9 seconds.
> >
> > This can result in huge spikes in the load average values due to
> > per-cpu uninterruptible task counts being out of sync when accumulated
> > across all CPUs.
> >
> > It's safe to update the per-cpu active count if we wake between sample
> > windows because any load that we left in 'calc_load_idle' will have
> > been zero'd when the idle load was folded in calc_global_load().
>
> Right, so differently put; the problem is that we check against the
> 'stale' rq->calc_load_update, while the current and effective period
> boundary is 'calc_load_update'.
>
> So, when rq->calc_load_update < jiffies < calc_load_update, we end up
> setting the next-update to calc_load_update+LOAD_FREQ, where it should
> have been calc_load_update.
>
> > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> > index a2d6eb71f06b..a7a6f3646970 100644
> > --- a/kernel/sched/loadavg.c
> > +++ b/kernel/sched/loadavg.c
>
> > @@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
> > * We woke inside or after the sample window, this means we're already
> > * accounted through the nohz accounting, so skip the entire deal and
> > * sync up for the next window.
> > + *
> > + * The next window is 'calc_load_update' if we haven't reached it yet,
> > + * and 'calc_load_update + 10' if we're inside the current window.
Hmm, the comment doesn't seem to match the code.
> > */
> > + next_window = calc_load_update;
> > +
> > + if (time_in_range_open(jiffies, next_window, next_window + 10))
> > + next_window += LOAD_FREQ;
> > +
> > + this_rq->calc_load_update = next_window;
> > }
>
> So I don't much like the time_in_range_open() thing. The simpler patch
> which you tested to also work was:
>
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index 7296b7308eca..cfb47bd0ee50 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -201,6 +201,8 @@ void calc_load_exit_idle(void)
> {
> struct rq *this_rq = this_rq();
>
> + this_rq->calc_load_update = calc_load_update;
> +
> /*
> * If we're still before the sample window, we're done.
> */
> @@ -212,7 +214,6 @@ void calc_load_exit_idle(void)
> * accounted through the nohz accounting, so skip the entire deal and
> * sync up for the next window.
> */
> - this_rq->calc_load_update = calc_load_update;
> if (time_before(jiffies, this_rq->calc_load_update + 10))
> this_rq->calc_load_update += LOAD_FREQ;
> }
>
> But the problem there is that we unconditionally issue that store. Now
> I've no idea how much of a problem that is, and it certainly is the
> simplest form (+- comments that need updating), so maybe that makes
> sense.
Well, that version looks fine to me.
Thanks.
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-15 15:12 ` Frederic Weisbecker
@ 2017-02-15 16:16 ` Matt Fleming
2017-02-15 16:44 ` Frederic Weisbecker
0 siblings, 1 reply; 8+ messages in thread
From: Matt Fleming @ 2017-02-15 16:16 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
Morten Rasmussen, stable, Vincent Guittot
On Wed, 15 Feb, at 04:12:11PM, Frederic Weisbecker wrote:
> On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> > > The calculation for the next sample window when exiting NO_HZ idle
> > does not handle the fact that we may not have reached the next sample
> > window yet
>
> That sentence is hard to parse, it took me some time to figure out that
> those two "next sample window" may not refer to the same thing.
Yeah, it's not the most lucid thing I've ever written.
> Maybe it would be clearer with something along the lines of:
>
> "The calculation for the next sample window when exiting NO_HZ
> does not handle the fact that we may not have crossed any sample
> window during the NO_HZ period."
Umm... this isn't the problem. In fact, it's the opposite.
The problem is that if we *did* cross a sample window while in NO_HZ,
then when we exit the pending window may be far enough into the future
that all we need to do is update this_rq->calc_load_update.
> > If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> > window time when we want idle but before the next sample window
>
> That too was hard to understand. How about:
>
> "If we enter in NO_HZ mode after a pending this_rq->calc_load_update
> and we exit from NO_HZ mode before the forthcoming sample window, ..."
You've got this backwards again. We enter NO_HZ before the pending
window, not after.
* Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
2017-02-15 16:16 ` Matt Fleming
@ 2017-02-15 16:44 ` Frederic Weisbecker
0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2017-02-15 16:44 UTC (permalink / raw)
To: Matt Fleming
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Mike Galbraith,
Morten Rasmussen, stable, Vincent Guittot
On Wed, Feb 15, 2017 at 04:16:19PM +0000, Matt Fleming wrote:
> On Wed, 15 Feb, at 04:12:11PM, Frederic Weisbecker wrote:
> > On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> > > The calculation for the next sample window when exiting NO_HZ idle
> > > does not handle the fact that we may not have reached the next sample
> > > window yet
> >
> > That sentence is hard to parse, it took me some time to figure out that
> > those two "next sample window" may not refer to the same thing.
>
> Yeah, it's not the most lucid thing I've ever written.
>
> > Maybe it would be clearer with something along the lines of:
> >
> > "The calculation for the next sample window when exiting NO_HZ
> > does not handle the fact that we may not have crossed any sample
> > window during the NO_HZ period."
>
> Umm... this isn't the problem. In fact, it's the opposite.
>
> The problem is that if we *did* cross a sample window while in NO_HZ,
> then when we exit the pending window may be far enough into the future
> that all we need to do is update this_rq->calc_load_update.
Ah right. Well, see the problem statement wasn't clear to me ;-)
>
> > > If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> > > window time when we want idle but before the next sample window
> >
> > That too was hard to understand. How about:
> >
> > "If we enter in NO_HZ mode after a pending this_rq->calc_load_update
> > and we exit from NO_HZ mode before the forthcoming sample window, ..."
>
> You've got this backwards again. We enter NO_HZ before the pending
> window, not after.
Right.
Thread overview: 8+ messages
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
2017-02-08 16:04 ` kbuild test robot
2017-02-08 16:07 ` kbuild test robot
2017-02-09 15:07 ` Peter Zijlstra
2017-02-15 15:30 ` Frederic Weisbecker
2017-02-15 15:12 ` Frederic Weisbecker
2017-02-15 16:16 ` Matt Fleming
2017-02-15 16:44 ` Frederic Weisbecker