* Divide-by-zero in post_init_entity_util_avg @ 2016-06-09 9:01 Chris Wilson 2016-06-09 1:33 ` Yuyang Du 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 0 siblings, 2 replies; 14+ messages in thread From: Chris Wilson @ 2016-06-09 9:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel [15774.966082] divide error: 0000 [#1] SMP [15774.966137] Modules linked in: i915 intel_gtt [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330 [15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000 [15774.966377] RIP: 0010:[<ffffffff810b12e3>] [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 [15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057 [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8 [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000 [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008 [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544 [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda [15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 [15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0 [15774.966860] Stack: [15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7 [15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8 [15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000 [15774.967162] Call Trace: [15774.967208] [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130 [15774.967256] [<ffffffff8107a364>] 
_do_fork+0x184/0x400 [15774.967302] [<ffffffff8107a6b7>] SyS_clone+0x37/0x50 [15774.967353] [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0 [15774.967402] [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25 [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e [15774.968086] RIP [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 [15774.968142] RSP <ffff880272c13e30> I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a bounded value") to be at fault, hence the CCs. Though it may just be a victim. gdb says 0x43/0x80 is 725 if (cfs_rq->avg.util_avg != 0) { 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); 728 729 if (sa->util_avg > cap) 730 sa->util_avg = cap; 731 } else { I've run the same fork-heavy workload that seemed to hit the initial fault under kasan. kasan has not reported any errors, nor has the bug reoccurred after a day (earlier I had a couple of panics within a few hours). Is it possible there is a race window where cfs_rq->avg.load_avg is indeed -1? Any evidence of other memory corruption in the above? -Chris -- Chris Wilson, Intel Open Source Technology Centre ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson @ 2016-06-09 1:33 ` Yuyang Du 2016-06-09 13:07 ` Peter Zijlstra 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 1 sibling, 1 reply; 14+ messages in thread From: Yuyang Du @ 2016-06-09 1:33 UTC (permalink / raw) To: Chris Wilson Cc: Peter Zijlstra, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote: > I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's > util avg to a bounded value") to be at fault, hence the CCs. Though it > may just be a victim. > > gdb says 0x43/0x80 is > > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { > > I've run the same fork-heavy workload that seemed to hit the initial > fault under kasan. kasan has not reported any errors, nor has the bug > reoccurred after a day (earlier I had a couple of panics within a few > hours). > > Is it possible there is a race window where cfs_rq->avg.load_avg is indeed > -1? Any evidence of other memory corruption in the above? -1 should not be possible; it sounds like a soft error. But a race is hazardous anyway. Thanks a lot, Chris. -- Subject: [PATCH] sched/fair: Avoid hazardous reading of cfs_rq->avg.load_avg without rq lock Commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a bounded value") reads cfs_rq->avg.load_avg without holding the rq lock, and the value is then used as a divisor (actually cfs_rq->avg.load_avg + 1). This race can cause a divide-by-zero exception. Fix it by moving the read into the rq-locked section. 
Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Yuyang Du <yuyang.du@intel.com> --- kernel/sched/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 385c947..b9f44df 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif + rq = __task_rq_lock(p, &rf); /* Post initialize new task's util average when its cfs_rq is set */ post_init_entity_util_avg(&p->se); - - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); -- ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 1:33 ` Yuyang Du @ 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-09 13:07 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel Chris Wilson reported a divide by 0 at: post_init_entity_util_avg(): > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { Which given the lack of serialization, and the code generated from update_cfs_rq_load_avg() is entirely possible. if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } turns into: ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax ffffffff8108706b: 48 85 c0 test %rax,%rax ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> ffffffff81087070: 4c 89 f8 mov %r15,%rax ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) ffffffff8108707e: 4c 89 f9 mov %r15,%rcx ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx Which you'll note ends up with sa->load_avg -= r in memory at ffffffff8108707a. Ludicrous code generation if you ask me; I'd have expected something like (note, r15 holds 0): mov %r15, %rax xchg %rax, cfs_rq->removed_load_avg mov sa->load_avg, %rcx sub %rax, %rcx cmovs %r15, %rcx mov %rcx, sa->load_avg Adding the serialization (to _both_ call sites) should fix this. 
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value") Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 3 +-- kernel/sched/fair.c | 8 +++++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 385c947482e1..4aff10e3bd14 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - /* Post initialize new task's util average when its cfs_rq is set */ + rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se); - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6dd8bab010c..f379da14c7dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8468,8 +8468,9 @@ void free_fair_sched_group(struct task_group *tg) int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { - struct cfs_rq *cfs_rq; struct sched_entity *se; + struct cfs_rq *cfs_rq; + struct rq *rq; int i; tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL); @@ -8484,6 +8485,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_bandwidth(tg_cfs_bandwidth(tg)); for_each_possible_cpu(i) { + rq = cpu_rq(i); + cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, cpu_to_node(i)); if (!cfs_rq) @@ -8497,7 +8500,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); init_entity_runnable_average(se); + + raw_spin_lock_irq(&rq->lock); post_init_entity_util_avg(se); + raw_spin_unlock_irq(&rq->lock); } return 1; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 13:07 ` Peter Zijlstra @ 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: Yuyang Du @ 2016-06-12 22:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote: > Which given the lack of serialization, and the code generated from > update_cfs_rq_load_avg() is entirely possible. > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. Surprising. I actually tweaked this a little bit, but never got desirable generated code. Can any compiler expert shed some light on it? 
> Ludicrous code generation if you ask me; I'd have expected something > like (note, r15 holds 0): > > mov %r15, %rax > xchg %rax, cfs_rq->removed_load_avg > mov sa->load_avg, %rcx > sub %rax, %rcx > cmovs %r15, %rcx > mov %rcx, sa->load_avg > > Adding the serialization (to _both_ call sites) should fix this. Absolutely, both :) I am going to remove the group entity post util initialization soon in the flat hierarchical util implementation; it is not used anywhere. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du @ 2016-06-14 11:25 ` tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: tip-bot for Peter Zijlstra @ 2016-06-14 11:25 UTC (permalink / raw) To: linux-tip-commits Cc: aryabinin, tglx, mingo, efault, hpa, peterz, chris, torvalds, yuyang.du, linux-kernel Commit-ID: b7fa30c9cc48c4f55663420472505d3b4f6e1705 Gitweb: http://git.kernel.org/tip/b7fa30c9cc48c4f55663420472505d3b4f6e1705 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 9 Jun 2016 15:07:50 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Tue, 14 Jun 2016 10:58:34 +0200 sched/fair: Fix post_init_entity_util_avg() serialization Chris Wilson reported a divide by 0 at: post_init_entity_util_avg(): > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { Which given the lack of serialization, and the code generated from update_cfs_rq_load_avg() is entirely possible: if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sa->load_avg = max_t(long, sa->load_avg - r, 0); sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); removed_load = 1; } turns into: ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax ffffffff8108706b: 48 85 c0 test %rax,%rax ffffffff8108706e: 74 40 je ffffffff810870b0 ffffffff81087070: 4c 89 f8 mov %r15,%rax ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) ffffffff8108707e: 4c 89 f9 mov %r15,%rcx ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) ffffffff8108708b: 49 0f 49 4d 70 cmovns 
0x70(%r13),%rcx Which you'll note ends up with 'sa->load_avg - r' in memory at ffffffff8108707a. By calling post_init_entity_util_avg() under rq->lock we're sure to be fully serialized against PELT updates and cannot observe intermediate state like this. Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: bsegall@google.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: steve.muckle@linaro.org Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value") Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/sched/core.c | 3 +-- kernel/sched/fair.c | 8 +++++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 017d539..13d0896 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - /* Post initialize new task's util average when its cfs_rq is set */ + rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se); - rq = __task_rq_lock(p, &rf); activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 218f8e8..4e33ad1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8496,8 +8496,9 @@ void free_fair_sched_group(struct task_group *tg) int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { - struct cfs_rq *cfs_rq; struct sched_entity *se; + struct cfs_rq *cfs_rq; + struct rq *rq; int i; tg->cfs_rq = 
kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL); @@ -8512,6 +8513,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_bandwidth(tg_cfs_bandwidth(tg)); for_each_possible_cpu(i) { + rq = cpu_rq(i); + cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, cpu_to_node(i)); if (!cfs_rq) @@ -8525,7 +8528,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); init_entity_runnable_average(se); + + raw_spin_lock_irq(&rq->lock); post_init_entity_util_avg(se); + raw_spin_unlock_irq(&rq->lock); } return 1; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra @ 2016-06-16 8:50 ` Peter Zijlstra 2016-06-16 12:25 ` Peter Zijlstra 2 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 8:50 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 09, 2016 at 03:07:50PM +0200, Peter Zijlstra wrote: > Which given the lack of serialization, and the code generated from > update_cfs_rq_load_avg() is entirely possible. > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. > > Ludicrous code generation if you ask me; I'd have expected something > like (note, r15 holds 0): > > mov %r15, %rax > xchg %rax, cfs_rq->removed_load_avg > mov sa->load_avg, %rcx > sub %rax, %rcx > cmovs %r15, %rcx > mov %rcx, sa->load_avg So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly triggered recollection of this here problem. Now, we really do not want to go grab rq->lock there, so I did the below. Which actually ends up generating the 'right' code as per the above. 3133: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) 313a: 49 8b 4d 70 mov 0x70(%r13),%rcx 313e: bb 01 00 00 00 mov $0x1,%ebx 3143: 48 29 c1 sub %rax,%rcx 3146: 49 0f 48 cf cmovs %r15,%rcx 314a: 48 69 c0 82 45 ff ff imul $0xffffffffffff4582,%rax,%rax 3151: 49 89 4d 70 mov %rcx,0x70(%r13) This ensures the negative value never hits memory and allows the unserialized use. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/fair.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f75930bdd326..3fd3d903e6b6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) } } +/* + * Explicitly do a load-store to ensure the temporary value never hits memory. + * This allows lockless observations without ever seeing the negative values. + * + * Incidentally, this also generates much saner code for x86. 
+ */ +#define sub_positive(type, ptr, val) do { \ + type tmp = READ_ONCE(*ptr); \ + tmp -= (val); \ + if (tmp < 0) \ + tmp = 0; \ + WRITE_ONCE(*ptr, tmp); \ +} while (0) + /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); - sa->load_avg = max_t(long, sa->load_avg - r, 0); - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); + sub_positive(long, &sa->load_avg, r); + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); removed_load = 1; } if (atomic_long_read(&cfs_rq->removed_util_avg)) { long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0); - sa->util_avg = max_t(long, sa->util_avg - r, 0); - sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0); + sub_positive(long, &sa->util_avg, r); + sub_positive(s32, &sa->util_sum, r * LOAD_AVG_MAX); removed_util = 1; } @@ -2968,10 +2982,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s &se->avg, se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL); - cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0); - cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0); - cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0); - cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0); + sub_positive(long, &cfs_rq->avg.load_avg, se->avg.load_avg); + sub_positive(s64, &cfs_rq->avg.load_sum, se->avg.load_sum); + sub_positive(long, &cfs_rq->avg.util_avg, se->avg.util_avg); + sub_positive(s32, &cfs_rq->avg.util_sum, se->avg.util_sum); cfs_rq_util_change(cfs_rq); } ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra @ 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 12:25 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f75930bdd326..3fd3d903e6b6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) > } > } > > +/* > + * Explicitly do a load-store to ensure the temporary value never hits memory. > + * This allows lockless observations without ever seeing the negative values. > + * > + * Incidentally, this also generates much saner code for x86. > + */ > +#define sub_positive(type, ptr, val) do { \ > + type tmp = READ_ONCE(*ptr); \ > + tmp -= (val); \ > + if (tmp < 0) \ > + tmp = 0; \ > + WRITE_ONCE(*ptr, tmp); \ > +} while (0) > + > /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ > static inline int > update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > - sa->load_avg = max_t(long, sa->load_avg - r, 0); > - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > + sub_positive(long, &sa->load_avg, r); > + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); Hmm, so either we should change these variables to signed types as forced here, or this logic (along with the former) is plain wrong. 
As it stands any unsigned value with the MSB set will wipe the field after this subtraction. I suppose instead we'd want something like: tmp = READ_ONCE(*ptr); if (tmp > val) tmp -= val; else tmp = 0; WRITE_ONCE(*ptr, tmp); In order to generate: xchg %rax,0xa0(%r13) mov 0x78(%r13),%rcx sub %rax,%rcx cmovae %r15,%rcx mov %rcx,0x78(%r13) however, GCC isn't smart enough and generates: xchg %rax,0x98(%r13) mov 0x70(%r13),%rsi mov %rsi,%rcx sub %rax,%rcx cmp %rsi,%rax cmovae %r15,%rcx mov %rcx,0x70(%r13) Doing a CMP with the _same_ values it does the SUB with, resulting in exactly the same CC values. (this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian package which I suppose I should just give up and do a source build) Opinions? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 12:25 ` Peter Zijlstra @ 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2 siblings, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-16 16:16 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Thu, Jun 16, 2016 at 02:25:04PM +0200, Peter Zijlstra wrote: > On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index f75930bdd326..3fd3d903e6b6 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) > > } > > } > > > > +/* > > + * Explicitly do a load-store to ensure the temporary value never hits memory. > > + * This allows lockless observations without ever seeing the negative values. > > + * > > + * Incidentally, this also generates much saner code for x86. 
> > + */ > > +#define sub_positive(type, ptr, val) do { \ > > + type tmp = READ_ONCE(*ptr); \ > > + tmp -= (val); \ > > + if (tmp < 0) \ > > + tmp = 0; \ > > + WRITE_ONCE(*ptr, tmp); \ > > +} while (0) > > + > > /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ > > static inline int > > update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) > > > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > > - sa->load_avg = max_t(long, sa->load_avg - r, 0); > > - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > > + sub_positive(long, &sa->load_avg, r); > > + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); > > Hmm, so either we should change these variables to signed types as > forced here, or this logic (along with the former) is plain wrong. > > As it stands any unsigned value with the MSB set will wipe the field > after this subtraction. > > I suppose instead we'd want something like: > > tmp = READ_ONCE(*ptr); > if (tmp > val) > tmp -= val; > else > tmp = 0; > WRITE_ONCE(*ptr, tmp); Stackoverflow suggested this pattern for (unsigned) underflow checking: r = a - b; if ((r = a - b) > a) underflow() should generate the right asm, but no, that doesn't work either. http://stackoverflow.com/questions/24958469/subtract-and-detect-underflow-most-efficient-way-x86-64-with-gcc ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra @ 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 8:23 ` Peter Zijlstra 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2 siblings, 1 reply; 14+ messages in thread From: Andrey Ryabinin @ 2016-06-17 8:16 UTC (permalink / raw) To: Peter Zijlstra, Yuyang Du Cc: Chris Wilson, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On 06/16/2016 03:25 PM, Peter Zijlstra wrote: > On Thu, Jun 16, 2016 at 10:50:40AM +0200, Peter Zijlstra wrote: >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index f75930bdd326..3fd3d903e6b6 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -2878,6 +2878,20 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) >> } >> } >> >> +/* >> + * Explicitly do a load-store to ensure the temporary value never hits memory. >> + * This allows lockless observations without ever seeing the negative values. >> + * >> + * Incidentally, this also generates much saner code for x86. 
>> + */ >> +#define sub_positive(type, ptr, val) do { \ >> + type tmp = READ_ONCE(*ptr); \ >> + tmp -= (val); \ >> + if (tmp < 0) \ >> + tmp = 0; \ >> + WRITE_ONCE(*ptr, tmp); \ >> +} while (0) >> + >> /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ >> static inline int >> update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) >> @@ -2887,15 +2901,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) >> >> if (atomic_long_read(&cfs_rq->removed_load_avg)) { >> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); >> - sa->load_avg = max_t(long, sa->load_avg - r, 0); >> - sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); >> + sub_positive(long, &sa->load_avg, r); >> + sub_positive(s64, &sa->load_sum, r * LOAD_AVG_MAX); > > Hmm, so either we should change these variables to signed types as > forced here, or this logic (along with the former) is plain wrong. > > As it stands any unsigned value with the MSB set will wipe the field > after this subtraction. > > I suppose instead we'd want something like: > > tmp = READ_ONCE(*ptr); > if (tmp > val) > tmp -= val; > else > tmp = 0; > WRITE_ONCE(*ptr, tmp); > > In order to generate: > > xchg %rax,0xa0(%r13) > mov 0x78(%r13),%rcx > sub %rax,%rcx > cmovae %r15,%rcx > mov %rcx,0x78(%r13) > > however, GCC isn't smart enough and generates: > > xchg %rax,0x98(%r13) > mov 0x70(%r13),%rsi > mov %rsi,%rcx > sub %rax,%rcx > cmp %rsi,%rax > cmovae %r15,%rcx > mov %rcx,0x70(%r13) > > Doing a CMP with the _same_ values it does the SUB with, resulting in > exactly the same CC values. > FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507 (Reported: 2001-07-01) > (this is with gcc-5.3, I'm still trying to build gcc-6.1 from the debian > package which I suppose I should just give up and do a source build) > > Opinions? > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-17 8:16 ` Andrey Ryabinin @ 2016-06-17 8:23 ` Peter Zijlstra 0 siblings, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-17 8:23 UTC (permalink / raw) To: Andrey Ryabinin Cc: Yuyang Du, Chris Wilson, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Fri, Jun 17, 2016 at 11:16:24AM +0300, Andrey Ryabinin wrote: > > I suppose instead we'd want something like: > > > > tmp = READ_ONCE(*ptr); > > if (tmp > val) > > tmp -= val; > > else > > tmp = 0; > > WRITE_ONCE(*ptr, tmp); > > > > In order to generate: > > > > xchg %rax,0xa0(%r13) > > mov 0x78(%r13),%rcx > > sub %rax,%rcx > > cmovae %r15,%rcx > > mov %rcx,0x78(%r13) > > > > however, GCC isn't smart enough and generates: > > > > xchg %rax,0x98(%r13) > > mov 0x70(%r13),%rsi > > mov %rsi,%rcx > > sub %rax,%rcx > > cmp %rsi,%rax > > cmovae %r15,%rcx > > mov %rcx,0x70(%r13) > > > > Doing a CMP with the _same_ values it does the SUB with, resulting in > > exactly the same CC values. > > > > FYI - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3507 (Reported: 2001-07-01) > I found this one when I was googling yesterday: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30315 But yes, it seems this is a 'known' issue. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin @ 2016-06-17 9:19 ` Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 2 siblings, 2 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-17 9:19 UTC (permalink / raw) To: Yuyang Du Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel In any case, I've settled on the below. It generates crap code, but hopefully GCC can be fixed sometime this century. --- Subject: sched/fair: Fix cfs_rq avg tracking underflow From: Peter Zijlstra <peterz@infradead.org> Date: Thu, 16 Jun 2016 10:50:40 +0200 As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly
triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed bit.
This reduces the effective width of the variables, IOW, it's effectively
the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow detection to
not rely on the signed bit. This allows the variables to use the 'full'
unsigned range. And it does so with explicit LOAD - STORE to ensure any
intermediate value will never be visible in memory, allowing these
unserialized loads.

Note: GCC generates crap code for this

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: pjt@google.com
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: steve.muckle@linaro.org
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2878,6 +2878,23 @@ static inline void cfs_rq_util_change(st
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2887,15 +2904,15 @@ update_cfs_rq_load_avg(u64 now, struct c
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2968,10 +2985,10 @@ static void detach_entity_load_avg(struc
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra @ 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: Yuyang Du @ 2016-06-17 2:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Wilson, Andrey Ryabinin, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel, kernel On Fri, Jun 17, 2016 at 11:19:48AM +0200, Peter Zijlstra wrote: > +/* > + * Unsigned subtract and clamp on underflow. > + * > + * Explicitly do a load-store to ensure the intermediate value never hits > + * memory. This allows lockless observations without ever seeing the negative > + * values. > + */ > +#define sub_positive(_ptr, _val) do { \ > + typeof(_ptr) ptr = (_ptr); \ > + typeof(*ptr) val = (_val); \ > + typeof(*ptr) res, var = READ_ONCE(*ptr); \ > + res = var - val; \ > + if (res > var) \ > + res = 0; \ > + WRITE_ONCE(*ptr, res); \ > +} while (0) > + maybe sub_nonnegative() or sub_til_zero() ... I wonder whether it is the if statement finally allows GCC to 'registerize' load_avg, and I almost tried it... ^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/urgent] sched/fair: Fix cfs_rq avg tracking underflow 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du @ 2016-06-20 13:24 ` tip-bot for Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: tip-bot for Peter Zijlstra @ 2016-06-20 13:24 UTC (permalink / raw) To: linux-tip-commits Cc: peterz, linux-kernel, yuyang.du, aryabinin, hpa, chris, mingo, tglx, efault, torvalds Commit-ID: 8974189222159154c55f24ddad33e3613960521a Gitweb: http://git.kernel.org/tip/8974189222159154c55f24ddad33e3613960521a Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 16 Jun 2016 10:50:40 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 20 Jun 2016 11:29:09 +0200 sched/fair: Fix cfs_rq avg tracking underflow As per commit: b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization") > the code generated from update_cfs_rq_load_avg(): > > if (atomic_long_read(&cfs_rq->removed_load_avg)) { > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); > sa->load_avg = max_t(long, sa->load_avg - r, 0); > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0); > removed_load = 1; > } > > turns into: > > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax > ffffffff8108706b: 48 85 c0 test %rax,%rax > ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0> > ffffffff81087070: 4c 89 f8 mov %r15,%rax > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13) > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13) > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13) > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx > > Which you'll note ends up with sa->load_avg -= r in memory at > ffffffff8108707a. So I _should_ have looked at other unserialized users of ->load_avg, but alas. 
Luckily nikbor reported a similar /0 from task_h_load() which instantly
triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed bit.
This reduces the effective width of the variables, IOW, it's effectively
the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow detection to
not rely on the signed bit. This allows the variables to use the 'full'
unsigned range. And it does so with explicit LOAD - STORE to ensure any
intermediate value will never be visible in memory, allowing these
unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2348de..2ae68f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2994,10 +3011,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: Divide-by-zero in post_init_entity_util_avg 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson 2016-06-09 1:33 ` Yuyang Du @ 2016-06-09 10:29 ` Peter Zijlstra 1 sibling, 0 replies; 14+ messages in thread From: Peter Zijlstra @ 2016-06-09 10:29 UTC (permalink / raw) To: Chris Wilson Cc: Andrey Ryabinin, Yuyang Du, Linus Torvalds, Mike Galbraith, Thomas Gleixner, bsegall, morten.rasmussen, pjt, steve.muckle, linux-kernel On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote: > > [15774.966082] divide error: 0000 [#1] SMP > [15774.966137] Modules linked in: i915 intel_gtt > [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330 > [15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 > [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000 > [15774.966377] RIP: 0010:[<ffffffff810b12e3>] [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 > [15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057 > [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8 > [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000 > [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008 > [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544 > [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda > [15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 > [15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0 > [15774.966860] Stack: > [15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7 > [15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8 > [15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000 > [15774.967162] Call Trace: 
> [15774.967208] [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130 > [15774.967256] [<ffffffff8107a364>] _do_fork+0x184/0x400 > [15774.967302] [<ffffffff8107a6b7>] SyS_clone+0x37/0x50 > [15774.967353] [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0 > [15774.967402] [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25 > [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e > [15774.968086] RIP [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80 > [15774.968142] RSP <ffff880272c13e30> > > I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's > util avg to a bounded value") to be at fault, hence the CCs. Though it > may just be a victim. > > gdb says 0x43/0x80 is > > 725 if (cfs_rq->avg.util_avg != 0) { > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); > 728 > 729 if (sa->util_avg > cap) > 730 sa->util_avg = cap; > 731 } else { > > I've run the same fork-heavy workload that seemed to hit the initial > fault under kasan. kasan has not reported any errors, nor has the bug > reoccurred after a day (earlier I had a couple of panics within a few > hours). > > Is it possible for a race window where cfg_rq->avg.load_avg is indeed > -1? Any evidence of other memcorruption in the above? > -Chris Maybe; I need to look harder and at the generated code; but at the very least serialization isn't right. We compute/update averages under rq->lock, but post_init_entity_avg() looks at these values without holding it. Something like the below should cure that, something similar should probably be done for the other callsite as well, I'll look harder after lunch. 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947482e1..289d99d91883 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2536,9 +2536,9 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
-	post_init_entity_util_avg(&p->se);
 
 	rq = __task_rq_lock(p, &rf);
+	post_init_entity_util_avg(&p->se);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);

^ permalink raw reply related	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2016-06-20 13:26 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-06-09 9:01 Divide-by-zero in post_init_entity_util_avg Chris Wilson 2016-06-09 1:33 ` Yuyang Du 2016-06-09 13:07 ` Peter Zijlstra 2016-06-12 22:25 ` Yuyang Du 2016-06-14 11:25 ` [tip:sched/core] sched/fair: Fix post_init_entity_util_avg() serialization tip-bot for Peter Zijlstra 2016-06-16 8:50 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra 2016-06-16 12:25 ` Peter Zijlstra 2016-06-16 16:16 ` Peter Zijlstra 2016-06-17 8:16 ` Andrey Ryabinin 2016-06-17 8:23 ` Peter Zijlstra 2016-06-17 9:19 ` [PATCH] sched/fair: Fix cfs_rq avg tracking underflow Peter Zijlstra 2016-06-17 2:01 ` Yuyang Du 2016-06-20 13:24 ` [tip:sched/urgent] " tip-bot for Peter Zijlstra 2016-06-09 10:29 ` Divide-by-zero in post_init_entity_util_avg Peter Zijlstra