From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1424166AbcFIJCE (ORCPT ); Thu, 9 Jun 2016 05:02:04 -0400 Received: from mail.fireflyinternet.com ([87.106.93.118]:57045 "EHLO fireflyinternet.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1424024AbcFIJCA (ORCPT ); Thu, 9 Jun 2016 05:02:00 -0400 X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Date: Thu, 9 Jun 2016 10:01:42 +0100 From: Chris Wilson To: Peter Zijlstra Cc: Andrey Ryabinin , Yuyang Du , Linus Torvalds , Mike Galbraith , Thomas Gleixner , bsegall@google.com, morten.rasmussen@arm.com, pjt@google.com, steve.muckle@linaro.org, linux-kernel@vger.kernel.org Subject: Divide-by-zero in post_init_entity_util_avg Message-ID: <20160609090142.GS32344@nuc-i3427.alporthouse.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [15774.966082] divide error: 0000 [#1] SMP [15774.966137] Modules linked in: i915 intel_gtt [15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330 [15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000 [15774.966377] RIP: 0010:[] [] post_init_entity_util_avg+0x43/0x80 [15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057 [15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8 [15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000 [15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008 [15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544 [15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda [15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 [15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0 [15774.966860] Stack: [15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7 [15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8 [15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000 [15774.967162] Call Trace: [15774.967208] [] wake_up_new_task+0x95/0x130 [15774.967256] [] _do_fork+0x184/0x400 [15774.967302] [] SyS_clone+0x37/0x50 [15774.967353] [] do_syscall_64+0x5d/0xe0 [15774.967402] [] entry_SYSCALL64_slow_path+0x25/0x25 [15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e [15774.968086] RIP [] post_init_entity_util_avg+0x43/0x80 [15774.968142] RSP I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to a bounded value") to be at fault, hence the CCs. Though it may just be a victim. gdb says 0x43/0x80 is 725 if (cfs_rq->avg.util_avg != 0) { 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1); 728 729 if (sa->util_avg > cap) 730 sa->util_avg = cap; 731 } else { I've run the same fork-heavy workload that seemed to hit the initial fault under kasan. kasan has not reported any errors, nor has the bug reoccurred after a day (earlier I had a couple of panics within a few hours). Is it possible for a race window where cfg_rq->avg.load_avg is indeed -1? Any evidence of other memcorruption in the above? -Chris -- Chris Wilson, Intel Open Source Technology Centre