From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753333Ab3KPVhw (ORCPT ); Sat, 16 Nov 2013 16:37:52 -0500 Received: from mout.gmx.net ([212.227.15.18]:57030 "EHLO mout.gmx.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752127Ab3KPVho (ORCPT ); Sat, 16 Nov 2013 16:37:44 -0500 Date: Sat, 16 Nov 2013 22:37:40 +0100 From: Christian Engelmayer To: Ingo Molnar , Stanislaw Gruszka Cc: linux-kernel@vger.kernel.org Subject: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale_stime() Message-ID: <20131116223740.3cd579cb@spike> X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.20; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Provags-ID: V03:K0:ztiDABO0t3AiwC+e/yn+wDMpAQv+itiXY7HyWQs75h1inviGFg0 y18FEQL+6TAvJBPhmtaDPwuUfXgIyC38WK+i3KhfZQdZtL2KouT3vajlZ+3XPIQVHNNYOC3 FCf633AsFycKQrgZ3AHnR6UHp6UexnyYF/i6ZZ9Gjmxu8HP1t9vIEvNvrJqGnri1icdGk21 8fLPLftBInYML/0NK6Efw== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Since upgrading from v3.8 to v3.12 I see random crashes in function scale_stime() in kernel/sched/cputime.c: divide error: 0000 [#1] SMP Modules linked in: parport_pc(F) ppdev(F) bnep rfcomm bluetooth binfmt_misc(F) zl10353 cx88_dvb cx88_vp3054_i2c videobuf_dvb dvb_core intel_powerclamp coretemp kvm_intel(F) tuner_xc2028 kvm(F) i915 snd_hda_codec_hdmi snd_hda_codec_realtek cx8800 cx8802 tuner snd_hda_intel snd_hda_codec cx88_alsa crct10dif_pclmul(F) crc32_pclmul(F) snd_hwdep(F) snd_pcm(F) snd_page_alloc(F) ghash_clmulni_intel(F) aesni_intel(F) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F) joydev(F) cx88xx snd_seq_device(F) snd_timer(F) aes_x86_64(F) lrw(F) gf128mul(F) glue_helper(F) video(F) btcx_risc drm_kms_helper ablk_helper(F) tveeprom cryptd(F) lp(F) videobuf_dma_sg rc_core drm v4l2_common videobuf_core mei_me parport(F) snd(F) mei soundcore(F) videodev i2c_algo_bit serio_raw(F) microcode(F) mac_hid lpc_ich asus_atk0110 hid_generic usbhid hid usb_storage(F) firewire_ohci ahci(F) libahci(F) firewire_core r8169 crc_itu_t(F) mii(F) CPU: 3 PID: 15367 Comm: htop Tainted: GF 3.12.0-031200-generic #201311031935 Hardware name: System manufacturer System Product Name/P7H55-M PRO, BIOS 1709 01/04/2011 task: ffff8800cc09e000 ti: ffff8800af620000 task.ti: ffff8800af620000 RIP: 0010:[] [] cputime_adjust+0xf0/0x110 RSP: 0018:ffff8800af621cc8 EFLAGS: 00010847 RAX: 85fdc1fef4047c00 RBX: 0000000000000000 RCX: ffff8800af621df8 RDX: 0000000000000000 RSI: ffff8800cc0634d0 RDI: 0000000000000000 RBP: ffff8800af621cd8 R08: 00000000fffffffe R09: 0000000000000000 R10: 0000000000000000 R11: fffffffe03427acc R12: ffff8800af621df0 R13: ffff8800af621df8 R14: 0000000000000000 R15: ffff8800cc063300 FS: 00007f22a387d740(0000) GS:ffff880117c60000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f22a3892000 CR3: 0000000097023000 CR4: 00000000000007e0 Stack: ffff8800c37f0000 ffff8800af621df0 ffff8800af621d18 ffffffff8109aa51 000000000a7d8c00 0000000042fee100 fffffffe03427acc ffff8800bf112a80 ffff8800c37f0000 ffff8800c307c280 ffff8800af621e50 ffffffff8121f74b Call Trace: [] thread_group_cputime_adjusted+0x41/0x50 [] do_task_stat+0x8eb/0xb60 [] ? vma_compute_subtree_gap+0x50/0x50 [] proc_tgid_stat+0x14/0x20 [] proc_single_show+0x4d/0x90 [] seq_read+0x14e/0x390 [] vfs_read+0x95/0x160 [] SyS_read+0x49/0xa0 [] system_call_fastpath+0x1a/0x1f Code: 89 fa 49 c1 ea 20 4d 85 d2 74 ca 4c 89 c2 48 d1 ef 49 89 c0 48 d1 ea 48 89 d0 eb 9f 0f 1f 80 00 00 00 00 89 c0 31 d2 49 0f af c0 <48> f7 f7 4c 89 df 48 29 c7 49 89 c3 e9 31 ff ff ff 66 66 66 66 RIP [] cputime_adjust+0xf0/0x110 RSP ---[ end trace dbafd2159a385dd6 ]--- The affected LOC performing the division by 0 was introduced in commit commit 55eaa7c1f511af5fb6ef808b5328804f4d4e5243 Author: Stanislaw Gruszka Date: Tue Apr 30 17:14:42 2013 +0200 sched: Avoid cputime scaling overflow For the problem to occur the function is called eg. with the following input parameters stime: 0x3567e00 rtime: 0xffffffffbf1abfdb total: 0x3938700 which leads to 'total' being shifted to 0 during the adaption of the precision and is then used without further check in scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total); The root cause triggering this issue seems to be an overflowed value of rtime = nsecs_to_cputime(curr->sum_exec_runtime); On the affected machine the problem can be triggered by loading the previously idle system by starting a full kernel build. The problem occurs within a minute after the ondemand frequency scaling governor adjusts the frequency from the minimum to the maximum. The x86 init check whether all booted CPUs have their TSC's synchronized, never failed so far, however, the tsc clocksource is sporadically marked unstable. Clocksource tsc unstable (delta = -74994678 ns) The used CPU provides an Intel Invariant TSC as stated by CPUID.80000007H:EDX[8]: eax in eax ebx ecx edx 00000000 0000000b 756e6547 6c65746e 49656e69 00000001 00020652 04100800 0298e3ff bfebfbff 00000002 55035a01 00f0b2e3 00000000 09ca212c 00000003 00000000 00000000 00000000 00000000 00000004 00000000 00000000 00000000 00000000 00000005 00000040 00000040 00000003 00001120 00000006 00000005 00000002 00000001 00000000 00000007 00000000 00000000 00000000 00000000 00000008 00000000 00000000 00000000 00000000 00000009 00000000 00000000 00000000 00000000 0000000a 07300403 00000004 00000000 00000603 0000000b 00000000 00000000 0000002c 00000004 80000000 80000008 00000000 00000000 00000000 80000001 00000000 00000000 00000001 28100800 80000002 65746e49 2952286c 726f4320 4d542865 80000003 35692029 55504320 20202020 20202020 80000004 30353620 20402020 30322e33 007a4847 80000005 00000000 00000000 00000000 00000000 80000006 00000000 00000000 01006040 00000000 80000007 00000000 00000000 00000000 00000100 80000008 00003024 00000000 00000000 00000000 Vendor ID: "GenuineIntel"; CPUID level 11 Intel-specific functions: Version 00020652: Type 0 - Original OEM Family 6 - Pentium Pro Model 5 - Pentium II Model 5/Xeon/Celeron Stepping 2 Reserved 8 Extended brand string: "Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz" CLFLUSH instruction cache line size: 8 Initial APIC ID: 4 Hyper threading siblings: 16 Feature flags bfebfbff: FPU Floating Point Unit VME Virtual 8086 Mode Enhancements DE Debugging Extensions PSE Page Size Extensions TSC Time Stamp Counter MSR Model Specific Registers PAE Physical Address Extension MCE Machine Check Exception CX8 COMPXCHG8B Instruction APIC On-chip Advanced Programmable Interrupt Controller present and enabled SEP Fast System Call MTRR Memory Type Range Registers PGE PTE Global Flag MCA Machine Check Architecture CMOV Conditional Move and Compare Instructions FGPAT Page Attribute Table PSE-36 36-bit Page Size Extension CLFSH CFLUSH instruction DS Debug store ACPI Thermal Monitor and Clock Ctrl MMX MMX instruction set FXSR Fast FP/MMX Streaming SIMD Extensions save/restore SSE Streaming SIMD Extensions instruction set SSE2 SSE2 extensions SS Self Snoop HT Hyper Threading TM Thermal monitor 31 reserved Nevertheless, when looking into the issue I saw occurences of sched_clock going backwards as if the TSCs were read out of sync. Accordingly the problem does not occur when either booting with option 'nosmp' or when forcing the TSC to be marked as unstable for sched_clock. Booting with 'acpi=off' and no frequency scaling works too. Having a look at the scheduler code I see the following pattern that would also catch a time warp, eg. kernel/sched/rt.c update_curr_rt(): u64 delta_exec; delta_exec = rq_clock_task(rq) - curr->se.exec_start; if (unlikely((s64)delta_exec <= 0)) return; However, there are still vulnerable places, eg. kernel/sched/fair.c update_curr(): /* * Get the amount of time the current task was running * since the last time we changed load (this cannot * overflow on 32 bits): */ delta_exec = (unsigned long)(now - curr->exec_start); if (!delta_exec) return; Regards, Christian