All of lore.kernel.org
 help / color / mirror / Atom feed
* perf_fuzzer: lockup/reboot bug
@ 2014-03-03 20:34 Vince Weaver
  2014-03-04 21:32 ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-03 20:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: H. Peter Anvin, Peter Zijlstra, Ingo Molnar


Starting a new thread on this, in case this is unrelated to the cr2-saving 
bug that the previous thread ended up finding.

This is on a core2 system, running stock 3.14-rc5 (without any cr2 patches 
applied)

It is reproducible, unlike the similar looking reboot bug I was seeing 
before.

[ 4330.676015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 4330.684003] IP: [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[ 4330.684003] PGD bd2e1067 PUD adffa067 PMD 0 
[ 4330.684003] Oops: 0000 [#1] SMP 
[ 4330.684003] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg acpi_cpufreq evdev mcs7830 usbnet coretemp psmouse serio_raw pcspkr video wmi processor button thermal_sys ohci_pci ohci_hcd i2c_nforce2 sg ehci_pci ehci_hcd sd_mod usbcore usb_common
[ 4330.684003] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W    3.14.0-rc5 #32
[ 4330.684003] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[ 4330.684003] task: ffff88011b2b37e0 ti: ffff88011b340000 task.ti: ffff88011b340000
[ 4330.684003] RIP: 0010:[<ffffffff812a3867>]  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[ 4330.684003] RSP: 0018:ffff88011fc83de8  EFLAGS: 00010002
[ 4330.684003] RAX: ffff8800cb3b0010 RBX: ffff88011fc8da60 RCX: ffff8800b799c000
[ 4330.684003] RDX: 0000000000000040 RSI: ffff88011fc8d060 RDI: ffff8800cb3b0010
[ 4330.684003] RBP: ffff88011fc83de8 R08: ffff88011fc8dbd0 R09: 0000000000000002
[ 4330.684003] R10: 0000000000000001 R11: ffff88011b359028 R12: ffff88011fc8d060
[ 4330.684003] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88011fc8d050
[ 4330.684003] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
[ 4330.684003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4330.684003] CR2: 0000000000000040 CR3: 00000000adff8000 CR4: 00000000000407e0
[ 4330.684003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000001cfd000
[ 4330.684003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 000000000ff00628
[ 4330.684003] Stack:
[ 4330.684003]  ffff88011fc83e08 ffffffff812a447e ffff88011fc8da60 ffff88011fc8d050
[ 4330.684003]  ffff88011fc83e38 ffffffff8105e022 ffff8800c7b27340 ffff88011fc8da60
[ 4330.684003]  000003f57aac8f00 0000000000000000 ffff88011fc83ed8 ffffffff8105e265
[ 4330.684003] Call Trace:
[ 4330.684003]  <IRQ> 
[ 4330.684003]  [<ffffffff812a447e>] timerqueue_add+0x7a/0x98
[ 4330.684003]  [<ffffffff8105e022>] enqueue_hrtimer+0x51/0x7c
[ 4330.684003]  [<ffffffff8105e265>] __hrtimer_start_range_ns+0x218/0x2ff
[ 4330.684003]  [<ffffffff8105e364>] hrtimer_start+0x18/0x1a
[ 4330.684003]  [<ffffffff81091885>] __tick_nohz_idle_enter+0x2ce/0x387
[ 4330.684003]  [<ffffffff81091962>] tick_nohz_irq_exit+0x24/0x26
[ 4330.684003]  [<ffffffff81044582>] irq_exit+0x95/0x9c
[ 4330.684003]  [<ffffffff8102b85e>] smp_apic_timer_interrupt+0x2f/0x3c
[ 4330.684003]  [<ffffffff8153cbca>] apic_timer_interrupt+0x6a/0x70
[ 4330.684003]  <EOI> 
[ 4330.684003]  [<ffffffff8106a13c>] ? sched_clock_idle_sleep_event+0x11/0x13
[ 4330.684003]  [<ffffffff8100a7a3>] ? default_idle+0x1d/0x2f
[ 4330.684003]  [<ffffffff8100a7a1>] ? default_idle+0x1b/0x2f
[ 4330.684003]  [<ffffffff8100a290>] arch_cpu_idle+0x18/0x1d
[ 4330.684003]  [<ffffffff8107fb82>] cpu_startup_entry+0xd1/0x133
[ 4330.684003]  [<ffffffff8102a34d>] start_secondary+0x196/0x19b
[ 4330.684003] Code: 24 48 89 de 4c 89 ef 41 ff d6 5b 41 5c 41 5d 41 5e c9 c3 55 48 8b 17 48 89 e5 48 85 d2 75 0c 48 c7 07 01 00 00 00 e9 13 01 00 00 <48> 8b 02 a8 01 0f 85 08 01 00 00 48 8b 48 08 48 39 ca 74 66 48 
[ 4330.684003] RIP  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[ 4330.684003]  RSP <ffff88011fc83de8>
[ 4330.684003] CR2: 0000000000000040
[ 4330.684003] ---[ end trace 680f8979aa2ba0dc ]---
[ 4330.684003] Kernel panic - not syncing: Attempted to kill the idle task!
[ 4330.684003] Shutting down cpus with NMI
[ 4330.684003] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: perf_fuzzer: lockup/reboot bug
  2014-03-03 20:34 perf_fuzzer: lockup/reboot bug Vince Weaver
@ 2014-03-04 21:32 ` Vince Weaver
  2014-03-18 16:56   ` rb tree hrtimer lockup bug (found by perf_fuzzer) Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-04 21:32 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner


I patched the kernel with the cr2 save/restore bug to make sure this 
wasn't related to that problem.  It seems like it's not.

Somehow my perf_fuzzer can quickly cause the machine to lockup due to some 
sort of hrtimer queue corruption?  It's proving really hard to isolate 
this as the machine locks hard very quickly.

This is a core2 machine, 3.14-rc5

The code in question is:
/home/vince/research/linux-kernel/linux-2.6/lib/rbtree.c:89

   } else if (rb_is_black(parent))

=>	mov    (%rdx),%rax
	test   $0x1,%al
	jne    <rb_insert_color+0x12b>

Though it sometimes also crashes here instead:

/home/vince/research/linux-kernel/linux-2.6/lib/rbtree.c:94

 tmp = gparent->rb_right;

=>	mov    0x8(%rax),%rcx
	cmp    %rcx,%rdx
	je     ffffffff812a38e1 <rb_insert_color+0x92>



[  107.100035] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  107.109164] IP: [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[  107.129085] PGD 0 
[  107.129085] Oops: 0000 [#1] SMP 
[  107.129085] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 usbnet ohci_pci pcspkr i2c_nforce2 psmouse ohci_hcd serio_raw evdev coretemp wmi video button acpi_cpufreq processor thermal_sys ehci_pci sg ehci_hcd sd_mod usbcore usb_common
[  107.129085] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc5+ #33
[  107.129085] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[  107.129085] task: ffffffff81a11450 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[  107.129085] RIP: 0010:[<ffffffff812a3867>]  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[  107.129085] RSP: 0000:ffff88011fc03de8  EFLAGS: 00010002
[  107.129085] RAX: ffff880037dc77e0 RBX: ffff88011fc0da60 RCX: ffff880037dc0000
[  107.129085] RDX: 0000000000000040 RSI: ffff88011fc0d060 RDI: ffff880037dc77e0
[  107.129085] RBP: ffff88011fc03de8 R08: ffff88011fc03d98 R09: 0000000000000002
[  107.129085] R10: 0000000000000001 R11: ffffffff81c090a8 R12: ffff88011fc0d060
[  107.129085] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88011fc0d050
[  107.129085] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[  107.129085] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  107.129085] CR2: 0000000000000040 CR3: 0000000001a0c000 CR4: 00000000000407f0
[  107.129085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.129085] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  107.129085] Stack:
[  107.129085]  ffff88011fc03e08 ffffffff812a447e ffff88011fc0da60 ffff88011fc0d050
[  107.129085]  ffff88011fc03e38 ffffffff8105e022 0000000000000286 ffff88011fc0da60
[  107.129085]  0000001965e9bf00 0000000000000000 ffff88011fc03ed8 ffffffff8105e265
[  107.129085] Call Trace:
[  107.129085]  <IRQ> 
[  107.129085]  [<ffffffff812a447e>] timerqueue_add+0x7a/0x98
[  107.129085]  [<ffffffff8105e022>] enqueue_hrtimer+0x51/0x7c
[  107.129085]  [<ffffffff8105e265>] __hrtimer_start_range_ns+0x218/0x2ff
[  107.129085]  [<ffffffff8105e364>] hrtimer_start+0x18/0x1a
[  107.129085]  [<ffffffff81091885>] __tick_nohz_idle_enter+0x2ce/0x387
[  107.129085]  [<ffffffff81091962>] tick_nohz_irq_exit+0x24/0x26
[  107.129085]  [<ffffffff81044582>] irq_exit+0x95/0x9c
[  107.129085]  [<ffffffff8102bc9f>] smp_trace_apic_timer_interrupt+0x83/0x91
[  107.129085]  [<ffffffff8153cc3a>] trace_apic_timer_interrupt+0x6a/0x70
[  107.129085]  <EOI> 
[  107.129085]  [<ffffffff8106a13c>] ? sched_clock_idle_sleep_event+0x11/0x13
[  107.129085]  [<ffffffff8100a7a3>] ? default_idle+0x1d/0x2f
[  107.129085]  [<ffffffff8100a7a1>] ? default_idle+0x1b/0x2f
[  107.129085]  [<ffffffff8100a290>] arch_cpu_idle+0x18/0x1d
[  107.129085]  [<ffffffff8107fb82>] cpu_startup_entry+0xd1/0x133
[  107.129085]  [<ffffffff8152a1d3>] rest_init+0x77/0x79
[  107.129085]  [<ffffffff81abbf19>] start_kernel+0x3f0/0x3fd
[  107.129085]  [<ffffffff81abb95e>] ? repair_env_string+0x58/0x58
[  107.129085]  [<ffffffff81530ad5>] ? memblock_reserve+0x49/0x4e
[  107.129085]  [<ffffffff81abb47e>] x86_64_start_reservations+0x2a/0x2c
[  107.129085]  [<ffffffff81abb5c5>] x86_64_start_kernel+0x145/0x14c
[  107.129085] Code: 24 48 89 de 4c 89 ef 41 ff d6 5b 41 5c 41 5d 41 5e c9 c3 55 48 8b 17 48 89 e5 48 85 d2 75 0c 48 c7 07 01 00 00 00 e9 13 01 00 00 <48> 8b 02 a8 01 0f 85 08 01 00 00 48 8b 48 08 48 39 ca 74 66 48 
[  107.129085] RIP  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
[  107.129085]  RSP <ffff88011fc03de8>
[  107.129085] CR2: 0000000000000040
[  107.129085] ---[ end trace 05819cea8e48bcd9 ]---
[  107.129085] Kernel panic - not syncing: Attempted to kill the idle task!
[  107.129085] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

> [ 4330.676015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
> [ 4330.684003] IP: [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
> [ 4330.684003] PGD bd2e1067 PUD adffa067 PMD 0 
> [ 4330.684003] Oops: 0000 [#1] SMP 
> [ 4330.684003] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg acpi_cpufreq evdev mcs7830 usbnet coretemp psmouse serio_raw pcspkr video wmi processor button thermal_sys ohci_pci ohci_hcd i2c_nforce2 sg ehci_pci ehci_hcd sd_mod usbcore usb_common
> [ 4330.684003] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W    3.14.0-rc5 #32
> [ 4330.684003] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
> [ 4330.684003] task: ffff88011b2b37e0 ti: ffff88011b340000 task.ti: ffff88011b340000
> [ 4330.684003] RIP: 0010:[<ffffffff812a3867>]  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
> [ 4330.684003] RSP: 0018:ffff88011fc83de8  EFLAGS: 00010002
> [ 4330.684003] RAX: ffff8800cb3b0010 RBX: ffff88011fc8da60 RCX: ffff8800b799c000
> [ 4330.684003] RDX: 0000000000000040 RSI: ffff88011fc8d060 RDI: ffff8800cb3b0010
> [ 4330.684003] RBP: ffff88011fc83de8 R08: ffff88011fc8dbd0 R09: 0000000000000002
> [ 4330.684003] R10: 0000000000000001 R11: ffff88011b359028 R12: ffff88011fc8d060
> [ 4330.684003] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88011fc8d050
> [ 4330.684003] FS:  0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> [ 4330.684003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 4330.684003] CR2: 0000000000000040 CR3: 00000000adff8000 CR4: 00000000000407e0
> [ 4330.684003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000001cfd000
> [ 4330.684003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 000000000ff00628
> [ 4330.684003] Stack:
> [ 4330.684003]  ffff88011fc83e08 ffffffff812a447e ffff88011fc8da60 ffff88011fc8d050
> [ 4330.684003]  ffff88011fc83e38 ffffffff8105e022 ffff8800c7b27340 ffff88011fc8da60
> [ 4330.684003]  000003f57aac8f00 0000000000000000 ffff88011fc83ed8 ffffffff8105e265
> [ 4330.684003] Call Trace:
> [ 4330.684003]  <IRQ> 
> [ 4330.684003]  [<ffffffff812a447e>] timerqueue_add+0x7a/0x98
> [ 4330.684003]  [<ffffffff8105e022>] enqueue_hrtimer+0x51/0x7c
> [ 4330.684003]  [<ffffffff8105e265>] __hrtimer_start_range_ns+0x218/0x2ff
> [ 4330.684003]  [<ffffffff8105e364>] hrtimer_start+0x18/0x1a
> [ 4330.684003]  [<ffffffff81091885>] __tick_nohz_idle_enter+0x2ce/0x387
> [ 4330.684003]  [<ffffffff81091962>] tick_nohz_irq_exit+0x24/0x26
> [ 4330.684003]  [<ffffffff81044582>] irq_exit+0x95/0x9c
> [ 4330.684003]  [<ffffffff8102b85e>] smp_apic_timer_interrupt+0x2f/0x3c
> [ 4330.684003]  [<ffffffff8153cbca>] apic_timer_interrupt+0x6a/0x70
> [ 4330.684003]  <EOI> 
> [ 4330.684003]  [<ffffffff8106a13c>] ? sched_clock_idle_sleep_event+0x11/0x13
> [ 4330.684003]  [<ffffffff8100a7a3>] ? default_idle+0x1d/0x2f
> [ 4330.684003]  [<ffffffff8100a7a1>] ? default_idle+0x1b/0x2f
> [ 4330.684003]  [<ffffffff8100a290>] arch_cpu_idle+0x18/0x1d
> [ 4330.684003]  [<ffffffff8107fb82>] cpu_startup_entry+0xd1/0x133
> [ 4330.684003]  [<ffffffff8102a34d>] start_secondary+0x196/0x19b
> [ 4330.684003] Code: 24 48 89 de 4c 89 ef 41 ff d6 5b 41 5c 41 5d 41 5e c9 c3 55 48 8b 17 48 89 e5 48 85 d2 75 0c 48 c7 07 01 00 00 00 e9 13 01 00 00 <48> 8b 02 a8 01 0f 85 08 01 00 00 48 8b 48 08 48 39 ca 74 66 48 
> [ 4330.684003] RIP  [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
> [ 4330.684003]  RSP <ffff88011fc83de8>
> [ 4330.684003] CR2: 0000000000000040
> [ 4330.684003] ---[ end trace 680f8979aa2ba0dc ]---
> [ 4330.684003] Kernel panic - not syncing: Attempted to kill the idle task!
> [ 4330.684003] Shutting down cpus with NMI
> [ 4330.684003] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-04 21:32 ` Vince Weaver
@ 2014-03-18 16:56   ` Vince Weaver
  2014-03-18 18:21     ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-18 16:56 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner


The perf_fuzzer can quickly cause a machine to lockup with an hrtimer 
related rb tree related oops.  I've had a hard time debugging this in any 
useful manner, but I can trigger it on both core2 and haswell test systems 
on 3.14-rc7.

This involves making a large number of perf_event events of all types and 
then forking a lot.

As mentioned previously in this e-mail thread the errors look like this:
[ 4330.676015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 4330.684003] IP: [<ffffffff812a3867>] rb_insert_color+0x18/0x12d
or
[  232.642581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  232.646569] IP: [<ffffffff812a3872>] rb_insert_color+0x23/0x12d

or here's the most recent dump on the haswell machine:

[68075.490103] BUG: unable to handle kernel NULL pointer dereference at 0000000000000459
[68075.499728] IP: [<ffffffff812da7b9>] rb_erase+0xc9/0x3c0
[68075.506426] PGD 0 
[68075.509681] Oops: 0000 [#1] SMP 
[68075.514310] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc fuse x86_pkg_temp_thermal snd_hda_codec_realtek intel_powerclamp snd_hda_codec_hdmi snd_hda_codec_generic coretemp kvm crct10dif_pclmul iTCO_wdt crc32_pclmul ghash_clmulni_intel snd_hda_intel aesni_intel snd_hda_codec snd_hwdep snd_pcm i915 drm_kms_helper aes_x86_64 lrw gf128mul glue_helper drm iTCO_vendor_support ablk_helper snd_seq parport_pc i2c_algo_bit psmouse i2c_i801 tpm_tis mei_me cryptd snd_seq_device lpc_ich parport pcspkr i2c_core serio_raw tpm mfd_core mei snd_timer processor snd video wmi soundcore evdev battery button sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom hid_generic usbhid hid ahci libahci ehci_pci libata xhci_hcd ehci_hcd e1000e scsi_mod ptp usbcore crc32c_intel usb_common pps_core thermal fan thermal_sys
[68075.600313] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.14.0-rc7 #1
[68075.607917] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68075.616728] task: ffff880118e920d0 ti: ffff880118e9c000 task.ti: ffff880118e9c000
[68075.625592] RIP: 0010:[<ffffffff812da7b9>]  [<ffffffff812da7b9>] rb_erase+0xc9/0x3c0
[68075.634744] RSP: 0018:ffff880118e9de70  EFLAGS: 00010006
[68075.641239] RAX: ffff88011ea8ee00 RBX: ffff88011ea8ec60 RCX: 0000000000000000
[68075.649669] RDX: ffff880113406128 RSI: ffff88011ea8e170 RDI: 0000000000000459
[68075.658097] RBP: ffff88011ea8e170 R08: 0000000000000000 R09: 0000000000000000
[68075.666527] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000000
[68075.674971] R13: 0000000000000001 R14: ffff88011ea8ec60 R15: 0000000000000000
[68075.683374] FS:  0000000000000000(0000) GS:ffff88011ea80000(0000) knlGS:0000000000000000
[68075.692854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68075.699773] CR2: 0000000000000459 CR3: 000000000180e000 CR4: 00000000001407e0
[68075.708150] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68075.716557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68075.724924] Stack:
[68075.727819]  ffffffff812dc80e ffff88011ea8ec60 ffff88011ea8e160 ffffffff81082efe
[68075.736576]  0000000000000001 ffff88011ea8ec60 ffff88011ea8e160 ffff880118e9dfd8
[68075.745302]  ffff880118e9dfd8 ffffffff810832fc 0000000000000086 ffff88011ea8ec60
[68075.754063] Call Trace:
[68075.757409]  [<ffffffff812dc80e>] ? timerqueue_del+0x1e/0x60
[68075.764238]  [<ffffffff81082efe>] ? __remove_hrtimer+0x3e/0x90
[68075.771192]  [<ffffffff810832fc>] ? hrtimer_try_to_cancel+0x8c/0xd0
[68075.778621]  [<ffffffff8108335a>] ? hrtimer_cancel+0x1a/0x30
[68075.785365]  [<ffffffff810c4a0e>] ? tick_nohz_idle_exit+0x8e/0x1a0
[68075.792653]  [<ffffffff810b046b>] ? cpu_startup_entry+0x1eb/0x230
[68075.799856] Code: 40 00 48 8b 3a 48 89 0a 83 e7 01 0f 84 11 01 00 00 48 85 c0 0f 84 08 01 00 00 31 c9 eb 47 0f 1f 40 00 48 8b 7a 08 48 85 ff 74 09 <f6> 07 01 0f 84 d7 01 00 00 48 8b 4a 10 48 85 c9 74 09 f6 01 01 
[68075.822929] RIP  [<ffffffff812da7b9>] rb_erase+0xc9/0x3c0
[68075.829396]  RSP <ffff880118e9de70>
[68075.833838] CR2: 0000000000000459
[68096.514495] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 2} (detected by 7, t=5252 jiffies, g=452763, c=452762, q=27)
[68096.527278] sending NMI to all CPUs:
[68096.531796] NMI backtrace for cpu 2
[68096.536772] CPU: 2 PID: 18 Comm: ksoftirqd/2 Not tainted 3.14.0-rc7 #1
[68096.545020] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68096.554197] task: ffff880118f0f3b0 ti: ffff880118f90000 task.ti: ffff880118f90000
[68096.563497] RIP: 0010:[<ffffffff81510a03>]  [<ffffffff81510a03>] _raw_spin_lock+0x23/0x30
[68096.573567] RSP: 0018:ffff88011ea83f38  EFLAGS: 00000093
[68096.580540] RAX: 00000000000044b8 RBX: ffff88011ea94340 RCX: 00000000000044ba
[68096.589462] RDX: 00000000000044ba RSI: 0000000000000002 RDI: ffff88011ea8e120
[68096.598358] RBP: ffff88011ea8e120 R08: ffff880118f90000 R09: ffffffff81659a40
[68096.607221] R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff88011778b480
[68096.616089] R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000003
[68096.624935] FS:  0000000000000000(0000) GS:ffff88011ea80000(0000) knlGS:0000000000000000
[68096.634839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68096.642219] CR2: 0000000000000459 CR3: 000000000180e000 CR4: 00000000001407e0
[68096.651110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68096.659983] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68096.668893] Stack:
[68096.672344]  ffffffff81083ada 0000000200000200 0000000101026f07 ffff880118e9dfd8
[68096.681704]  ffff88011ea8e680 0000000000000046 ffff88011ea94340 0000000000000000
[68096.691055]  ffff88011778b480 0000000000000000 00000000ffffffff ffff880118f0f3b0
[68096.700405] Call Trace:
[68096.704358]  <IRQ> 
[68096.706438]  [<ffffffff81083ada>] ? hrtimer_interrupt+0x4a/0x220
[68096.715650]  [<ffffffff81042adc>] ? smp_trace_apic_timer_interrupt+0x3c/0xc0
[68096.724497]  [<ffffffff815199dd>] ? trace_apic_timer_interrupt+0x6d/0x80
[68096.732944]  <EOI> 
[68096.735035]  [<ffffffff81089491>] ? finish_task_switch+0x41/0xf0
[68096.744172]  [<ffffffff810894f0>] ? finish_task_switch+0xa0/0xf0
[68096.751753]  [<ffffffff8150d606>] ? __schedule+0x2a6/0x6e0
[68096.758755]  [<ffffffff81086e56>] ? smpboot_thread_fn+0xc6/0x1a0
[68096.766274]  [<ffffffff81086d90>] ? SyS_setgroups+0x180/0x180
[68096.773563]  [<ffffffff81080811>] ? kthread+0xc1/0xe0
[68096.780091]  [<ffffffff81080750>] ? kthread_create_on_node+0x180/0x180
[68096.788182]  [<ffffffff81518b4c>] ? ret_from_fork+0x7c/0xb0
[68096.795225]  [<ffffffff81080750>] ? kthread_create_on_node+0x180/0x180
[68096.803272] Code: 0f 1f 84 00 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 89 d1 75 01 c3 0f b7 07 66 39 d0 74 f7 f3 90 0f b7 07 <66> 39 c8 75 f6 c3 0f 1f 80 00 00 00 00 fa 66 0f 1f 44 00 00 b8 
[68096.827145] NMI backtrace for cpu 7
[68096.827146] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 295.349 msecs
[68096.843512] CPU: 7 PID: 28456 Comm: perf_fuzzer Not tainted 3.14.0-rc7 #1
[68096.851934] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68096.861034] task: ffff8801128cb2f0 ti: ffff8800d058a000 task.ti: ffff8800d058a000
[68096.870215] RIP: 0010:[<ffffffff812e050f>]  [<ffffffff812e050f>] delay_tsc+0x2f/0x70
[68096.879746] RSP: 0000:ffff88011ebc3e30  EFLAGS: 00000046
[68096.886640] RAX: 0000000000000007 RBX: 00000000754b1baf RCX: 00000000019e1478
[68096.895523] RDX: 00000000754b1bc7 RSI: 0000000000000200 RDI: 00000000003358a8
[68096.904356] RBP: 00000000003358a8 R08: 0000000000000000 R09: 00000000000023f2
[68096.913202] R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000007
[68096.922017] R13: ffffffff818b1fc0 R14: 000000000000001b R15: ffffffff818411c0
[68096.930842] FS:  00007fd6a1db6700(0000) GS:ffff88011ebc0000(0000) knlGS:0000000000000000
[68096.940671] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68096.948029] CR2: 000000000040a570 CR3: 00000000cfaab000 CR4: 00000000001407e0
[68096.956832] DR0: 0000000000000000 DR1: 0000000001068000 DR2: 0000000001068000
[68096.965664] DR3: 0000000001068000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68096.974481] Stack:
[68096.977811]  0000000000002710 ffffffff818411c0 0000000000000007 ffffffff81044532
[68096.987022]  ffff88011ebce820 ffffffff810baa30 0000000000000002 0000000000000083
[68096.996215]  0000000200000082 ffff8801128cb2f0 0000000000000001 0000000000000007
[68097.005394] Call Trace:
[68097.009221]  <IRQ> 
[68097.011324]  [<ffffffff81044532>] ? arch_trigger_all_cpu_backtrace+0xb2/0xd0
[68097.021399]  [<ffffffff810baa30>] ? rcu_check_callbacks+0x600/0x620
[68097.029307]  [<ffffffff810c4890>] ? tick_sched_handle.isra.16+0x60/0x60
[68097.037547]  [<ffffffff8106d65b>] ? update_process_times+0x3b/0x60
[68097.045298]  [<ffffffff810c484b>] ? tick_sched_handle.isra.16+0x1b/0x60
[68097.053535]  [<ffffffff810c48c7>] ? tick_sched_timer+0x37/0x60
[68097.060857]  [<ffffffff810833d2>] ? __run_hrtimer+0x62/0x1b0
[68097.067983]  [<ffffffff81083b79>] ? hrtimer_interrupt+0xe9/0x220
[68097.075518]  [<ffffffff81042adc>] ? smp_trace_apic_timer_interrupt+0x3c/0xc0
[68097.084170]  [<ffffffff815199dd>] ? trace_apic_timer_interrupt+0x6d/0x80
[68097.092479]  <EOI> 
[68097.094538] Code: fd 53 65 44 8b 24 25 c4 b0 00 00 66 66 90 0f ae e8 e8 b6 a8 d3 ff 66 90 89 c3 eb 0f f3 90 65 8b 04 25 c4 b0 00 00 41 39 c4 75 1a <66> 66 90 0f ae e8 e8 96 a8 d3 ff 66 90 89 c2 29 d8 39 e8 72 dc 
[68097.119697] NMI backtrace for cpu 0
[68097.119699] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 587.901 msecs
[68097.136033] CPU: 0 PID: 26910 Comm: perf_fuzzer Not tainted 3.14.0-rc7 #1
[68097.144438] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68097.153497] task: ffff880113046960 ti: ffff880036b04000 task.ti: ffff880036b04000
[68097.162663] RIP: 0010:[<ffffffff81510a77>]  [<ffffffff81510a77>] _raw_spin_lock_irqsave+0x37/0x40
[68097.173395] RSP: 0018:ffff880036b05a60  EFLAGS: 00000002
[68097.180263] RAX: 00000000000044b9 RBX: ffff88011ea8e160 RCX: 00000000000044b9
[68097.189094] RDX: 00000000000044b8 RSI: 0000000000000286 RDI: ffff88011ea8e120
[68097.197938] RBP: ffff880036b05ae0 R08: ffff880036b04000 R09: 0000000000000000
[68097.206773] R10: ffff880118e69978 R11: 0000000000005cfc R12: ffff880036b05a88
[68097.215606] R13: ffff880118075364 R14: ffff880118075364 R15: 0000000000000000
[68097.224444] FS:  00007fd6a1db6700(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000
[68097.234320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68097.241654] CR2: 000000000069d5b8 CR3: 00000000ce667000 CR4: 00000000001407f0
[68097.250504] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68097.259331] DR3: 0000000001068000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68097.268152] Stack:
[68097.271519]  ffffffff8108320b ffff880036b05ba8 ffff880036b05ab0 0000000000000000
[68097.280793]  ffffffff81083287 0000000000000000 ffff880036b05ba8 000000000000c350
[68097.290003]  0000000000000000 ffffffff8150d192 ffff880036b05ab0 ffff88011ea8ec60
[68097.299240] Call Trace:
[68097.303071]  [<ffffffff8108320b>] ? lock_hrtimer_base.isra.19+0x1b/0x40
[68097.311383]  [<ffffffff81083287>] ? hrtimer_try_to_cancel+0x17/0xd0
[68097.319347]  [<ffffffff8150d192>] ? schedule_hrtimeout_range_clock+0xb2/0x140
[68097.328203]  [<ffffffff81083170>] ? hrtimer_get_res+0x40/0x40
[68097.335544]  [<ffffffff8150d16f>] ? schedule_hrtimeout_range_clock+0x8f/0x140
[68097.344354]  [<ffffffff81198ce0>] ? poll_schedule_timeout+0x40/0x60
[68097.352215]  [<ffffffff8119a232>] ? do_sys_poll+0x422/0x540
[68097.359345]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.367843]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.376327]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.384773]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.393211]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.401624]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.410057]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.418498]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.426930]  [<ffffffff81198e60>] ? poll_select_copy_remaining+0x130/0x130
[68097.435322]  [<ffffffff8119a408>] ? SyS_poll+0x58/0xf0
[68097.441857]  [<ffffffff81518bfd>] ? system_call_fastpath+0x1a/0x1f
[68097.449501] Code: 0f 1f 44 00 00 ba 00 00 01 00 f0 0f c1 17 89 d1 c1 e9 10 66 39 d1 89 c8 75 04 48 89 f0 c3 0f b7 17 66 39 d1 74 f4 f3 90 0f b7 17 <66> 39 d0 75 f6 eb e8 90 90 41 55 48 39 f7 41 54 49 89 d4 55 48 
[68097.473168] NMI backtrace for cpu 4
[68097.473169] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 941.372 msecs
[68097.489525] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.14.0-rc7 #1
[68097.497414] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68097.506532] task: ffff880118e9e9e0 ti: ffff880118ea2000 task.ti: ffff880118ea2000
[68097.515682] RIP: 0010:[<ffffffff8133d620>]  [<ffffffff8133d620>] intel_idle+0xd0/0x160
[68097.525401] RSP: 0000:ffff880118ea3e70  EFLAGS: 00000046
[68097.532220] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[68097.541064] RDX: 0000000000000000 RSI: ffffffff81876880 RDI: 0000000000000004
[68097.549884] RBP: 0000000000000005 R08: ffff88011eb11194 R09: 0000000000000018
[68097.558707] R10: 00000000001e8429 R11: 00000000002be43c R12: 0000000000000032
[68097.567530] R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81876a50
[68097.576342] FS:  0000000000000000(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
[68097.586234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68097.593584] CR2: 000000000040a570 CR3: 000000000180e000 CR4: 00000000001407e0
[68097.602457] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000001068000
[68097.611265] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68097.620084] Stack:
[68097.623359]  000000048101adf5 ffff88011eb1a500 ffff880118ea3ec8 ffffffff81876880
[68097.632509]  00003de81e705915 ffffffff813f0b5b 00000000fffffff0 0000000000000005
[68097.641678]  0000000000000004 ffffffff81876880 0000000000000000 ffff88011eb1a500
[68097.650843] Call Trace:
[68097.654603]  [<ffffffff813f0b5b>] ? cpuidle_enter_state+0x3b/0xc0
[68097.662291]  [<ffffffff813f0c89>] ? cpuidle_idle_call+0xa9/0x1d0
[68097.669888]  [<ffffffff8101ca95>] ? arch_cpu_idle+0x5/0x30
[68097.676904]  [<ffffffff810b0315>] ? cpu_startup_entry+0x95/0x230
[68097.684442] Code: c8 00 00 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 65 48 8b 04 25 30 c8 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <65> 48 8b 04 25 30 c8 00 00 83 a0 3c e0 ff ff fb 0f ae f0 65 48 
[68097.708259] NMI backtrace for cpu 6
[68097.708261] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1176.462 msecs
[68097.724610] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 3.14.0-rc7 #1
[68097.732503] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68097.741595] task: ffff880118eab2f0 ti: ffff880118eac000 task.ti: ffff880118eac000
[68097.750837] RIP: 0010:[<ffffffff8133d620>]  [<ffffffff8133d620>] intel_idle+0xd0/0x160
[68097.760583] RSP: 0000:ffff880118eade70  EFLAGS: 00000046
[68097.767471] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[68097.776322] RDX: 0000000000000000 RSI: ffffffff81876880 RDI: 0000000000000006
[68097.785115] RBP: 0000000000000005 R08: ffff88011eb91184 R09: 0000000000000014
[68097.793912] R10: 00000000003d08e8 R11: 00000000003d091f R12: 0000000000000032
[68097.802781] R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81876a50
[68097.811514] FS:  0000000000000000(0000) GS:ffff88011eb80000(0000) knlGS:0000000000000000
[68097.821324] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68097.828660] CR2: 000000000040a570 CR3: 000000000180e000 CR4: 00000000001407e0
[68097.837554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000001068000
[68097.846326] DR3: 0000000001068000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68097.855136] Stack:
[68097.858454]  000000068101adf5 ffff88011eb9a500 ffff880118eadec8 ffffffff81876880
[68097.867643]  00003de774a97afa ffffffff813f0b5b 00000000fffffff0 0000000000000005
[68097.876865]  0000000000000006 ffffffff81876880 0000000000000000 ffff88011eb9a500
[68097.886026] Call Trace:
[68097.889806]  [<ffffffff813f0b5b>] ? cpuidle_enter_state+0x3b/0xc0
[68097.897621]  [<ffffffff813f0c89>] ? cpuidle_idle_call+0xa9/0x1d0
[68097.905233]  [<ffffffff8101ca95>] ? arch_cpu_idle+0x5/0x30
[68097.912283]  [<ffffffff810b0315>] ? cpu_startup_entry+0x95/0x230
[68097.919959] Code: c8 00 00 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 65 48 8b 04 25 30 c8 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <65> 48 8b 04 25 30 c8 00 00 83 a0 3c e0 ff ff fb 0f ae f0 65 48 
[68097.943945] NMI backtrace for cpu 3
[68097.943947] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1412.149 msecs
[68097.960436] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.14.0-rc7 #1
[68097.968391] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68097.977590] task: ffff880118e9f2b0 ti: ffff880118ea0000 task.ti: ffff880118ea0000
[68097.986867] RIP: 0010:[<ffffffff8133d620>]  [<ffffffff8133d620>] intel_idle+0xd0/0x160
[68097.996682] RSP: 0018:ffff880118ea1e70  EFLAGS: 00000046
[68098.003583] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[68098.012479] RDX: 0000000000000000 RSI: ffffffff81876880 RDI: 0000000000000003
[68098.021334] RBP: 0000000000000005 R08: ffff88011ead1184 R09: 0000000000000018
[68098.030208] R10: 00000000003d08e8 R11: 00000000007a1206 R12: 0000000000000032
[68098.039026] R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81876a50
[68098.047812] FS:  0000000000000000(0000) GS:ffff88011eac0000(0000) knlGS:0000000000000000
[68098.057638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68098.064946] CR2: 000000000069d488 CR3: 000000000180e000 CR4: 00000000001407e0
[68098.073771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68098.082621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68098.091433] Stack:
[68098.094770]  000000038101adf5 ffff88011eada500 ffff880118ea1ec8 ffffffff81876880
[68098.103978]  00003de7724722e3 ffffffff813f0b5b 00000000fffffff0 0000000000000005
[68098.113216]  0000000000000003 ffffffff81876880 0000000000000000 ffff88011eada500
[68098.122417] Call Trace:
[68098.126227]  [<ffffffff813f0b5b>] ? cpuidle_enter_state+0x3b/0xc0
[68098.134001]  [<ffffffff813f0c89>] ? cpuidle_idle_call+0xa9/0x1d0
[68098.141643]  [<ffffffff8101ca95>] ? arch_cpu_idle+0x5/0x30
[68098.148717]  [<ffffffff810b0315>] ? cpu_startup_entry+0x95/0x230
[68098.156312] Code: c8 00 00 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 65 48 8b 04 25 30 c8 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <65> 48 8b 04 25 30 c8 00 00 83 a0 3c e0 ff ff fb 0f ae f0 65 48 
[68098.180354] NMI backtrace for cpu 1
[68098.180355] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1648.557 msecs
[68098.197010] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc7 #1
[68098.204971] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68098.214192] task: ffff880118e929a0 ti: ffff880118e98000 task.ti: ffff880118e98000
[68098.223499] RIP: 0010:[<ffffffff8133d620>]  [<ffffffff8133d620>] intel_idle+0xd0/0x160
[68098.233299] RSP: 0018:ffff880118e99e70  EFLAGS: 00000046
[68098.240237] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[68098.249115] RDX: 0000000000000000 RSI: ffffffff81876880 RDI: 0000000000000001
[68098.257974] RBP: 0000000000000005 R08: ffff88011ea5118c R09: 0000000000001ecf
[68098.266859] R10: 00000000ffffffff R11: 0000000000007cb3 R12: 0000000000000032
[68098.275703] R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81876a50
[68098.284578] FS:  0000000000000000(0000) GS:ffff88011ea40000(0000) knlGS:0000000000000000
[68098.294484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68098.301820] CR2: 00007fff3a0283f8 CR3: 000000000180e000 CR4: 00000000001407e0
[68098.310655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68098.319509] DR3: 0000000001068000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68098.328369] Stack:
[68098.331737]  000000018101adf5 ffff88011ea5a500 ffff880118e99ec8 ffffffff81876880
[68098.340937]  00003de8450a9810 ffffffff813f0b5b 00000000fffffff0 0000000000000005
[68098.350202]  0000000000000001 ffffffff81876880 0000000000000000 ffff88011ea5a500
[68098.359447] Call Trace:
[68098.363268]  [<ffffffff813f0b5b>] ? cpuidle_enter_state+0x3b/0xc0
[68098.370997]  [<ffffffff813f0c89>] ? cpuidle_idle_call+0xa9/0x1d0
[68098.378687]  [<ffffffff8101ca95>] ? arch_cpu_idle+0x5/0x30
[68098.385753]  [<ffffffff810b0315>] ? cpu_startup_entry+0x95/0x230
[68098.393393] Code: c8 00 00 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 65 48 8b 04 25 30 c8 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <65> 48 8b 04 25 30 c8 00 00 83 a0 3c e0 ff ff fb 0f ae f0 65 48 
[68098.417424] NMI backtrace for cpu 5
[68098.417426] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1885.605 msecs
[68098.433041] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.14.0-rc7 #1
[68098.440461] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[68098.449101] task: ffff880118e9e110 ti: ffff880118ea6000 task.ti: ffff880118ea6000
[68098.457795] RIP: 0010:[<ffffffff8133d620>]  [<ffffffff8133d620>] intel_idle+0xd0/0x160
[68098.466996] RSP: 0000:ffff880118ea7e70  EFLAGS: 00000046
[68098.473382] RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
[68098.481728] RDX: 0000000000000000 RSI: ffffffff81876880 RDI: 0000000000000005
[68098.490039] RBP: 0000000000000005 R08: ffff88011eb5118c R09: 0000000000001ecf
[68098.498346] R10: 00000000ffffffff R11: 0000000000007cc6 R12: 0000000000000032
[68098.506646] R13: 0000000000000004 R14: 0000000000000005 R15: ffffffff81876a50
[68098.514922] FS:  0000000000000000(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
[68098.524237] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68098.531049] CR2: 000000000040a570 CR3: 000000000180e000 CR4: 00000000001407e0
[68098.539336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000001068000
[68098.547621] DR3: 0000000001068000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[68098.555962] Stack:
[68098.558809]  000000058101adf5 ffff88011eb5a500 ffff880118ea7ec8 ffffffff81876880
[68098.567491]  00003de84918106f ffffffff813f0b5b 00000000fffffff0 0000000000000005
[68098.576179]  0000000000000005 ffffffff81876880 0000000000000000 ffff88011eb5a500
[68098.584862] Call Trace:
[68098.588160]  [<ffffffff813f0b5b>] ? cpuidle_enter_state+0x3b/0xc0
[68098.595371]  [<ffffffff813f0c89>] ? cpuidle_idle_call+0xa9/0x1d0
[68098.602509]  [<ffffffff8101ca95>] ? arch_cpu_idle+0x5/0x30
[68098.609060]  [<ffffffff810b0315>] ? cpu_startup_entry+0x95/0x230
[68098.616137] Code: c8 00 00 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 65 48 8b 04 25 30 c8 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <65> 48 8b 04 25 30 c8 00 00 83 a0 3c e0 ff ff fb 0f ae f0 65 48 
[68098.639064] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 2107.244 msecs


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 16:56   ` rb tree hrtimer lockup bug (found by perf_fuzzer) Vince Weaver
@ 2014-03-18 18:21     ` Thomas Gleixner
  2014-03-18 19:25       ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-18 18:21 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Vince Weaver wrote:

> 
> The perf_fuzzer can quickly cause a machine to lockup with an hrtimer 
> related rb tree related oops.  I've had a hard time debugging this in any 
> useful manner, but I can trigger it on both core2 and haswell test systems 
> on 3.14-rc7.
> 
> This involves making a large number of perf_event events of all types and 
> then forking a lot.

Can you enable debugobjects please? The should give us an hint what
corrupts the rbtree.

Thanks,

	tglx

 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 18:21     ` Thomas Gleixner
@ 2014-03-18 19:25       ` Vince Weaver
  2014-03-18 20:52         ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-18 19:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Thomas Gleixner wrote:

> On Tue, 18 Mar 2014, Vince Weaver wrote:
> 
> > 
> > The perf_fuzzer can quickly cause a machine to lockup with an hrtimer 
> > related rb tree related oops.  I've had a hard time debugging this in any 
> > useful manner, but I can trigger it on both core2 and haswell test systems 
> > on 3.14-rc7.
> > 
> > This involves making a large number of perf_event events of all types and 
> > then forking a lot.
> 
> Can you enable debugobjects please? The should give us an hint what
> corrupts the rbtree.

I enabled debugobjects and then said Y to most of the questions brought up 
by make oldconfig but now the system crashes at boot:

[    3.678040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018                                                                        
[    3.686776] IP: [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250    
[    3.694289] PGD 0                                                            
[    3.696642] Oops: 0000 [#1] SMP                                              
[    3.700394] Modules linked in: sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom hid_generic usbhid hid ahci e1000e libahci ehci_pci ptp ehci_hcd xhci_hcd libata pps_core usbcore crc32c_intel scsi_mod usb_common fan thermal thermal_sys 
[    3.725377] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc7 #2          
[    3.732217] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014                                                                             
[    3.740296] task: ffff880118e989a0 ti: ffff880118e9e000 task.ti: ffff880118e9e000                                                                            
[    3.748447] RIP: 0010:[<ffffffff8106d7a8>]  [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250                                                        
[    3.758601] RSP: 0018:ffff880118e9fe58  EFLAGS: 00010017                     
[    3.764413] RAX: 0000000000000000 RBX: 000000013ffede62 RCX: 0000000000000000
[    3.772162] RDX: 0000000000000000 RSI: ffff880118ecd228 RDI: 0000000000fffedf
[    3.779863] RBP: ffff880118e9fea0 R08: 0000000000000001 R09: 0000000000000020
[    3.787553] R10: 000000000000001f R11: ffff880118ecd028 R12: ffff880118ecc000
[    3.795295] R13: 00000000fffede63 R14: ffff880118e9fe60 R15: ffff880118e9fe78
[    3.803003] FS:  0000000000000000(0000) GS:ffff88011ea40000(0000) knlGS:0000000000000000                                                                     
[    3.811760] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                
[    3.818042] CR2: 0000000000000018 CR3: 000000000180e000 CR4: 00000000001407e0
[    3.825772] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    3.833506] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    3.841257] Stack:                                                           
[    3.843536]  ffff880118ecd028 ffff880118ecd428 ffff880118ecd828 ffff880118ecdc28                                                                             
[    3.851967]  0000000000000001 00000000cc902a00 ffff88011ea4de00 0000000000000000                                                                             
[    3.860406]  ffff88011ea4eda0 00000000cc91c17c ffffffff810c5525 00000000fffede63                                                                             
[    3.868786] Call Trace:                                                      
[    3.871512]  [<ffffffff810c5525>] ? __tick_nohz_idle_enter+0x2c5/0x460       
[    3.878634]  [<ffffffff810c56f4>] ? tick_nohz_idle_enter+0x34/0x60           
[    3.885374]  [<ffffffff810b089e>] ? cpu_startup_entry+0x3e/0x230             
[    3.891895] Code: 24 18 41 89 fa 41 83 e2 3f 45 89 d1 0f 1f 80 00 00 00 00 49 63 f1 48 c1 e6 04 4c 01 de 48 8b 06 48 39 f0 74 25 66 0f 1f 44 00 00 <f6> 40 18 01 75 11 48 8b 48 10 41 b8 01 00 00 00 48 39 d1 48 0f                          
[    3.918182] RIP  [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250   
[    3.925697]  RSP <ffff880118e9fe58>                                          
[    3.929514] CR2: 0000000000000018                                            
[    3.933151] ---[ end trace aff36205690b9b9e ]---                             
[    3.938191] Kernel panic - not syncing: Attempted to kill the idle task!     
[    3.945483] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

this is a haswell system, 3.14-rc7

Vince

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 19:25       ` Vince Weaver
@ 2014-03-18 20:52         ` Thomas Gleixner
  2014-03-18 21:10           ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-18 20:52 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Vince Weaver wrote:
> On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> 
> > On Tue, 18 Mar 2014, Vince Weaver wrote:
> > 
> > > 
> > > The perf_fuzzer can quickly cause a machine to lockup with an hrtimer 
> > > related rb tree related oops.  I've had a hard time debugging this in any 
> > > useful manner, but I can trigger it on both core2 and haswell test systems 
> > > on 3.14-rc7.
> > > 
> > > This involves making a large number of perf_event events of all types and 
> > > then forking a lot.
> > 
> > Can you enable debugobjects please? The should give us an hint what
> > corrupts the rbtree.
> 
> I enabled debugobjects and then said Y to most of the questions brought up 
> by make oldconfig but now the system crashes at boot:

> [    3.678040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018                                                                        
> [    3.686776] IP: [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250    

The back trace is not really helpful. Is there anything before that
BUG in dmesg?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 20:52         ` Thomas Gleixner
@ 2014-03-18 21:10           ` Vince Weaver
  2014-03-18 21:45             ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-18 21:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> 
> > [    3.678040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018                                                                        
> > [    3.686776] IP: [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250    
> 
> The back trace is not really helpful. Is there anything before that
> BUG in dmesg?

no, just normal boot messages.

I re-compiled with the frame pointer enabled and here's an updated 
version:

[    3.722836] sd 1:0:0:0: [sda] 4096-byte physical blocks                      
[    3.728682] sd 1:0:0:0: [sda] Write Protect is off                           
[    3.728826] sr 0:0:0:0: Attached scsi generic sg0 type 5                     
[    3.728992] sd 1:0:0:0: Attached scsi generic sg1 type 0                     
[    3.745700] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA                                                          
[    3.787096]  sda: sda1 sda2 sda3 sda4                                        
[    3.792169] sd 1:0:0:0: [sda] Attached SCSI disk                             
[    4.312111] ------------[ cut here ]------------                             
[    4.317176] kernel BUG at kernel/timer.c:1084!                               
[    4.322093] invalid opcode: 0000 [#1] SMP                                    
[    4.326856] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci ehci_pci libata e1000e xhci_hcd ehci_hcd ptp scsi_mod crc32c_intel usbcore pps_core usb_common fan thermal thermal_sys 
[    4.352168] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc7 #3          
[    4.359034] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014                                                                             
[    4.367138] task: ffff880118e989a0 ti: ffff880118e9e000 task.ti: ffff880118e9e000                                                                            
[    4.375353] RIP: 0010:[<ffffffff8106d073>]  [<ffffffff8106d073>] cascade+0x93/0xa0                                                                           
[    4.383766] RSP: 0018:ffff88011ea43e78  EFLAGS: 00010086                     
[    4.389555] RAX: 0000000000000000 RBX: ffff88011839de60 RCX: ffff880118ecdc28
[    4.397324] RDX: 000000000000001f RSI: ffff8801183a8150 RDI: ffff880118ecc000
Begin: Loading essential drivers ... done.                                                                      
[    4.405232] RBP: ffff88011ea43ea8 R08: ffe20047a9380000 R09: 0000000000000000                                                                
[    4.414260] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880118ecc000                                                                
Begin: Running /scripts/init-premount ... done.                                                                     
[    4.423278] R13: ffff88011ea43e78 R14: 000000000000001f R15: ffffffff81876e10                                                                            
[    4.432348] FS:  0000000000000000(0000) GS:ffff88011ea40000(0000) knlGS:0000000000000000                                                     
[    4.442441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Begin: Mounting root file system ...
[    4.450013] CR2: 0000000000e762c8 CR3: 000000000180e000 CR4: 00000000001407e0                                                                             
[    4.459072] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000                                                                
[    4.468156] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400                                                                
[    4.477197] Stack:                                           
Begin: Running /scripts/local-top ... done.                   
[    4.480741]  ffff8801183a8150 ffff88011839d460 ffff880118ecc0000000000000000000
Begin: Running /scripts/local-pre mount ... done.                                                             
[    4.495754]  0000000000000001 0000000000000100 ffff88011ea43f18 ffffffff8106e15c                                                                             
[    4.504269]  ffff880118ecdc28 ffff880118ecd828 ffff880118ecd428 ffff880118ecd028                                                                             
[    4.512741] Call Trace:                                                      
[    4.515466]  <IRQ>                                                           
[    4.517617]  [<ffffffff8106e15c>] run_timer_softirq+0x21c/0x2a0              
[    4.524524]  [<ffffffff810c8716>] ? clockevents_program_event+0x66/0xe0      
[    4.531800]  [<ffffffff81066c15>] __do_softirq+0xf5/0x290                    
[    4.537734]  [<ffffffff81066fdd>] irq_exit+0x9d/0xb0                         
[    4.543211]  [<ffffffff8104440f>] smp_apic_timer_interrupt+0x3f/0x50         
[    4.550147]  [<ffffffff81545f9d>] apic_timer_interrupt+0x6d/0x80             
[    4.556729]  <EOI>                                                           
[    4.558894]  [<ffffffff81413ffd>] ? cpuidle_enter_state+0x4d/0xc0            
[    4.565969]  [<ffffffff81414124>] cpuidle_idle_call+0xb4/0x1f0               
[    4.572381]  [<ffffffff8101cfc9>] arch_cpu_idle+0x9/0x30                     
[    4.578239]  [<ffffffff810b4f05>] cpu_startup_entry+0x95/0x240               
[    4.584608]  [<ffffffff810426f0>] start_secondary+0x1a0/0x1f0                
[    4.590924] Code: 49 39 cc 75 26 48 89 de 48 89 c3 4c 89 e7 e8 15 f7 ff ff 4c 39 eb 48 8b 03 75 dd 48 83 c4 10 44 89 f0 5b 41 5c 41 5d 41 5e 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 31 c0 48 89 e5 41 56                          
[    4.617262] RIP  [<ffffffff8106d073>] cascade+0x93/0xa0                      
[    4.623091]  RSP <ffff88011ea43e78>                                          
[    4.626976] ---[ end trace dd695d8667870673 ]---                             
[    4.632059] Kernel panic - not syncing: Fatal exception in interrupt         
[    4.639037] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

Vince

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 21:10           ` Vince Weaver
@ 2014-03-18 21:45             ` Thomas Gleixner
  2014-03-19 13:46               ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-18 21:45 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Vince Weaver wrote:

> On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> > 
> > > [    3.678040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018                                                                        
> > > [    3.686776] IP: [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250    
> > 
> > The back trace is not really helpful. Is there anything before that
> > BUG in dmesg?
> 
> no, just normal boot messages.
> 
> I re-compiled with the frame pointer enabled and here's an updated 
> version:
> 
> [    3.722836] sd 1:0:0:0: [sda] 4096-byte physical blocks                      
> [    3.728682] sd 1:0:0:0: [sda] Write Protect is off                           
> [    3.728826] sr 0:0:0:0: Attached scsi generic sg0 type 5                     
> [    3.728992] sd 1:0:0:0: Attached scsi generic sg1 type 0                     
> [    3.745700] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA                                                          
> [    3.787096]  sda: sda1 sda2 sda3 sda4                                        
> [    3.792169] sd 1:0:0:0: [sda] Attached SCSI disk                             
> [    4.312111] ------------[ cut here ]------------                             
> [    4.317176] kernel BUG at kernel/timer.c:1084!                               

Wow. That's in the cascade code. So it looks like the timer got
corrupted without using the timer api wrong. That might be a memset or
a stray pointer. Can you apply the following patch?

That really should give us a hint.

If that fails, enable the tracer in your config and add the following
to your commandline

   trace_event=timer_* ftrace_dump_on_oops

That will spill out a lot of crap over serial, but it should tell us
which timer is corrupted.

Thanks,

	tglx

Index: linux-2.6/include/linux/debugobjects.h
===================================================================
--- linux-2.6.orig/include/linux/debugobjects.h
+++ linux-2.6/include/linux/debugobjects.h
@@ -68,6 +68,7 @@ extern void debug_object_deactivate(void
 extern void debug_object_destroy   (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_free      (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_assert_init(void *addr, struct debug_obj_descr *descr);
+extern void debug_object_info(void *addr, struct debug_obj_descr *descr);
 
 /*
  * Active state:
@@ -95,6 +96,8 @@ static inline void
 debug_object_free      (void *addr, struct debug_obj_descr *descr) { }
 static inline void
 debug_object_assert_init(void *addr, struct debug_obj_descr *descr) { }
+static inline void
+debug_object_info(void *addr, struct debug_obj_descr *descr) { }
 
 static inline void debug_objects_early_init(void) { }
 static inline void debug_objects_mem_init(void) { }
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -1081,7 +1081,12 @@ static int cascade(struct tvec_base *bas
 	 * don't have to detach them individually.
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
+		if (tbase_get_base(timer->base) != base) {
+			pr_err("Invalid timer base: tmr %p tmr->base %p base %p\n",
+			       timer, timer->base, base);
+			debug_object_info(timer, &timer_debug_descr);
+			BUG();
+		}
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -377,6 +377,28 @@ void debug_object_init_on_stack(void *ad
 	__debug_object_init(addr, descr, 1);
 }
 
+void debug_object_info(void *addr, struct debug_obj_descr *descr)
+{
+	struct debug_bucket *db;
+	struct debug_obj *obj;
+	unsigned long flags;
+
+	if (!debug_objects_enabled)
+		return;
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+}
+
 /**
  * debug_object_activate - debug checks when an object is activated
  * @addr:	address of the object

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-18 21:45             ` Thomas Gleixner
@ 2014-03-19 13:46               ` Vince Weaver
  2014-03-19 13:58                 ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-19 13:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 18 Mar 2014, Thomas Gleixner wrote:

> On Tue, 18 Mar 2014, Vince Weaver wrote:
> 
> > On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> > > 
> > > > [    3.678040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018                                                                        
> > > > [    3.686776] IP: [<ffffffff8106d7a8>] get_next_timer_interrupt+0x168/0x250    
> > > 
> > > The back trace is not really helpful. Is there anything before that
> > > BUG in dmesg?
> > 
> > no, just normal boot messages.
> > 
> > I re-compiled with the frame pointer enabled and here's an updated 
> > version:
> > 
> > [    3.722836] sd 1:0:0:0: [sda] 4096-byte physical blocks                      
> > [    3.728682] sd 1:0:0:0: [sda] Write Protect is off                           
> > [    3.728826] sr 0:0:0:0: Attached scsi generic sg0 type 5                     
> > [    3.728992] sd 1:0:0:0: Attached scsi generic sg1 type 0                     
> > [    3.745700] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA                                                          
> > [    3.787096]  sda: sda1 sda2 sda3 sda4                                        
> > [    3.792169] sd 1:0:0:0: [sda] Attached SCSI disk                             
> > [    4.312111] ------------[ cut here ]------------                             
> > [    4.317176] kernel BUG at kernel/timer.c:1084!                               
> 
> Wow. That's in the cascade code. So it looks like the timer got
> corrupted without using the timer api wrong. That might be a memset or
> a stray pointer. Can you apply the following patch?
> 
> That really should give us a hint.
> 
> If that fails, enable the tracer in your config and add the following
> to your commandline
> 
>    trace_event=timer_* ftrace_dump_on_oops
> 
> That will spill out a lot of crap over serial, but it should tell us
> which timer is corrupted.
> 
> Thanks,

your patch didn't seem to print anything additional the first time throug.

I then tried the trace command you suggested, but I'm getting an empty
ftrace buffer which possibly means I don't have enough ftrace kernel 
options enabled.

Here's the most recent boot crash.

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.14.0-rc7+ root=UUID=ec527b30-713c-47c1-bf23-9df8681afab7 ro console=tty0 console=ttyS0,115200n8 trace_event=timer_* ftrace_dump_on_oops
...
[    5.342788] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118894000
[    5.352842] ------------[ cut here ]------------
[    5.357938] WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:260 debug_print_object+0x83/0xa0()
[    5.367069] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null)
[    5.376559] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci e1000e libata ehci_pci ptp xhci_hcd ehci_hcd pps_core scsi_mod crc32c_intel usbcore usb_common thermal fan thermal_sys
[    5.401708] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc7+ #5
[    5.408653] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    5.416781]  0000000000000009 ffff88011ea43d68 ffffffff8155b113 ffff88011ea43db0
[    5.425271]  ffff88011ea43da0 ffffffff810661ad ffff880117a3af88 ffffffff8183a7e0
[    5.433768]  ffffffff8173de38 00000000000258b0 ffffffff818774d0 ffff88011ea43e00
[    5.442264] Call Trace:
[    5.445021]  <IRQ>  [<ffffffff8155b113>] dump_stack+0x45/0x56
[    5.451533]  [<ffffffff810661ad>] warn_slowpath_common+0x7d/0xa0
[    5.458165]  [<ffffffff8106621c>] warn_slowpath_fmt+0x4c/0x50
[    5.464499]  [<ffffffff8131d1e3>] debug_print_object+0x83/0xa0
[    5.470919]  [<ffffffff8131d904>] debug_object_info+0xf4/0x100
[    5.477310]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.482664]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    5.489123]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    5.496362]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    5.502288]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    5.507758]  [<ffffffff8156c075>] smp_apic_timer_interrupt+0x45/0x60
[    5.514694]  [<ffffffff8156a9dd>] apic_timer_interrupt+0x6d/0x80
[    5.521291]  <EOI>  [<ffffffff814330f2>] ? cpuidle_enter_state+0x52/0xc0
[    5.528865]  [<ffffffff81433219>] cpuidle_idle_call+0xb9/0x1f0
[    5.535305]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    5.541152]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    5.547572]  [<ffffffff81044af0>] start_secondary+0x1a0/0x1f0
[    5.553844] ---[ end trace 32b995b31388939e ]---
[    5.558927] ------------[ cut here ]------------
[    5.563994] kernel BUG at kernel/timer.c:1088!
[    5.568922] invalid opcode: 0000 [#1] SMP 
[    5.573614] Dumping ftrace buffer:
[    5.577383]    (ftrace buffer empty)
[    5.581373] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci e1000e libata ehci_pci ptp xhci_hcd ehci_hcd pps_core scsi_mod crc32c_intel usbcore usb_common thermal fan thermal_sys
[    5.606612] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W    3.14.0-rc7+ #5
[    5.614634] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    5.622701] task: ffff8801188609c0 ti: ffff880118866000 task.ti: ffff880118866000
[    5.630944] RIP: 0010:[<ffffffff81071904>]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.639315] RSP: 0018:ffff88011ea43e78  EFLAGS: 00010002
[    5.645132] RAX: 0000000000000024 RBX: ffff880117740150 RCX: 000000000000079f
[    5.652919] RDX: 00000000000012d0 RSI: 0000000000000002 RDI: 0000000000000002
[    5.660704] RBP: ffff88011ea43ea8 R08: 0000000000000000 R09: 00000000000002b1
[    5.668470] R10: ffffffff8165a580 R11: ffff88011ea43aa6 R12: ffff880118894000
[    5.676255] R13: ffff88011ea43e78 R14: 0000000000000020 R15: ffffffff818774d0
[    5.684061] FS:  0000000000000000(0000) GS:ffff88011ea40000(0000) knlGS:0000000000000000
[    5.692928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.699184] CR2: 0000000000bc5304 CR3: 000000000180e000 CR4: 00000000001407e0
[    5.706984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.714744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    5.722511] Stack:
[    5.724768]  ffff880117740150 ffff88011ea50740 ffff880118894000 0000000000000000
[    5.733264]  0000000000000001 0000000000000100 ffff88011ea43f18 ffffffff81072d7c
[    5.741748]  ffff880118895c28 ffff880118895828 ffff880118895428 ffff880118895028
[    5.750236] Call Trace:
[    5.752992]  <IRQ> 
[    5.755142]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    5.762074]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    5.769362]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    5.775279]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    5.780721]  [<ffffffff8156c075>] smp_apic_timer_interrupt+0x45/0x60
[    5.787676]  [<ffffffff8156a9dd>] apic_timer_interrupt+0x6d/0x80
[    5.794270]  <EOI> 
[    5.796439]  [<ffffffff814330f2>] ? cpuidle_enter_state+0x52/0xc0
[    5.803559]  [<ffffffff81433219>] cpuidle_idle_call+0xb9/0x1f0
[    5.809934]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    5.815794]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    5.822160]  [<ffffffff81044af0>] start_secondary+0x1a0/0x1f0
[    5.828463] Code: 5d 41 5e 5d c3 48 89 f3 4c 89 e1 48 89 de 48 c7 c7 f8 05 71 81 31 c0 e8 7e 6c 4e 00 48 c7 c6 e0 a7 83 81 48 89 df e8 0c bf 2a 00 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 
[    5.854807] RIP  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.860662]  RSP <ffff88011ea43e78>
[    5.864547] ---[ end trace 32b995b31388939f ]---
[    5.869596] Kernel panic - not syncing: Fatal exception in interrupt
[    5.876532] Dumping ftrace buffer:
[    5.880352]    (ftrace buffer empty)
[    5.884368] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-19 13:46               ` Vince Weaver
@ 2014-03-19 13:58                 ` Thomas Gleixner
  2014-03-19 14:42                   ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-19 13:58 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Wed, 19 Mar 2014, Vince Weaver wrote:
> On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> your patch didn't seem to print anything additional the first time throug.
> 
> I then tried the trace command you suggested, but I'm getting an empty
> ftrace buffer which possibly means I don't have enough ftrace kernel 
> options enabled.
> 
> Here's the most recent boot crash.
> 
> [    5.367069] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null)

Stupid me. We get the hint from the wreckaged object ....

A hopefully better approach is the delta patch below.

Thanks,

	tglx
------------------->
Index: linux-2.6/include/linux/debugobjects.h
===================================================================
--- linux-2.6.orig/include/linux/debugobjects.h
+++ linux-2.6/include/linux/debugobjects.h
@@ -30,6 +30,7 @@ struct debug_obj {
 	unsigned int		astate;
 	void			*object;
 	struct debug_obj_descr	*descr;
+	void			*hint;
 };
 
 /**
Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -255,9 +255,9 @@ static void debug_print_object(struct de
 			descr->debug_hint(obj->object) : NULL;
 		limit++;
 		WARN(1, KERN_ERR "ODEBUG: %s %s (active state %u) "
-				 "object type: %s hint: %pS\n",
+				 "object type: %s hint: %pS %pS\n",
 			msg, obj_states[obj->state], obj->astate,
-			descr->name, hint);
+		     descr->name, hint, obj->hint);
 	}
 	debug_objects_warnings++;
 }
@@ -326,6 +326,9 @@ __debug_object_init(void *addr, struct d
 		debug_object_is_on_stack(addr, onstack);
 	}
 
+	if (!obj->hint)
+		obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -447,6 +450,8 @@ int debug_object_activate(void *addr, st
 			ret = 0;
 			break;
 		}
+		if (!obj->hint)
+			obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
 		raw_spin_unlock_irqrestore(&db->lock, flags);
 		return ret;
 	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-19 13:58                 ` Thomas Gleixner
@ 2014-03-19 14:42                   ` Vince Weaver
  2014-03-19 15:05                     ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-19 14:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Wed, 19 Mar 2014, Thomas Gleixner wrote:

> On Wed, 19 Mar 2014, Vince Weaver wrote:
> > On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> > your patch didn't seem to print anything additional the first time throug.
> > 
> > I then tried the trace command you suggested, but I'm getting an empty
> > ftrace buffer which possibly means I don't have enough ftrace kernel 
> > options enabled.
> > 
> > Here's the most recent boot crash.
> > 
> > [    5.367069] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null)
> 
> Stupid me. We get the hint from the wreckaged object ....
> 
> A hopefully better approach is the delta patch below.

with that applied on top:

[    5.342681] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118618000
[    5.352786] ------------[ cut here ]------------
[    5.357911] WARNING: CPU: 4 PID: 0 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
[    5.367023] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null) delayed_work_timer_fn+0x0/0x20
[    5.379430] Modules linked in: sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom hid_generic usbhid hid ehci_pci xhci_hcd ehci_hcd ahci libahci e1000e libata usbcore ptp crc32c_intel usb_common scsi_mod pps_core fan thermal thermal_sys
[    5.404562] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.14.0-rc7+ #6
[    5.411527] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    5.419604]  0000000000000009 ffff88011eb03d60 ffffffff8155b173 ffff88011eb03da8
[    5.428072]  ffff88011eb03d98 ffffffff810661ad ffff880117a51828 ffffffff8183a7e0
[    5.436571]  ffffffff8173de38 00000000000258b0 ffffffff818774d0 ffff88011eb03df8
[    5.445017] Call Trace:
[    5.447777]  <IRQ>  [<ffffffff8155b173>] dump_stack+0x45/0x56
[    5.454257]  [<ffffffff810661ad>] warn_slowpath_common+0x7d/0xa0
[    5.460820]  [<ffffffff8106621c>] warn_slowpath_fmt+0x4c/0x50
[    5.467103]  [<ffffffff8131d1ec>] debug_print_object+0x8c/0xb0
[    5.473485]  [<ffffffff8107fc30>] ? __queue_work+0x320/0x320
[    5.479710]  [<ffffffff8131d954>] debug_object_info+0xf4/0x100
[    5.486082]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.491489]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    5.497965]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    5.505203]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    5.511108]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    5.516530]  [<ffffffff8156c0f5>] smp_apic_timer_interrupt+0x45/0x60
[    5.523488]  [<ffffffff8156aa5d>] apic_timer_interrupt+0x6d/0x80
[    5.530048]  <EOI>  [<ffffffff81433152>] ? cpuidle_enter_state+0x52/0xc0
[    5.537601]  [<ffffffff81433279>] cpuidle_idle_call+0xb9/0x1f0
[    5.543967]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    5.549798]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    5.556220]  [<ffffffff81044af0>] start_secondary+0x1a0/0x1f0
[    5.562492] ---[ end trace e60b62481a1a0611 ]---
[    5.567568] ------------[ cut here ]------------
[    5.572623] kernel BUG at kernel/timer.c:1088!
[    5.577511] invalid opcode: 0000 [#1] SMP 
[    5.582190] Dumping ftrace buffer:
[    5.585956]    (ftrace buffer empty)
[    5.589941] Modules linked in: sg sd_mod sr_mod crc_t10dif crct10dif_common cdrom hid_generic usbhid hid ehci_pci xhci_hcd ehci_hcd ahci libahci e1000e libata usbcore ptp crc32c_intel usb_common scsi_mod pps_core fan thermal thermal_sys
[    5.615094] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G        W    3.14.0-rc7+ #6
[    5.623062] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    5.631142] task: ffff88011886ea00 ti: ffff880118872000 task.ti: ffff880118872000
[    5.639286] RIP: 0010:[<ffffffff81071904>]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.647630] RSP: 0018:ffff88011eb03e78  EFLAGS: 00010002
[    5.653456] RAX: 0000000000000024 RBX: ffff880117740150 RCX: 00000000000007d4
[    5.661261] RDX: 0000000000001369 RSI: 0000000000000002 RDI: 0000000000000002
[    5.668990] RBP: ffff88011eb03ea8 R08: 0000000000000000 R09: 00000000000002b2
[    5.676785] R10: ffffffff8165a580 R11: ffff88011eb03a9e R12: ffff880118618000
[    5.684590] R13: ffff88011eb03e78 R14: 0000000000000020 R15: ffffffff818774d0
[    5.692366] FS:  0000000000000000(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
[    5.701150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.707459] CR2: 00007f8ec214b214 CR3: 000000000180e000 CR4: 00000000001407e0
[    5.715222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.723008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    5.730767] Stack:
[    5.733020]  ffff88011772b470 ffff88011eb10740 ffff880118618000 0000000000000000
[    5.741497]  0000000000000001 0000000000000100 ffff88011eb03f18 ffffffff81072d7c
[    5.749956]  ffff880118619c28 ffff880118619828 ffff880118619428 ffff880118619028
[    5.758481] Call Trace:
[    5.761210]  <IRQ> 
[    5.763390]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    5.770310]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    5.777520]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    5.783440]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    5.788888]  [<ffffffff8156c0f5>] smp_apic_timer_interrupt+0x45/0x60
[    5.795830]  [<ffffffff8156aa5d>] apic_timer_interrupt+0x6d/0x80
[    5.802380]  <EOI> 
[    5.804559]  [<ffffffff81433152>] ? cpuidle_enter_state+0x52/0xc0
[    5.811624]  [<ffffffff81433279>] cpuidle_idle_call+0xb9/0x1f0
[    5.818009]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    5.823838]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    5.830235]  [<ffffffff81044af0>] start_secondary+0x1a0/0x1f0
[    5.836523] Code: 5d 41 5e 5d c3 48 89 f3 4c 89 e1 48 89 de 48 c7 c7 f8 05 71 81 31 c0 e8 de 6c 4e 00 48 c7 c6 e0 a7 83 81 48 89 df e8 5c bf 2a 00 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 
[    5.862939] RIP  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.868769]  RSP <ffff88011eb03e78>
[    5.872662] ---[ end trace e60b62481a1a0612 ]---
[    5.877728] Kernel panic - not syncing: Fatal exception in interrupt
[    5.884701] Dumping ftrace buffer:
[    5.888452]    (ftrace buffer empty)
[    5.892407] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-19 14:42                   ` Vince Weaver
@ 2014-03-19 15:05                     ` Thomas Gleixner
  2014-03-19 17:04                       ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-19 15:05 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Wed, 19 Mar 2014, Vince Weaver wrote:
> On Wed, 19 Mar 2014, Thomas Gleixner wrote:
> 
> > On Wed, 19 Mar 2014, Vince Weaver wrote:
> > > On Tue, 18 Mar 2014, Thomas Gleixner wrote:
> > > your patch didn't seem to print anything additional the first time throug.
> > > 
> > > I then tried the trace command you suggested, but I'm getting an empty
> > > ftrace buffer which possibly means I don't have enough ftrace kernel 
> > > options enabled.
> > > 
> > > Here's the most recent boot crash.
> > > 
> > > [    5.367069] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null)
> > 
> > Stupid me. We get the hint from the wreckaged object ....
> > 
> > A hopefully better approach is the delta patch below.
> 
> with that applied on top:
> 
> [    5.342681] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118618000
> [    5.352786] ------------[ cut here ]------------
> [    5.357911] WARNING: CPU: 4 PID: 0 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
> [    5.367023] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null) delayed_work_timer_fn+0x0/0x20

Good, we have at least a point where we can gather useful
information. The timer belongs to a delayed work. The work itself is
tracked via debug objects as well. So we can get that as well. Delta
patch below.

Thanks,

	tglx
------

Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -14,6 +14,7 @@
 #include <linux/debugfs.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
+#include <linux/workqueue.h>
 
 #define ODEBUG_HASH_BITS	14
 #define ODEBUG_HASH_SIZE	(1 << ODEBUG_HASH_BITS)
@@ -400,6 +401,20 @@ void debug_object_info(void *addr, struc
 		debug_print_object(obj, "Info");
 
 	raw_spin_unlock_irqrestore(&db->lock, flags);
+
+	addr = container_of(addr, struct delayed_work, timer);
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
 }
 
 /**
@@ -452,6 +467,7 @@ int debug_object_activate(void *addr, st
 		}
 		if (!obj->hint)
 			obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+
 		raw_spin_unlock_irqrestore(&db->lock, flags);
 		return ret;
 	}





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-19 15:05                     ` Thomas Gleixner
@ 2014-03-19 17:04                       ` Vince Weaver
  2014-03-20 10:47                         ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-19 17:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Wed, 19 Mar 2014, Thomas Gleixner wrote:

> Good, we have at least a point where we can gather useful
> information. The timer belongs to a delayed work. The work itself is
> tracked via debug objects as well. So we can get that as well. Delta
> patch below.

with that patch applied:

[    4.314095] Invalid timer base: tmr ffff880117738150 tmr->base           (null) base ffffffff81a3edc0
[    4.324216] ------------[ cut here ]------------
[    4.329326] WARNING: CPU: 0 PID: 0 at lib/debugobjects.c:261 debug_print_object+0x8c/0xb0()
[    4.338495] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null) delayed_work_timer_fn+0x0/0x20
[    4.350889] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci ehci_pci e1000e xhci_hcd libata ehci_hcd ptp crc32c_intel scsi_mod pps_core usbcore usb_common fan thermal thermal_sys
[    4.375911] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc7+ #7
[    4.382820] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    4.390947]  0000000000000009 ffff88011ea03d60 ffffffff8155b253 ffff88011ea03da8
[    4.399428]  ffff88011ea03d98 ffffffff810661ad ffff8801179696d8 ffffffff8183a7e0
[    4.407889]  ffffffff8173de38 00000000000291b0 ffffffff818774d0 ffff88011ea03df8
[    4.416347] Call Trace:
[    4.419102]  <IRQ>  [<ffffffff8155b253>] dump_stack+0x45/0x56
[    4.425545]  [<ffffffff810661ad>] warn_slowpath_common+0x7d/0xa0
[    4.432085]  [<ffffffff8106621c>] warn_slowpath_fmt+0x4c/0x50
[    4.438391]  [<ffffffff8131d1ec>] debug_print_object+0x8c/0xb0
[    4.444799]  [<ffffffff8107fc30>] ? __queue_work+0x320/0x320
[    4.451012]  [<ffffffff8131da2c>] debug_object_info+0x1cc/0x1e0
[    4.457488]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    4.462847]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    4.469338]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    4.476553]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    4.482483]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    4.487946]  [<ffffffff8156c1b5>] smp_apic_timer_interrupt+0x45/0x60
[    4.494920]  [<ffffffff8156ab1d>] apic_timer_interrupt+0x6d/0x80
[    4.501446]  <EOI>  [<ffffffff81433232>] ? cpuidle_enter_state+0x52/0xc0
[    4.508936]  [<ffffffff81433359>] cpuidle_idle_call+0xb9/0x1f0
[    4.515367]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    4.521218]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    4.527624]  [<ffffffff8154c8c7>] rest_init+0x77/0x80
[    4.533205]  [<ffffffff818d4f67>] start_kernel+0x42f/0x43a
[    4.539210]  [<ffffffff818d4941>] ? repair_env_string+0x5c/0x5c
[    4.545659]  [<ffffffff818d4120>] ? early_idt_handlers+0x120/0x120
[    4.552386]  [<ffffffff818d45ee>] x86_64_start_reservations+0x2a/0x2c
[    4.559454]  [<ffffffff818d4733>] x86_64_start_kernel+0x143/0x152
[    4.566119] ---[ end trace 18d31e7b09fba1ee ]---
[    4.571194] ------------[ cut here ]------------
[    4.576258] WARNING: CPU: 0 PID: 0 at lib/debugobjects.c:261 debug_print_object+0x8c/0xb0()
[    4.585399] ODEBUG: Info initialized (active state 0) object type: work_struct hint:           (null)           (null)
[    4.597013] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci ehci_pci e1000e xhci_hcd libata ehci_hcd ptp crc32c_intel scsi_mod pps_core usbcore usb_common fan thermal thermal_sys
[    4.622133] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W    3.14.0-rc7+ #7
[    4.630166] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    4.638228]  0000000000000009 ffff88011ea03d60 ffffffff8155b253 ffff88011ea03da8
[    4.646702]  ffff88011ea03d98 ffffffff810661ad ffff880117a4cee8 ffffffff8183b920
[    4.655159]  ffffffff8173de38 00000000000291b0 ffffffff818774d0 ffff88011ea03df8
[    4.663587] Call Trace:
[    4.666345]  <IRQ>  [<ffffffff8155b253>] dump_stack+0x45/0x56
[    4.672835]  [<ffffffff810661ad>] warn_slowpath_common+0x7d/0xa0
[    4.679422]  [<ffffffff8106621c>] warn_slowpath_fmt+0x4c/0x50
[    4.685657]  [<ffffffff8131d1ec>] debug_print_object+0x8c/0xb0
[    4.692070]  [<ffffffff8131da14>] debug_object_info+0x1b4/0x1e0
[    4.698515]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    4.703868]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    4.710359]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    4.717575]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    4.723523]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    4.728954]  [<ffffffff8156c1b5>] smp_apic_timer_interrupt+0x45/0x60
[    4.735887]  [<ffffffff8156ab1d>] apic_timer_interrupt+0x6d/0x80
[    4.742470]  <EOI>  [<ffffffff81433232>] ? cpuidle_enter_state+0x52/0xc0
[    4.750033]  [<ffffffff81433359>] cpuidle_idle_call+0xb9/0x1f0
[    4.756406]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    4.762233]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    4.768669]  [<ffffffff8154c8c7>] rest_init+0x77/0x80
[    4.774215]  [<ffffffff818d4f67>] start_kernel+0x42f/0x43a
[    4.780267]  [<ffffffff818d4941>] ? repair_env_string+0x5c/0x5c
[    4.786719]  [<ffffffff818d4120>] ? early_idt_handlers+0x120/0x120
[    4.793484]  [<ffffffff818d45ee>] x86_64_start_reservations+0x2a/0x2c
[    4.800540]  [<ffffffff818d4733>] x86_64_start_kernel+0x143/0x152
[    4.807186] ---[ end trace 18d31e7b09fba1ef ]---
[    4.812254] ------------[ cut here ]------------
[    4.817350] kernel BUG at kernel/timer.c:1088!
[    4.822227] invalid opcode: 0000 [#1] SMP 
[    4.826884] Dumping ftrace buffer:
[    4.830644]    (ftrace buffer empty)
[    4.834617] Modules linked in: sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common hid_generic usbhid hid ahci libahci ehci_pci e1000e xhci_hcd libata ehci_hcd ptp crc32c_intel scsi_mod pps_core usbcore usb_common fan thermal thermal_sys
[    4.859699] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W    3.14.0-rc7+ #7
[    4.867705] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[    4.875771] task: ffffffff81815460 ti: ffffffff81800000 task.ti: ffffffff81800000
[    4.883928] RIP: 0010:[<ffffffff81071904>]  [<ffffffff81071904>] cascade+0xc4/0xd0
[    4.892320] RSP: 0018:ffff88011ea03e78  EFLAGS: 00010002
[    4.898134] RAX: 0000000000000024 RBX: ffff880117738150 RCX: 0000000000000854
[    4.905885] RDX: 0000000000001462 RSI: 0000000000000002 RDI: 0000000000000002
[    4.913698] RBP: ffff88011ea03ea8 R08: 0000000000000000 R09: 00000000000002d7
[    4.921440] R10: ffffffff8165a580 R11: ffff88011ea03a9e R12: ffffffff81a3edc0
[    4.929190] R13: ffff88011ea03e78 R14: 000000000000001f R15: ffffffff818774d0
[    4.936983] FS:  0000000000000000(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000
[    4.945826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.952134] CR2: 00007fe0f44a39d8 CR3: 000000000180e000 CR4: 00000000001407f0
[    4.959929] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.967728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.975473] Stack:
[    4.977744]  ffff8801176f7060 ffff88011ea10740 ffffffff81a3edc0 0000000000000000
[    4.986202]  0000000000000001 0000000000000100 ffff88011ea03f18 ffffffff81072d7c
[    4.994649]  ffffffff81a409e8 ffffffff81a405e8 ffffffff81a401e8 ffffffff81a3fde8
[    5.003114] Call Trace:
[    5.005838]  <IRQ> 
[    5.007979]  [<ffffffff81072d7c>] run_timer_softirq+0x21c/0x2a0
[    5.014893]  [<ffffffff810cf5db>] ? clockevents_program_event+0x6b/0xf0
[    5.022102]  [<ffffffff8106b5a5>] __do_softirq+0xf5/0x290
[    5.028027]  [<ffffffff8106b98d>] irq_exit+0xad/0xc0
[    5.033464]  [<ffffffff8156c1b5>] smp_apic_timer_interrupt+0x45/0x60
[    5.040410]  [<ffffffff8156ab1d>] apic_timer_interrupt+0x6d/0x80
[    5.046956]  <EOI> 
[    5.049095]  [<ffffffff81433232>] ? cpuidle_enter_state+0x52/0xc0
[    5.056119]  [<ffffffff81433359>] cpuidle_idle_call+0xb9/0x1f0
[    5.062487]  [<ffffffff8101e48e>] arch_cpu_idle+0xe/0x30
[    5.070787]  [<ffffffff810bb57e>] cpu_startup_entry+0x9e/0x240
[    5.077142]  [<ffffffff8154c8c7>] rest_init+0x77/0x80
[    5.082670]  [<ffffffff818d4f67>] start_kernel+0x42f/0x43a
[    5.088685]  [<ffffffff818d4941>] ? repair_env_string+0x5c/0x5c
[    5.095155]  [<ffffffff818d4120>] ? early_idt_handlers+0x120/0x120
[    5.101910]  [<ffffffff818d45ee>] x86_64_start_reservations+0x2a/0x2c
[    5.108979]  [<ffffffff818d4733>] x86_64_start_kernel+0x143/0x152
[    5.115630] Code: 5d 41 5e 5d c3 48 89 f3 4c 89 e1 48 89 de 48 c7 c7 f8 05 71 81 31 c0 e8 be 6d 4e 00 48 c7 c6 e0 a7 83 81 48 89 df e8 5c bf 2a 00 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 
[    5.141945] RIP  [<ffffffff81071904>] cascade+0xc4/0xd0
[    5.147774]  RSP <ffff88011ea03e78>
[    5.151625] ---[ end trace 18d31e7b09fba1f0 ]---
[    5.156707] Kernel panic - not syncing: Fatal exception in interrupt
[    5.163725] Dumping ftrace buffer:
[    5.167535]    (ftrace buffer empty)
[    5.171488] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-19 17:04                       ` Vince Weaver
@ 2014-03-20 10:47                         ` Thomas Gleixner
  2014-03-20 14:47                           ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-20 10:47 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar



On Wed, 19 Mar 2014, Vince Weaver wrote:

> On Wed, 19 Mar 2014, Thomas Gleixner wrote:
> 
> > Good, we have at least a point where we can gather useful
> > information. The timer belongs to a delayed work. The work itself is
> > tracked via debug objects as well. So we can get that as well. Delta
> > patch below.
> 
> with that patch applied:
> 
> [    4.314095] Invalid timer base: tmr ffff880117738150 tmr->base           (null) base ffffffff81a3edc0
> [    4.324216] ------------[ cut here ]------------
> [    4.329326] WARNING: CPU: 0 PID: 0 at lib/debugobjects.c:261 debug_print_object+0x8c/0xb0()
> [    4.338495] ODEBUG: Info active (active state 0) object type: timer_list hint:           (null) delayed_work_timer_fn+0x0/0x20

Do you have DEBUG_OBJECTS_WORK and DEBUG_OBJECTS_FREE enabled?

If yes, then I'm completely puzzled why we find the timer in the
tracking objects, but not the delayed work itself.

Find a more brute force method to get this decoded below. Full patch
this time, please zap the old one. That should spill out the recent
debug object calls and we can just hope, that we get enough history to
find this horror.

Please add "ftrace_dump_on_oops" to the kernel command line. No other
trace related options required.

Btw, can you send me your .config file, so I might try to reproduce.

Thanks,

	tglx

---------

Index: linux-2.6/include/linux/debugobjects.h
===================================================================
--- linux-2.6.orig/include/linux/debugobjects.h
+++ linux-2.6/include/linux/debugobjects.h
@@ -30,6 +30,7 @@ struct debug_obj {
 	unsigned int		astate;
 	void			*object;
 	struct debug_obj_descr	*descr;
+	void			*hint;
 };
 
 /**
@@ -68,6 +69,7 @@ extern void debug_object_deactivate(void
 extern void debug_object_destroy   (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_free      (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_assert_init(void *addr, struct debug_obj_descr *descr);
+extern void debug_object_info(void *addr, struct debug_obj_descr *descr);
 
 /*
  * Active state:
@@ -95,6 +97,8 @@ static inline void
 debug_object_free      (void *addr, struct debug_obj_descr *descr) { }
 static inline void
 debug_object_assert_init(void *addr, struct debug_obj_descr *descr) { }
+static inline void
+debug_object_info(void *addr, struct debug_obj_descr *descr) { }
 
 static inline void debug_objects_early_init(void) { }
 static inline void debug_objects_mem_init(void) { }
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -1081,7 +1081,12 @@ static int cascade(struct tvec_base *bas
 	 * don't have to detach them individually.
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
+		if (tbase_get_base(timer->base) != base) {
+			pr_err("Invalid timer base: tmr %p tmr->base %p base %p\n",
+			       timer, timer->base, base);
+			debug_object_info(timer, &timer_debug_descr);
+			BUG();
+		}
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -14,6 +14,7 @@
 #include <linux/debugfs.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
+#include <linux/workqueue.h>
 
 #define ODEBUG_HASH_BITS	14
 #define ODEBUG_HASH_SIZE	(1 << ODEBUG_HASH_BITS)
@@ -255,9 +256,9 @@ static void debug_print_object(struct de
 			descr->debug_hint(obj->object) : NULL;
 		limit++;
 		WARN(1, KERN_ERR "ODEBUG: %s %s (active state %u) "
-				 "object type: %s hint: %pS\n",
+				 "object type: %s hint: %pS %pS\n",
 			msg, obj_states[obj->state], obj->astate,
-			descr->name, hint);
+		     descr->name, hint, obj->hint);
 	}
 	debug_objects_warnings++;
 }
@@ -326,6 +327,11 @@ __debug_object_init(void *addr, struct d
 		debug_object_is_on_stack(addr, onstack);
 	}
 
+	if (!obj->hint)
+		obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+
+	trace_printk("Object init: %s %p %pS\n", descr->name, addr, obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -377,6 +383,42 @@ void debug_object_init_on_stack(void *ad
 	__debug_object_init(addr, descr, 1);
 }
 
+void debug_object_info(void *addr, struct debug_obj_descr *descr)
+{
+	struct debug_bucket *db;
+	struct debug_obj *obj;
+	unsigned long flags;
+
+	if (!debug_objects_enabled)
+		return;
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+
+	addr = container_of(addr, struct delayed_work, timer);
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+}
+
 /**
  * debug_object_activate - debug checks when an object is activated
  * @addr:	address of the object
@@ -403,6 +445,10 @@ int debug_object_activate(void *addr, st
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+
+		trace_printk("Object activate: %s %p %pS\n", descr->name, addr,
+			     obj->hint);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -425,6 +471,9 @@ int debug_object_activate(void *addr, st
 			ret = 0;
 			break;
 		}
+		if (!obj->hint)
+			obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+
 		raw_spin_unlock_irqrestore(&db->lock, flags);
 		return ret;
 	}
@@ -463,6 +512,10 @@ void debug_object_deactivate(void *addr,
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+
+		trace_printk("Object deactivate: %s %p %pS\n",
+			     descr->name, addr, obj->hint);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -513,6 +566,9 @@ void debug_object_destroy(void *addr, st
 	if (!obj)
 		goto out_unlock;
 
+	trace_printk("Object destroy: %s %p, %pS\n", descr->name, addr,
+		     obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -559,6 +615,8 @@ void debug_object_free(void *addr, struc
 	if (!obj)
 		goto out_unlock;
 
+	trace_printk("Object free: %s %p %pS\n", descr->name, addr, obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_ACTIVE:
 		debug_print_object(obj, "free");
@@ -711,6 +769,8 @@ repeat:
 		/* Now free them */
 		hlist_for_each_entry_safe(obj, tmp, &freelist, node) {
 			hlist_del(&obj->node);
+			trace_printk("Tracking Object free: %s %p %pS\n",
+				     descr->name, addr, obj->hint);
 			free_object(obj);
 		}
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-20 10:47                         ` Thomas Gleixner
@ 2014-03-20 14:47                           ` Vince Weaver
  2014-03-20 15:12                             ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-20 14:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 689 bytes --]

On Thu, 20 Mar 2014, Thomas Gleixner wrote:

> Do you have DEBUG_OBJECTS_WORK and DEBUG_OBJECTS_FREE enabled?

yes.

> Find a more brute force method to get this decoded below. Full patch
> this time, please zap the old one. That should spill out the recent
> debug object calls and we can just hope, that we get enough history to
> find this horror.

The patch didn't compile, as "addr" was not known.  I guessed and changed 
it to "oaddr", but if that's not what you meant then the trace results 
might be useless.

> Btw, can you send me your .config file, so I might try to reproduce.

attached is the log (compressed as it was long this time) as well as the 
config I am using.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 30570 bytes --]

[-- Attachment #3: Type: APPLICATION/octet-stream, Size: 33879 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-20 14:47                           ` Vince Weaver
@ 2014-03-20 15:12                             ` Thomas Gleixner
  2014-03-20 21:25                               ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-20 15:12 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Thu, 20 Mar 2014, Vince Weaver wrote:

> On Thu, 20 Mar 2014, Thomas Gleixner wrote:
> > Find a more brute force method to get this decoded below. Full patch
> > this time, please zap the old one. That should spill out the recent
> > debug object calls and we can just hope, that we get enough history to
> > find this horror.
> 
> The patch didn't compile, as "addr" was not known.  I guessed and
> changed it to "oaddr", but if that's not what you meant then the
> trace results might be useless.

Yeah, it causes the explosion in the debug object code. This should
have been:

+                       trace_printk("Tracking Object free: %s %p %pS\n",
+                                    descr->name, obj->object, obj->hint);

 
Sorry, -ENOTENOUGHCOFFEE to switch on DEBUG_OBJECTS_FREE=y before
compile testing.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-20 15:12                             ` Thomas Gleixner
@ 2014-03-20 21:25                               ` Vince Weaver
  2014-03-21  9:02                                 ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-20 21:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 345 bytes --]

On Thu, 20 Mar 2014, Thomas Gleixner wrote:
> 
> Yeah, it causes the explosion in the debug object code. This should
> have been:
> 
> +                       trace_printk("Tracking Object free: %s %p %pS\n",
> +                                    descr->name, obj->object, obj->hint);

attached are the results with the corrected patch.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 30955 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-20 21:25                               ` Vince Weaver
@ 2014-03-21  9:02                                 ` Thomas Gleixner
  2014-03-21 20:11                                   ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-21  9:02 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Thu, 20 Mar 2014, Vince Weaver wrote:

> On Thu, 20 Mar 2014, Thomas Gleixner wrote:
> > 
> > Yeah, it causes the explosion in the debug object code. This should
> > have been:
> > 
> > +                       trace_printk("Tracking Object free: %s %p %pS\n",
> > +                                    descr->name, obj->object, obj->hint);
> 
> attached are the results with the corrected patch.

I'm a complete idiot. I was staring at oaddr and did not notice that
descr->name is the real culprit. Sorry. Delta patch below.

Thanks,

	tglx

Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -769,6 +769,7 @@ repeat:
 		/* Now free them */
 		hlist_for_each_entry_safe(obj, tmp, &freelist, node) {
 			hlist_del(&obj->node);
+			descr = obj->descr;
 			trace_printk("Tracking Object free: %s %p %pS\n",
 				     descr->name, obj->object, obj->hint);
 			free_object(obj);



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-21  9:02                                 ` Thomas Gleixner
@ 2014-03-21 20:11                                   ` Vince Weaver
  2014-03-22 10:24                                     ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-21 20:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 236 bytes --]

On Fri, 21 Mar 2014, Thomas Gleixner wrote:
> 
> I'm a complete idiot. I was staring at oaddr and did not notice that
> descr->name is the real culprit. Sorry. Delta patch below.

OK.  The log was much longer this time, attached.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 183208 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-21 20:11                                   ` Vince Weaver
@ 2014-03-22 10:24                                     ` Thomas Gleixner
  2014-03-22 20:22                                       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-22 10:24 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Fri, 21 Mar 2014, Vince Weaver wrote:

> On Fri, 21 Mar 2014, Thomas Gleixner wrote:
> > 
> > I'm a complete idiot. I was staring at oaddr and did not notice that
> > descr->name is the real culprit. Sorry. Delta patch below.
> 
> OK.  The log was much longer this time, attached.

Hmmm.

[    2.739858] NULL pointer dereference at           (null)
[    2.747390] IP: [<          (null)>]           (null)
[    2.752970] PGD 0
[    2.755287] Oops: 0010 [#1] SMP

So this time the CPU branched to NULL. So let me recap.

First you had the explosion in the hrtimer code. After enabling debug
stuff it went to the timer_list and now it looks different again.

So that looks more like a random memory corruption.

Nasty to debug. And of course it does not reproduce here. I'll throw
your config at more machines in the hope that something will trigger
it.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-22 10:24                                     ` Thomas Gleixner
@ 2014-03-22 20:22                                       ` Thomas Gleixner
  2014-03-23 15:14                                         ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-22 20:22 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> On Fri, 21 Mar 2014, Vince Weaver wrote:
> 
> > On Fri, 21 Mar 2014, Thomas Gleixner wrote:
> > > 
> > > I'm a complete idiot. I was staring at oaddr and did not notice that
> > > descr->name is the real culprit. Sorry. Delta patch below.
> > 
> > OK.  The log was much longer this time, attached.
> 
> Hmmm.
> 
> [    2.739858] NULL pointer dereference at           (null)
> [    2.747390] IP: [<          (null)>]           (null)
> [    2.752970] PGD 0
> [    2.755287] Oops: 0010 [#1] SMP
> 
> So this time the CPU branched to NULL. So let me recap.
> 
> First you had the explosion in the hrtimer code. After enabling debug
> stuff it went to the timer_list and now it looks different again.
> 
> So that looks more like a random memory corruption.
> 
> Nasty to debug. And of course it does not reproduce here. I'll throw
> your config at more machines in the hope that something will trigger
> it.

I've refined the trace_printk stuff in the hope to get a bit more info
out of it.

Thanks,

	tglx

------------>

Index: linux-2.6/include/linux/debugobjects.h
===================================================================
--- linux-2.6.orig/include/linux/debugobjects.h
+++ linux-2.6/include/linux/debugobjects.h
@@ -30,6 +30,7 @@ struct debug_obj {
 	unsigned int		astate;
 	void			*object;
 	struct debug_obj_descr	*descr;
+	void			*hint;
 };
 
 /**
@@ -68,6 +69,7 @@ extern void debug_object_deactivate(void
 extern void debug_object_destroy   (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_free      (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_assert_init(void *addr, struct debug_obj_descr *descr);
+extern void debug_object_info(void *addr, struct debug_obj_descr *descr);
 
 /*
  * Active state:
@@ -95,6 +97,8 @@ static inline void
 debug_object_free      (void *addr, struct debug_obj_descr *descr) { }
 static inline void
 debug_object_assert_init(void *addr, struct debug_obj_descr *descr) { }
+static inline void
+debug_object_info(void *addr, struct debug_obj_descr *descr) { }
 
 static inline void debug_objects_early_init(void) { }
 static inline void debug_objects_mem_init(void) { }
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -1081,7 +1081,12 @@ static int cascade(struct tvec_base *bas
 	 * don't have to detach them individually.
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
+		if (tbase_get_base(timer->base) != base) {
+			pr_err("Invalid timer base: tmr %p tmr->base %p base %p\n",
+			       timer, timer->base, base);
+			debug_object_info(timer, &timer_debug_descr);
+			BUG();
+		}
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -14,6 +14,7 @@
 #include <linux/debugfs.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
+#include <linux/workqueue.h>
 
 #define ODEBUG_HASH_BITS	14
 #define ODEBUG_HASH_SIZE	(1 << ODEBUG_HASH_BITS)
@@ -140,6 +141,7 @@ alloc_object(void *addr, struct debug_bu
 		obj->descr  = descr;
 		obj->state  = ODEBUG_STATE_NONE;
 		obj->astate = 0;
+		obj->hint = NULL;
 		hlist_del(&obj->node);
 
 		hlist_add_head(&obj->node, &b->list);
@@ -255,9 +257,9 @@ static void debug_print_object(struct de
 			descr->debug_hint(obj->object) : NULL;
 		limit++;
 		WARN(1, KERN_ERR "ODEBUG: %s %s (active state %u) "
-				 "object type: %s hint: %pS\n",
+				 "object type: %s hint: %pS %pS\n",
 			msg, obj_states[obj->state], obj->astate,
-			descr->name, hint);
+		     descr->name, hint, obj->hint);
 	}
 	debug_objects_warnings++;
 }
@@ -326,6 +328,10 @@ __debug_object_init(void *addr, struct d
 		debug_object_is_on_stack(addr, onstack);
 	}
 
+	obj->hint = descr->debug_hint ? descr->debug_hint(addr) : NULL;
+	trace_printk("%s %p %p %pS\n", descr->name, addr, obj->hint,
+		     obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -377,6 +383,42 @@ void debug_object_init_on_stack(void *ad
 	__debug_object_init(addr, descr, 1);
 }
 
+void debug_object_info(void *addr, struct debug_obj_descr *descr)
+{
+	struct debug_bucket *db;
+	struct debug_obj *obj;
+	unsigned long flags;
+
+	if (!debug_objects_enabled)
+		return;
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+
+	addr = container_of(addr, struct delayed_work, timer);
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+}
+
 /**
  * debug_object_activate - debug checks when an object is activated
  * @addr:	address of the object
@@ -403,6 +445,11 @@ int debug_object_activate(void *addr, st
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+
+		obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+		trace_printk("%s %p %p %pS\n", descr->name, addr,
+			     obj->hint, obj->hint);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -425,6 +472,7 @@ int debug_object_activate(void *addr, st
 			ret = 0;
 			break;
 		}
+
 		raw_spin_unlock_irqrestore(&db->lock, flags);
 		return ret;
 	}
@@ -463,6 +511,10 @@ void debug_object_deactivate(void *addr,
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+		obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+		trace_printk("%s %p %p %pS\n", descr->name, addr,
+			     obj->hint, obj->hint);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -513,6 +565,10 @@ void debug_object_destroy(void *addr, st
 	if (!obj)
 		goto out_unlock;
 
+	obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+	trace_printk("%s %p %p %pS\n", descr->name, addr,
+		     obj->hint, obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -559,6 +615,10 @@ void debug_object_free(void *addr, struc
 	if (!obj)
 		goto out_unlock;
 
+	obj->hint = descr->debug_hint ?	descr->debug_hint(addr) : NULL;
+	trace_printk("%s %p %p %pS\n", descr->name, addr,
+		     obj->hint, obj->hint);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_ACTIVE:
 		debug_print_object(obj, "free");

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-22 20:22                                       ` Thomas Gleixner
@ 2014-03-23 15:14                                         ` Thomas Gleixner
  2014-03-23 23:25                                           ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-23 15:14 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Sat, 22 Mar 2014, Thomas Gleixner wrote:

> On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> > On Fri, 21 Mar 2014, Vince Weaver wrote:
> > 
> > > On Fri, 21 Mar 2014, Thomas Gleixner wrote:
> > > > 
> > > > I'm a complete idiot. I was staring at oaddr and did not notice that
> > > > descr->name is the real culprit. Sorry. Delta patch below.
> > > 
> > > OK.  The log was much longer this time, attached.
> > 
> > Hmmm.
> > 
> > [    2.739858] NULL pointer dereference at           (null)
> > [    2.747390] IP: [<          (null)>]           (null)
> > [    2.752970] PGD 0
> > [    2.755287] Oops: 0010 [#1] SMP
> > 
> > So this time the CPU branched to NULL. So let me recap.
> > 
> > First you had the explosion in the hrtimer code. After enabling debug
> > stuff it went to the timer_list and now it looks different again.
> > 
> > So that looks more like a random memory corruption.
> > 
> > Nasty to debug. And of course it does not reproduce here. I'll throw
> > your config at more machines in the hope that something will trigger
> > it.
> 
> I've refined the trace_printk stuff in the hope to get a bit more info
> out of it.

You might also try with the trace_printks removed. That wont give us
the history, but maybe then the bug happens again at some decodable
place.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-23 15:14                                         ` Thomas Gleixner
@ 2014-03-23 23:25                                           ` Thomas Gleixner
  2014-03-25 21:06                                             ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-23 23:25 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Sun, 23 Mar 2014, Thomas Gleixner wrote:
> On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> > On Sat, 22 Mar 2014, Thomas Gleixner wrote:
> > > On Fri, 21 Mar 2014, Vince Weaver wrote:
> > > 
> > > > On Fri, 21 Mar 2014, Thomas Gleixner wrote:
> > > > > 
> > > > > I'm a complete idiot. I was staring at oaddr and did not notice that
> > > > > descr->name is the real culprit. Sorry. Delta patch below.
> > > > 
> > > > OK.  The log was much longer this time, attached.
> > > 
> > > Hmmm.
> > > 
> > > [    2.739858] NULL pointer dereference at           (null)
> > > [    2.747390] IP: [<          (null)>]           (null)
> > > [    2.752970] PGD 0
> > > [    2.755287] Oops: 0010 [#1] SMP
> > > 
> > > So this time the CPU branched to NULL. So let me recap.
> > > 
> > > First you had the explosion in the hrtimer code. After enabling debug
> > > stuff it went to the timer_list and now it looks different again.
> > > 
> > > So that looks more like a random memory corruption.
> > > 
> > > Nasty to debug. And of course it does not reproduce here. I'll throw
> > > your config at more machines in the hope that something will trigger
> > > it.
> > 
> > I've refined the trace_printk stuff in the hope to get a bit more info
> > out of it.
> 
> You might also try with the trace_printks removed. That wont give us
> the history, but maybe then the bug happens again at some decodable
> place.
> 

I tried to reproduce on a few machines, but no luck.

But I tooks some time to tweak the debugobjects patch further in the
hope that it helps.

I wonder whether it might be helpful to instrument memset/memcpy with
debugobjects. Those are pretty much the same as a free from an active
object POV. But that's a non trivial thing to do.

I'm traveling tomorrow, so I wont be able to respond before Tuesday.

Thanks,

	tglx

--------------->
 include/linux/debugobjects.h |    4 ++
 kernel/timer.c               |    7 +++
 lib/debugobjects.c           |   83 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 90 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/debugobjects.h
===================================================================
--- linux-2.6.orig/include/linux/debugobjects.h
+++ linux-2.6/include/linux/debugobjects.h
@@ -30,6 +30,7 @@ struct debug_obj {
 	unsigned int		astate;
 	void			*object;
 	struct debug_obj_descr	*descr;
+	void			*hint;
 };
 
 /**
@@ -68,6 +69,7 @@ extern void debug_object_deactivate(void
 extern void debug_object_destroy   (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_free      (void *addr, struct debug_obj_descr *descr);
 extern void debug_object_assert_init(void *addr, struct debug_obj_descr *descr);
+extern void debug_object_info(void *addr, struct debug_obj_descr *descr);
 
 /*
  * Active state:
@@ -95,6 +97,8 @@ static inline void
 debug_object_free      (void *addr, struct debug_obj_descr *descr) { }
 static inline void
 debug_object_assert_init(void *addr, struct debug_obj_descr *descr) { }
+static inline void
+debug_object_info(void *addr, struct debug_obj_descr *descr) { }
 
 static inline void debug_objects_early_init(void) { }
 static inline void debug_objects_mem_init(void) { }
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -1081,7 +1081,12 @@ static int cascade(struct tvec_base *bas
 	 * don't have to detach them individually.
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
+		if (tbase_get_base(timer->base) != base) {
+			pr_err("Invalid timer base: tmr %p tmr->base %p base %p\n",
+			       timer, timer->base, base);
+			debug_object_info(timer, &timer_debug_descr);
+			BUG();
+		}
 		/* No accounting, while moving them */
 		__internal_add_timer(base, timer);
 	}
Index: linux-2.6/lib/debugobjects.c
===================================================================
--- linux-2.6.orig/lib/debugobjects.c
+++ linux-2.6/lib/debugobjects.c
@@ -70,13 +70,14 @@ static int __init disable_object_debug(c
 early_param("debug_objects", enable_object_debug);
 early_param("no_debug_objects", disable_object_debug);
 
-static const char *obj_states[ODEBUG_STATE_MAX] = {
+static const char *obj_states[ODEBUG_STATE_MAX + 1] = {
 	[ODEBUG_STATE_NONE]		= "none",
 	[ODEBUG_STATE_INIT]		= "initialized",
 	[ODEBUG_STATE_INACTIVE]		= "inactive",
 	[ODEBUG_STATE_ACTIVE]		= "active",
 	[ODEBUG_STATE_DESTROYED]	= "destroyed",
 	[ODEBUG_STATE_NOTAVAILABLE]	= "not available",
+	[ODEBUG_STATE_MAX]		= "free",
 };
 
 static void fill_pool(void)
@@ -140,6 +141,7 @@ alloc_object(void *addr, struct debug_bu
 		obj->descr  = descr;
 		obj->state  = ODEBUG_STATE_NONE;
 		obj->astate = 0;
+		obj->hint = NULL;
 		hlist_del(&obj->node);
 
 		hlist_add_head(&obj->node, &b->list);
@@ -245,6 +247,47 @@ static struct debug_bucket *get_bucket(u
 	return &obj_hash[hash];
 }
 
+static void debug_object_update_hint(struct debug_obj *obj, int newstate)
+{
+	struct debug_obj_descr *descr = obj->descr;
+	bool has_hint = descr->debug_hint != NULL;
+	void *hint;
+
+	static int limit;
+
+	if (!has_hint || limit > 10)
+		return;
+
+	hint = descr->debug_hint(obj->object);
+#if 1
+	trace_printk("obj: %p %s --> %s (active state %u) type: %s hint: %pS -> %pS\n",
+		     obj->object, obj_states[obj->state], obj_states[newstate], obj->astate,
+		     descr->name, obj->hint, hint);
+#endif
+	/* This does not care about improper state transitions! */
+	switch (newstate) {
+	case ODEBUG_STATE_INIT:
+		return;
+	case ODEBUG_STATE_ACTIVE:
+		if (!hint || !kernel_text_address((unsigned long) hint))
+			break;
+		obj->hint = hint;
+		return;
+	case ODEBUG_STATE_INACTIVE:
+		if (!hint || !kernel_text_address((unsigned long) hint))
+			break;
+		/* Someone changed the callback of an active object */
+		if (obj->hint != hint && obj->state == ODEBUG_STATE_ACTIVE)
+			break;
+	default:
+		return;
+	}
+	WARN(1, KERN_ERR "ODEBUG: %s -> %s (active state %u) object type: %s hint: %pS -> %pS\n",
+	     obj_states[obj->state], obj_states[newstate], obj->astate, descr->name,
+	     obj->hint, hint);
+	limit++;
+}
+
 static void debug_print_object(struct debug_obj *obj, char *msg)
 {
 	struct debug_obj_descr *descr = obj->descr;
@@ -255,9 +298,9 @@ static void debug_print_object(struct de
 			descr->debug_hint(obj->object) : NULL;
 		limit++;
 		WARN(1, KERN_ERR "ODEBUG: %s %s (active state %u) "
-				 "object type: %s hint: %pS\n",
+				 "object type: %s hint: %pS %pS\n",
 			msg, obj_states[obj->state], obj->astate,
-			descr->name, hint);
+		     descr->name, hint, obj->hint);
 	}
 	debug_objects_warnings++;
 }
@@ -326,6 +369,8 @@ __debug_object_init(void *addr, struct d
 		debug_object_is_on_stack(addr, onstack);
 	}
 
+	debug_object_update_hint(obj, ODEBUG_STATE_INIT);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -377,6 +422,28 @@ void debug_object_init_on_stack(void *ad
 	__debug_object_init(addr, descr, 1);
 }
 
+void debug_object_info(void *addr, struct debug_obj_descr *descr)
+{
+	struct debug_bucket *db;
+	struct debug_obj *obj;
+	unsigned long flags;
+
+	if (!debug_objects_enabled)
+		return;
+
+	db = get_bucket((unsigned long) addr);
+
+	raw_spin_lock_irqsave(&db->lock, flags);
+
+	obj = lookup_object(addr, db);
+	if (!obj)
+		pr_err("Object unknown %p\n", addr);
+	else
+		debug_print_object(obj, "Info");
+
+	raw_spin_unlock_irqrestore(&db->lock, flags);
+}
+
 /**
  * debug_object_activate - debug checks when an object is activated
  * @addr:	address of the object
@@ -403,6 +470,9 @@ int debug_object_activate(void *addr, st
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+
+		debug_object_update_hint(obj, ODEBUG_STATE_ACTIVE);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -425,6 +495,7 @@ int debug_object_activate(void *addr, st
 			ret = 0;
 			break;
 		}
+
 		raw_spin_unlock_irqrestore(&db->lock, flags);
 		return ret;
 	}
@@ -463,6 +534,8 @@ void debug_object_deactivate(void *addr,
 
 	obj = lookup_object(addr, db);
 	if (obj) {
+		debug_object_update_hint(obj, ODEBUG_STATE_INACTIVE);
+
 		switch (obj->state) {
 		case ODEBUG_STATE_INIT:
 		case ODEBUG_STATE_INACTIVE:
@@ -513,6 +586,8 @@ void debug_object_destroy(void *addr, st
 	if (!obj)
 		goto out_unlock;
 
+	debug_object_update_hint(obj, ODEBUG_STATE_DESTROYED);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_NONE:
 	case ODEBUG_STATE_INIT:
@@ -559,6 +634,8 @@ void debug_object_free(void *addr, struc
 	if (!obj)
 		goto out_unlock;
 
+	debug_object_update_hint(obj, ODEBUG_STATE_MAX);
+
 	switch (obj->state) {
 	case ODEBUG_STATE_ACTIVE:
 		debug_print_object(obj, "free");


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-23 23:25                                           ` Thomas Gleixner
@ 2014-03-25 21:06                                             ` Vince Weaver
  2014-03-25 21:52                                               ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-25 21:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 389 bytes --]

On Mon, 24 Mar 2014, Thomas Gleixner wrote:

> But I tooks some time to tweak the debugobjects patch further in the
> hope that it helps.

I've attached the log from your most recent patch, it's pretty long.

I'll try without the trace_printk too.

It's been busy here which is why I've been slow to try the patch, but 
hopefully I'll have some more time to look at things tomorrow.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 122949 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-25 21:06                                             ` Vince Weaver
@ 2014-03-25 21:52                                               ` Thomas Gleixner
  2014-03-26 21:33                                                 ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-25 21:52 UTC (permalink / raw)
  To: Vince Weaver; +Cc: linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Tue, 25 Mar 2014, Vince Weaver wrote:
> On Mon, 24 Mar 2014, Thomas Gleixner wrote:
> 
> > But I tooks some time to tweak the debugobjects patch further in the
> > hope that it helps.
> 
> I've attached the log from your most recent patch, it's pretty long.
> 
> I'll try without the trace_printk too.
> 
> It's been busy here which is why I've been slow to try the patch, but 
> hopefully I'll have some more time to look at things tomorrow.

Ok, that trace makes a lot more sense but it does us not yet give
enough data as the delayed work function is assigned after the
debugobjects init call. Patch for that below.

What's clear from the trace is that the timer gets set up correctly,
but when it expires the function callback is NULL. The timer is set up
to 1sec and it's started in the context of init-1. So it's probably an
initcall of some sort.

Do you happen to have the full bootlog up to the crash handy?

Adding "initcall_debug" to the kernel command line might give us a
further hint what is adding the delayed_work, but I guess once we see
the function in the trace we have a good chance to find the place
right away. Then we have to figure out what scribbles over it .....

Thanks,

	tglx


Index: linux-2.6/include/linux/workqueue.h
===================================================================
--- linux-2.6.orig/include/linux/workqueue.h
+++ linux-2.6/include/linux/workqueue.h
@@ -213,19 +213,19 @@ static inline unsigned int work_static(s
 	do {								\
 		static struct lock_class_key __key;			\
 									\
+		PREPARE_WORK((_work), (_func));				\
 		__init_work((_work), _onstack);				\
 		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
 		lockdep_init_map(&(_work)->lockdep_map, #_work, &__key, 0); \
 		INIT_LIST_HEAD(&(_work)->entry);			\
-		PREPARE_WORK((_work), (_func));				\
 	} while (0)
 #else
 #define __INIT_WORK(_work, _func, _onstack)				\
 	do {								\
+		PREPARE_WORK((_work), (_func));				\
 		__init_work((_work), _onstack);				\
 		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
 		INIT_LIST_HEAD(&(_work)->entry);			\
-		PREPARE_WORK((_work), (_func));				\
 	} while (0)
 #endif
 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-25 21:52                                               ` Thomas Gleixner
@ 2014-03-26 21:33                                                 ` Vince Weaver
  2014-03-26 22:00                                                   ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-26 21:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, linux-kernel, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 806 bytes --]

On Tue, 25 Mar 2014, Thomas Gleixner wrote:
> debugobjects init call. Patch for that below.
> 
> What's clear from the trace is that the timer gets set up correctly,
> but when it expires the function callback is NULL. The timer is set up
> to 1sec and it's started in the context of init-1. So it's probably an
> initcall of some sort.
> 
> Do you happen to have the full bootlog up to the crash handy?

attached is the full bootlog followed by the tracedump.

> Adding "initcall_debug" to the kernel command line might give us a
> further hint what is adding the delayed_work, but I guess once we see
> the function in the trace we have a good chance to find the place
> right away. Then we have to figure out what scribbles over it .....

I haven't tried the initcall_debug yet but can if needed.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 97724 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-26 21:33                                                 ` Vince Weaver
@ 2014-03-26 22:00                                                   ` Thomas Gleixner
  2014-03-27 13:41                                                     ` Vince Weaver
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-26 22:00 UTC (permalink / raw)
  To: Vince Weaver; +Cc: LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2108 bytes --]

On Wed, 26 Mar 2014, Vince Weaver wrote:
> On Tue, 25 Mar 2014, Thomas Gleixner wrote:
> > Do you happen to have the full bootlog up to the crash handy?
> 
> attached is the full bootlog followed by the tracedump.

Ok. So we know now what we are looking for.

[    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
[    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
[    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A

[    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000

1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90

So that happens in the context of the 8250 serial driver.
 
> > Adding "initcall_debug" to the kernel command line might give us a
> > further hint what is adding the delayed_work, but I guess once we see
> > the function in the trace we have a good chance to find the place
> > right away. Then we have to figure out what scribbles over it .....
> 
> I haven't tried the initcall_debug yet but can if needed.

Below is a patch which gives us the call path of the unnamed object
which causes the crash.

Greg???

Thanks,

	tglx

---
 lib/kobject.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/lib/kobject.c
===================================================================
--- linux-2.6.orig/lib/kobject.c
+++ linux-2.6/lib/kobject.c
@@ -645,6 +645,9 @@ static void kobject_release(struct kref
 	struct kobject *kobj = container_of(kref, struct kobject, kref);
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
 	unsigned long delay = HZ + HZ * (get_random_int() & 0x3);
+
+	WARN_ON(!kobject_name(kobj));
+
 	pr_info("kobject: '%s' (%p): %s, parent %p (delayed %ld)\n",
 		 kobject_name(kobj), kobj, __func__, kobj->parent, delay);
 	INIT_DELAYED_WORK(&kobj->release, kobject_delayed_cleanup);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-26 22:00                                                   ` Thomas Gleixner
@ 2014-03-27 13:41                                                     ` Vince Weaver
  2014-03-31 11:18                                                       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-27 13:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH

[-- Attachment #1: Type: TEXT/PLAIN, Size: 990 bytes --]

On Wed, 26 Mar 2014, Thomas Gleixner wrote:

> Ok. So we know now what we are looking for.
> 
> [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> 
> [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> 
> 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> 
> So that happens in the context of the 8250 serial driver.
> 
> ...
> 
> Below is a patch which gives us the call path of the unnamed object
> which causes the crash.

I've attached the boot log with that patch applied.

Vince

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 55185 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-27 13:41                                                     ` Vince Weaver
@ 2014-03-31 11:18                                                       ` Thomas Gleixner
  2014-03-31 11:46                                                         ` Ingo Molnar
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-31 11:18 UTC (permalink / raw)
  To: Vince Weaver; +Cc: LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1700 bytes --]

On Thu, 27 Mar 2014, Vince Weaver wrote:
> On Wed, 26 Mar 2014, Thomas Gleixner wrote:
> > Ok. So we know now what we are looking for.
> > 
> > [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> > [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> > [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> > 
> > [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> > 
> > 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> > 
> > So that happens in the context of the 8250 serial driver.
> > 
> > ...
> > 
> > Below is a patch which gives us the call path of the unnamed object
> > which causes the crash.
> 
> I've attached the boot log with that patch applied.

Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
all the debug patches to see whether the issue goes away?

I had a deeper look down that code path and the issue is, that the
serial core is not compatible with the deferred kobject release.

The tty_io layer uses a kobject embedded in its internal tty device
representation and reuses that.

So it seems that for whatever reason the tty layer releases ttyS1 and
then initializes it again. So the deferred release will queue the
object for release while the tty layer happily reinitializes it.

See tty_register_device() and tty_unregister.

Greg?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-31 11:18                                                       ` Thomas Gleixner
@ 2014-03-31 11:46                                                         ` Ingo Molnar
  2014-03-31 13:30                                                         ` Vince Weaver
  2014-04-06  3:47                                                         ` Greg KH
  2 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2014-03-31 11:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 27 Mar 2014, Vince Weaver wrote:
> > On Wed, 26 Mar 2014, Thomas Gleixner wrote:
> > > Ok. So we know now what we are looking for.
> > > 
> > > [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> > > [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> > > [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> > > 
> > > [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> > > 
> > > 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> > > 
> > > So that happens in the context of the 8250 serial driver.
> > > 
> > > ...
> > > 
> > > Below is a patch which gives us the call path of the unnamed object
> > > which causes the crash.
> > 
> > I've attached the boot log with that patch applied.
> 
> Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> all the debug patches to see whether the issue goes away?
> 
> I had a deeper look down that code path and the issue is, that the
> serial core is not compatible with the deferred kobject release.
> 
> The tty_io layer uses a kobject embedded in its internal tty device
> representation and reuses that.
> 
> So it seems that for whatever reason the tty layer releases ttyS1 and
> then initializes it again. So the deferred release will queue the
> object for release while the tty layer happily reinitializes it.
> 
> See tty_register_device() and tty_unregister.
> 
> Greg?

Hm, this reminds me of the following randconfig testing patch I've 
been carrying for some time:

================>
$ cat patches/qa-dont-boot-DEBUG_KOBJECT_RELEASE.patch
Subject: qa: dont boot DEBUG_KOBJECT_RELEASE
From: Ingo Molnar <mingo@kernel.org>
Date: Mon Oct 7 16:15:46 CEST 2013

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 lib/Kconfig.debug |    3 +++
 1 file changed, 3 insertions(+)

Index: linux2/lib/Kconfig.debug
===================================================================
--- linux2.orig/lib/Kconfig.debug
+++ linux2/lib/Kconfig.debug
@@ -1025,6 +1025,9 @@ config DEBUG_KOBJECT
 config DEBUG_KOBJECT_RELEASE
        bool "kobject release debugging"
        depends on DEBUG_OBJECTS_TIMERS
+       # Frequent, hard to debug crashes:
+       depends on BROKEN_BOOT_ALLOWED
+       select BROKEN_BOOT
        help
          kobjects are reference counted objects.  This means that their
          last reference count put is not predictable, and the kobject can

I never fully tracked it down and forgot about it.

Thanks,

		Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-31 11:18                                                       ` Thomas Gleixner
  2014-03-31 11:46                                                         ` Ingo Molnar
@ 2014-03-31 13:30                                                         ` Vince Weaver
  2014-03-31 13:48                                                           ` Thomas Gleixner
  2014-04-06  3:47                                                         ` Greg KH
  2 siblings, 1 reply; 37+ messages in thread
From: Vince Weaver @ 2014-03-31 13:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH

On Mon, 31 Mar 2014, Thomas Gleixner wrote:
> 
> Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> all the debug patches to see whether the issue goes away?

It makes the crash on boot go away, but it is still possible to trigger
a similar crash (hrtimer oopsing due to corrupt rbtree) using the 
perf_fuzzer.

This all started because when trying to debug the perf_fuzzer crash I 
was told to turn on the extra debugging, and then that caused the crash at 
boot.

The fuzzer related crash happens on both a core2 and a haswell machine
and it triggers pretty quickly when fuzzing, I'll have to go back and see 
how repeatable it is.

Vince

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-31 13:30                                                         ` Vince Weaver
@ 2014-03-31 13:48                                                           ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2014-03-31 13:48 UTC (permalink / raw)
  To: Vince Weaver; +Cc: LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar, Greg KH

On Mon, 31 Mar 2014, Vince Weaver wrote:

> On Mon, 31 Mar 2014, Thomas Gleixner wrote:
> > 
> > Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> > all the debug patches to see whether the issue goes away?
> 
> It makes the crash on boot go away, but it is still possible to trigger

Ok. So we can leave that issue to the tty folks.

> a similar crash (hrtimer oopsing due to corrupt rbtree) using the 
> perf_fuzzer.

I did not expect that it goes away.
 
> This all started because when trying to debug the perf_fuzzer crash I 
> was told to turn on the extra debugging, and then that caused the crash at 
> boot.

I know. Thanks for the patience in decoding this nastiness.

> The fuzzer related crash happens on both a core2 and a haswell machine
> and it triggers pretty quickly when fuzzing, I'll have to go back and see 
> how repeatable it is.

Can you please reenable CONFIG_DEBUG_OBJECTS and check whether they
catch something?

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-03-31 11:18                                                       ` Thomas Gleixner
  2014-03-31 11:46                                                         ` Ingo Molnar
  2014-03-31 13:30                                                         ` Vince Weaver
@ 2014-04-06  3:47                                                         ` Greg KH
  2014-04-16 23:00                                                           ` Thomas Gleixner
  2 siblings, 1 reply; 37+ messages in thread
From: Greg KH @ 2014-04-06  3:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Mon, Mar 31, 2014 at 01:18:34PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Mar 2014, Vince Weaver wrote:
> > On Wed, 26 Mar 2014, Thomas Gleixner wrote:
> > > Ok. So we know now what we are looking for.
> > > 
> > > [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> > > [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> > > [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> > > 
> > > [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> > > 
> > > 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> > > 
> > > So that happens in the context of the 8250 serial driver.
> > > 
> > > ...
> > > 
> > > Below is a patch which gives us the call path of the unnamed object
> > > which causes the crash.
> > 
> > I've attached the boot log with that patch applied.
> 
> Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> all the debug patches to see whether the issue goes away?
> 
> I had a deeper look down that code path and the issue is, that the
> serial core is not compatible with the deferred kobject release.
> 
> The tty_io layer uses a kobject embedded in its internal tty device
> representation and reuses that.

It does?  What kobject is that?  I've dug through the code and I can't
find it.  I see where we create a new device in
tty_register_device_attr() which is dynamic and should be torn down when
free_tty_struct() is called eventually.

> So it seems that for whatever reason the tty layer releases ttyS1 and
> then initializes it again. So the deferred release will queue the
> object for release while the tty layer happily reinitializes it.

That's not good, but I can't find that code path, any hints?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-04-06  3:47                                                         ` Greg KH
@ 2014-04-16 23:00                                                           ` Thomas Gleixner
  2014-04-17  2:38                                                             ` Greg KH
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-04-16 23:00 UTC (permalink / raw)
  To: Greg KH; +Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2560 bytes --]

On Sat, 5 Apr 2014, Greg KH wrote:
> On Mon, Mar 31, 2014 at 01:18:34PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Mar 2014, Vince Weaver wrote:
> > > On Wed, 26 Mar 2014, Thomas Gleixner wrote:
> > > > Ok. So we know now what we are looking for.
> > > > 
> > > > [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > > ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > > [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> > > > [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> > > > [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> > > > 
> > > > [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> > > > 
> > > > 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> > > > 
> > > > So that happens in the context of the 8250 serial driver.
> > > > 
> > > > ...
> > > > 
> > > > Below is a patch which gives us the call path of the unnamed object
> > > > which causes the crash.
> > > 
> > > I've attached the boot log with that patch applied.
> > 
> > Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> > all the debug patches to see whether the issue goes away?
> > 
> > I had a deeper look down that code path and the issue is, that the
> > serial core is not compatible with the deferred kobject release.
> > 
> > The tty_io layer uses a kobject embedded in its internal tty device
> > representation and reuses that.
> 
> It does?  What kobject is that?  I've dug through the code and I can't
> find it.  I see where we create a new device in
> tty_register_device_attr() which is dynamic and should be torn down when
> free_tty_struct() is called eventually.

It's not about the dynamic stuff.
 
> > So it seems that for whatever reason the tty layer releases ttyS1 and
> > then initializes it again. So the deferred release will queue the
> > object for release while the tty layer happily reinitializes it.
> 
> That's not good, but I can't find that code path, any hints?

static int tty_cdev_add(struct tty_driver *driver, dev_t dev,
		unsigned int index, unsigned int count)
{
	/* init here, since reused cdevs cause crashes */
	cdev_init(&driver->cdevs[index], &tty_fops);

The comment is interesting ...

And cdevs is an array of  struct cdev:

struct cdev {
	struct kobject kobj;

Hope that helps.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-04-16 23:00                                                           ` Thomas Gleixner
@ 2014-04-17  2:38                                                             ` Greg KH
  2014-04-17  7:59                                                               ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Greg KH @ 2014-04-17  2:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Thu, Apr 17, 2014 at 01:00:53AM +0200, Thomas Gleixner wrote:
> On Sat, 5 Apr 2014, Greg KH wrote:
> > On Mon, Mar 31, 2014 at 01:18:34PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Mar 2014, Vince Weaver wrote:
> > > > On Wed, 26 Mar 2014, Thomas Gleixner wrote:
> > > > > Ok. So we know now what we are looking for.
> > > > > 
> > > > > [    1.579996] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > > > ÿ[    1.607279] 00:09: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > > > [    1.615032] kobject: 'ttyS1' (ffff88011772ac10): kobject_release, parent           (null) (delayed 250)
> > > > > [    1.624534] kobject: '(null)' (ffff8801177400f0): kobject_release, parent           (null) (delayed 500)
> > > > > [    1.654213] 0000:00:16.3: ttyS1 at I/O 0xf0e0 (irq = 19, base_baud = 115200) is a 16550A
> > > > > 
> > > > > [    3.294047] Invalid timer base: tmr ffff880117740150 tmr->base           (null) base ffff880118898000
> > > > > 
> > > > > 1634110us : obj: ffff880117740130 initialized kobject_delayed_cleanup+0x0/0x90
> > > > > 
> > > > > So that happens in the context of the 8250 serial driver.
> > > > > 
> > > > > ...
> > > > > 
> > > > > Below is a patch which gives us the call path of the unnamed object
> > > > > which causes the crash.
> > > > 
> > > > I've attached the boot log with that patch applied.
> > > 
> > > Vince, can you please disable CONFIG_DEBUG_KOBJECT_RELEASE and remove
> > > all the debug patches to see whether the issue goes away?
> > > 
> > > I had a deeper look down that code path and the issue is, that the
> > > serial core is not compatible with the deferred kobject release.
> > > 
> > > The tty_io layer uses a kobject embedded in its internal tty device
> > > representation and reuses that.
> > 
> > It does?  What kobject is that?  I've dug through the code and I can't
> > find it.  I see where we create a new device in
> > tty_register_device_attr() which is dynamic and should be torn down when
> > free_tty_struct() is called eventually.
> 
> It's not about the dynamic stuff.
>  
> > > So it seems that for whatever reason the tty layer releases ttyS1 and
> > > then initializes it again. So the deferred release will queue the
> > > object for release while the tty layer happily reinitializes it.
> > 
> > That's not good, but I can't find that code path, any hints?
> 
> static int tty_cdev_add(struct tty_driver *driver, dev_t dev,
> 		unsigned int index, unsigned int count)
> {
> 	/* init here, since reused cdevs cause crashes */
> 	cdev_init(&driver->cdevs[index], &tty_fops);
> 
> The comment is interesting ...
> 
> And cdevs is an array of  struct cdev:
> 
> struct cdev {
> 	struct kobject kobj;

Those are not "real" kobjects, and are never registered with the kobject
core.

I really need to go rename those one of these days, and just make them a
separate object, as they have nothing to do with a "normal" kobject
other than the reference count and the use of the kobject map stuff.

So if this is showing up as a problem, something else is going on here,
as this should not be an issue at all.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-04-17  2:38                                                             ` Greg KH
@ 2014-04-17  7:59                                                               ` Thomas Gleixner
  2014-04-24 19:37                                                                 ` Greg KH
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2014-04-17  7:59 UTC (permalink / raw)
  To: Greg KH; +Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Wed, 16 Apr 2014, Greg KH wrote:
> On Thu, Apr 17, 2014 at 01:00:53AM +0200, Thomas Gleixner wrote:
> > And cdevs is an array of  struct cdev:
> > 
> > struct cdev {
> > 	struct kobject kobj;
> 
> Those are not "real" kobjects, and are never registered with the kobject
> core.
> 
> I really need to go rename those one of these days, and just make them a
> separate object, as they have nothing to do with a "normal" kobject
> other than the reference count and the use of the kobject map stuff.
> 
> So if this is showing up as a problem, something else is going on here,
> as this should not be an issue at all.

As far as I decoded it from Vince traces the issue is only with
DEBUG_KOBJECT_RELEASE=y. That deferres the release to a workqueue.

Now the cdevs in tty are allocated ahead and not freed on
release. They stay in the cdevs[] array of the tty drivers.

Now if a cdev is released the kobject debug stuff schedules the
delayed work, but the tty core can reuse the cdev.before the delayed
work has been executed and first thing it does is calling cdev_init()
which does a

     memset(cdev, 0, sizeof *cdev);

which of course includes the enqueued timer and the scheduled work. Go
figure how well that goes.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: rb tree hrtimer lockup bug (found by perf_fuzzer)
  2014-04-17  7:59                                                               ` Thomas Gleixner
@ 2014-04-24 19:37                                                                 ` Greg KH
  0 siblings, 0 replies; 37+ messages in thread
From: Greg KH @ 2014-04-24 19:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vince Weaver, LKML, H. Peter Anvin, Peter Zijlstra, Ingo Molnar

On Thu, Apr 17, 2014 at 09:59:01AM +0200, Thomas Gleixner wrote:
> On Wed, 16 Apr 2014, Greg KH wrote:
> > On Thu, Apr 17, 2014 at 01:00:53AM +0200, Thomas Gleixner wrote:
> > > And cdevs is an array of  struct cdev:
> > > 
> > > struct cdev {
> > > 	struct kobject kobj;
> > 
> > Those are not "real" kobjects, and are never registered with the kobject
> > core.
> > 
> > I really need to go rename those one of these days, and just make them a
> > separate object, as they have nothing to do with a "normal" kobject
> > other than the reference count and the use of the kobject map stuff.
> > 
> > So if this is showing up as a problem, something else is going on here,
> > as this should not be an issue at all.
> 
> As far as I decoded it from Vince traces the issue is only with
> DEBUG_KOBJECT_RELEASE=y. That deferres the release to a workqueue.
> 
> Now the cdevs in tty are allocated ahead and not freed on
> release. They stay in the cdevs[] array of the tty drivers.
> 
> Now if a cdev is released the kobject debug stuff schedules the
> delayed work, but the tty core can reuse the cdev.before the delayed
> work has been executed and first thing it does is calling cdev_init()
> which does a
> 
>      memset(cdev, 0, sizeof *cdev);
> 
> which of course includes the enqueued timer and the scheduled work. Go
> figure how well that goes.

Ugh, the kobject isn't "real", but it seems like it has changed over
time to be more real than I imagined.

I'll work to rip the kobject code out of the cdev soon, which should
resolve this issue, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-04-24 19:35 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-03 20:34 perf_fuzzer: lockup/reboot bug Vince Weaver
2014-03-04 21:32 ` Vince Weaver
2014-03-18 16:56   ` rb tree hrtimer lockup bug (found by perf_fuzzer) Vince Weaver
2014-03-18 18:21     ` Thomas Gleixner
2014-03-18 19:25       ` Vince Weaver
2014-03-18 20:52         ` Thomas Gleixner
2014-03-18 21:10           ` Vince Weaver
2014-03-18 21:45             ` Thomas Gleixner
2014-03-19 13:46               ` Vince Weaver
2014-03-19 13:58                 ` Thomas Gleixner
2014-03-19 14:42                   ` Vince Weaver
2014-03-19 15:05                     ` Thomas Gleixner
2014-03-19 17:04                       ` Vince Weaver
2014-03-20 10:47                         ` Thomas Gleixner
2014-03-20 14:47                           ` Vince Weaver
2014-03-20 15:12                             ` Thomas Gleixner
2014-03-20 21:25                               ` Vince Weaver
2014-03-21  9:02                                 ` Thomas Gleixner
2014-03-21 20:11                                   ` Vince Weaver
2014-03-22 10:24                                     ` Thomas Gleixner
2014-03-22 20:22                                       ` Thomas Gleixner
2014-03-23 15:14                                         ` Thomas Gleixner
2014-03-23 23:25                                           ` Thomas Gleixner
2014-03-25 21:06                                             ` Vince Weaver
2014-03-25 21:52                                               ` Thomas Gleixner
2014-03-26 21:33                                                 ` Vince Weaver
2014-03-26 22:00                                                   ` Thomas Gleixner
2014-03-27 13:41                                                     ` Vince Weaver
2014-03-31 11:18                                                       ` Thomas Gleixner
2014-03-31 11:46                                                         ` Ingo Molnar
2014-03-31 13:30                                                         ` Vince Weaver
2014-03-31 13:48                                                           ` Thomas Gleixner
2014-04-06  3:47                                                         ` Greg KH
2014-04-16 23:00                                                           ` Thomas Gleixner
2014-04-17  2:38                                                             ` Greg KH
2014-04-17  7:59                                                               ` Thomas Gleixner
2014-04-24 19:37                                                                 ` Greg KH

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.