linux-kernel.vger.kernel.org archive mirror
* perf: another perf_fuzzer generated lockup
@ 2015-05-08  4:40 Vince Weaver
  2015-05-09  0:04 ` Stephane Eranian
  2015-05-27 12:52 ` Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Vince Weaver @ 2015-05-08  4:40 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Stephane Eranian, Arnaldo Carvalho de Melo,
	Jiri Olsa, Ingo Molnar, Paul Mackerras


This is a new one, I think; I hit it on the Haswell machine running 
4.1-rc2.

The backtrace is complex enough that I'm not really sure what's going on 
here.

The fuzzer has been having weird issues where it receives overflow 
signals for invalid fds.  This seems to happen when an overflow signal 
interrupts the fuzzer mid-fork.  The fuzzer code doesn't handle this 
well: it attempts to call exit() and/or kill the child from the signal 
handler that interrupted the fork(), and that doesn't always go well.  
I'm not sure if this is related, just that some of those actions appear 
in the backtrace.


[33864.529861] ------------[ cut here ]------------
[33864.534824] WARNING: CPU: 1 PID: 9852 at kernel/watchdog.c:302 watchdog_overflow_callback+0x92/0xc0()
[33864.544682] Watchdog detected hard LOCKUP on cpu 1
[33864.549635] Modules linked in:
[33864.552943]  fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 lrw gf128mul evdev i915 iTCO_wdt iTCO_vendor_support glue_helper snd_timer ppdev ablk_helper psmouse drm_kms_helper cryptd snd drm pcspkr serio_raw lpc_ich soundcore parport_pc xhci_pci battery video processor i2c_i801 mei_me mei wmi i2c_algo_bit tpm_tis mfd_core xhci_hcd tpm parport button sg sr_mod sd_mod cdrom ehci_pci ahci ehci_hcd libahci libata e1000e ptp usbcore crc32c_intel scsi_mod fan usb_common pps_core thermal thermal_sys
[33864.622413] CPU: 1 PID: 9852 Comm: perf_fuzzer Not tainted 4.1.0-rc2+ #142
[33864.629776] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[33864.637685]  ffffffff81a209b5 ffff88011ea45aa0 ffffffff816d51d3 0000000000000000
[33864.645709]  ffff88011ea45af0 ffff88011ea45ae0 ffffffff81072dfa ffff88011ea45ac0
[33864.653731]  ffff880119b8f800 0000000000000000 ffff88011ea45c40 ffff88011ea45ef8
[33864.661783] Call Trace:
[33864.664409]  <NMI>  [<ffffffff816d51d3>] dump_stack+0x45/0x57
[33864.670618]  [<ffffffff81072dfa>] warn_slowpath_common+0x8a/0xc0
[33864.677071]  [<ffffffff81072e76>] warn_slowpath_fmt+0x46/0x50
[33864.683202]  [<ffffffff8104db64>] ? native_apic_wait_icr_idle+0x24/0x30
[33864.690280]  [<ffffffff81124f22>] watchdog_overflow_callback+0x92/0xc0
[33864.697294]  [<ffffffff811673b1>] __perf_event_overflow+0x91/0x270
[33864.703916]  [<ffffffff811673f9>] ? __perf_event_overflow+0xd9/0x270
[33864.710696]  [<ffffffff8102b60a>] ? x86_perf_event_set_period+0xda/0x180
[33864.717842]  [<ffffffff81167f89>] perf_event_overflow+0x19/0x20
[33864.724195]  [<ffffffff810338f2>] intel_pmu_handle_irq+0x1e2/0x450
[33864.730840]  [<ffffffff8102a41b>] perf_event_nmi_handler+0x2b/0x50
[33864.737436]  [<ffffffff810185d0>] nmi_handle+0xa0/0x150
[33864.743025]  [<ffffffff81018535>] ? nmi_handle+0x5/0x150
[33864.748733]  [<ffffffff8101887a>] default_do_nmi+0x4a/0x140
[33864.754705]  [<ffffffff81018a08>] do_nmi+0x98/0xe0
[33864.759858]  [<ffffffff816df42f>] end_repeat_nmi+0x1e/0x2e
[33864.765746]  [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.772004]  [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.778253]  [<ffffffff810bc1bb>] ? check_chain_key+0xdb/0x1e0
[33864.784498]  <<EOE>>  <IRQ>  [<ffffffff810bdd09>] __lock_acquire.isra.31+0x3b9/0x1000
[33864.792950]  [<ffffffff810bdd09>] ? __lock_acquire.isra.31+0x3b9/0x1000
[33864.800045]  [<ffffffff810bf2e5>] lock_acquire+0xa5/0x130
[33864.805817]  [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
[33864.812468]  [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
[33864.819084]  [<ffffffff816dc721>] _raw_spin_lock+0x31/0x40
[33864.824979]  [<ffffffff81080afe>] ? __lock_task_sighand+0x6e/0x110
[33864.831623]  [<ffffffff81080afe>] __lock_task_sighand+0x6e/0x110
[33864.838096]  [<ffffffff81080aaa>] ? __lock_task_sighand+0x1a/0x110
[33864.845314]  [<ffffffff81080bcc>] do_send_sig_info+0x2c/0x80
[33864.851949]  [<ffffffff81167807>] ? perf_swevent_event+0x67/0x90
[33864.858980]  [<ffffffff8120092f>] send_sigio_to_task+0x12f/0x1a0
[33864.866005]  [<ffffffff81200805>] ? send_sigio_to_task+0x5/0x1a0
[33864.873047]  [<ffffffff812010c6>] ? send_sigio+0x56/0x100
[33864.879411]  [<ffffffff8120111e>] send_sigio+0xae/0x100
[33864.885564]  [<ffffffff81201207>] kill_fasync+0x97/0xf0
[33864.891713]  [<ffffffff8120117f>] ? kill_fasync+0xf/0xf0
[33864.897983]  [<ffffffff811667d4>] perf_event_wakeup+0xd4/0xf0
[33864.904662]  [<ffffffff81166705>] ? perf_event_wakeup+0x5/0xf0
[33864.911490]  [<ffffffff811668d0>] ? perf_pending_event+0xe0/0x110
[33864.918580]  [<ffffffff811668d0>] perf_pending_event+0xe0/0x110
[33864.925494]  [<ffffffff81156eac>] irq_work_run_list+0x4c/0x80
[33864.932197]  [<ffffffff81156ef8>] irq_work_run+0x18/0x40
[33864.938469]  [<ffffffff8101953f>] smp_trace_irq_work_interrupt+0x3f/0xc0
[33864.946263]  [<ffffffff816dea3e>] trace_irq_work_interrupt+0x6e/0x80
[33864.953646]  <EOI>  [<ffffffff811a1b47>] ? copy_page_range+0x527/0x9a0
[33864.961287]  [<ffffffff811a1b22>] ? copy_page_range+0x502/0x9a0
[33864.968265]  [<ffffffff810710d2>] copy_process.part.23+0xc92/0x1b80
[33864.975589]  [<ffffffff810813be>] ? SYSC_kill+0x8e/0x230
[33864.981879]  [<ffffffff81072178>] do_fork+0xd8/0x420
[33864.987807]  [<ffffffff812007e3>] ? f_setown+0x83/0xa0
[33864.993953]  [<ffffffff81200d30>] ? SyS_fcntl+0x310/0x650
[33865.000348]  [<ffffffff81417a23>] ? lockdep_sys_exit_thunk+0x12/0x14
[33865.007781]  [<ffffffff81072546>] SyS_clone+0x16/0x20
[33865.013830]  [<ffffffff816dd1b2>] system_call_fastpath+0x16/0x7a
[33865.020843] ---[ end trace d3bd7d73656f3cba ]---
[33865.026418] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 496.487 msecs
[33865.035874] perf interrupt took too long (3879951 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
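
(Editorial note: the last message means the kernel permanently lowered
the sampling-rate sysctl after the slow NMI handler.  A quick way to
inspect, and as root restore, the ceiling after a fuzzing run -- 100000
is the stock default:)

```shell
# Show the current ceiling the kernel throttled down to:
cat /proc/sys/kernel/perf_event_max_sample_rate 2>/dev/null || echo "not available"

# To restore the stock default of 100000 samples/sec (requires root):
#   sysctl -w kernel.perf_event_max_sample_rate=100000
```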


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: perf: another perf_fuzzer generated lockup
  2015-05-08  4:40 perf: another perf_fuzzer generated lockup Vince Weaver
@ 2015-05-09  0:04 ` Stephane Eranian
  2015-05-11 21:19   ` Vince Weaver
  2015-05-27 12:52 ` Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Stephane Eranian @ 2015-05-09  0:04 UTC (permalink / raw)
  To: Vince Weaver
  Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Ingo Molnar, Paul Mackerras

Vince,

On Thu, May 7, 2015 at 9:40 PM, Vince Weaver <vincent.weaver@maine.edu> wrote:
>
>
> This is a new one, I think; I hit it on the Haswell machine running
> 4.1-rc2.
>
> The backtrace is complex enough that I'm not really sure what's going on
> here.
>
> The fuzzer has been having weird issues where it receives overflow
> signals for invalid fds.  This seems to happen when an overflow signal
> interrupts the fuzzer mid-fork.  The fuzzer code doesn't handle this
> well: it attempts to call exit() and/or kill the child from the signal
> handler that interrupted the fork(), and that doesn't always go well.
> I'm not sure if this is related, just that some of those actions appear
> in the backtrace.
>
>
Is there a way to figure out how the fuzzer had programmed the PMU
to get there (besides adding a PMU state dump to the kernel crashdump)?




* Re: perf: another perf_fuzzer generated lockup
  2015-05-09  0:04 ` Stephane Eranian
@ 2015-05-11 21:19   ` Vince Weaver
  0 siblings, 0 replies; 5+ messages in thread
From: Vince Weaver @ 2015-05-11 21:19 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Vince Weaver, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Jiri Olsa, Ingo Molnar, Paul Mackerras

On Fri, 8 May 2015, Stephane Eranian wrote:

> Vince,
> 
> On Thu, May 7, 2015 at 9:40 PM, Vince Weaver <vincent.weaver@maine.edu> wrote:
> >
> >
> > This is a new one, I think; I hit it on the Haswell machine running
> > 4.1-rc2.
> >
> > The backtrace is complex enough that I'm not really sure what's going on
> > here.
> >
> > The fuzzer has been having weird issues where it receives overflow
> > signals for invalid fds.  This seems to happen when an overflow signal
> > interrupts the fuzzer mid-fork.  The fuzzer code doesn't handle this
> > well: it attempts to call exit() and/or kill the child from the signal
> > handler that interrupted the fork(), and that doesn't always go well.
> > I'm not sure if this is related, just that some of those actions appear
> > in the backtrace.
> >
> >
> Is there a way to figure out how the fuzzer had programmed the PMU
> to get there (besides adding a PMU state dump to the kernel crashdump)?

Not easily.  In theory the fuzzer can regenerate the state from its random 
seed, but some of these bugs seem to be timing-related or race conditions, 
so they don't always replicate.

I can also make the fuzzer dump its state, but often it has 100+ events 
active and there's no way of knowing which ones are currently scheduled 
onto the CPU.

Dumping the PMU state might help, but at the same time there are all the 
other events going on, such as software and tracepoint events, and they 
might all be contributing.

This particular bug almost replicates; the system definitely pauses for a 
bit even if not long enough to trigger the watchdog.

I've been meaning to work on it more, but we just finished with finals, so 
I'm stuck doing more or less nothing but grading this next week.  After 
that I should have some time to work on this issue plus a couple of other 
warnings the fuzzer has been showing.

Vince


* Re: perf: another perf_fuzzer generated lockup
  2015-05-08  4:40 perf: another perf_fuzzer generated lockup Vince Weaver
  2015-05-09  0:04 ` Stephane Eranian
@ 2015-05-27 12:52 ` Peter Zijlstra
  2015-05-27 13:58   ` Vince Weaver
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2015-05-27 12:52 UTC (permalink / raw)
  To: Vince Weaver
  Cc: LKML, Stephane Eranian, Arnaldo Carvalho de Melo, Jiri Olsa,
	Ingo Molnar, Paul Mackerras

On Fri, May 08, 2015 at 12:40:48AM -0400, Vince Weaver wrote:
> 
> This is a new one, I think; I hit it on the Haswell machine running
> 4.1-rc2.
> 

Vince, should I still be running fast_repro99.sh?  Or is there another
preferred way to start the beast?


* Re: perf: another perf_fuzzer generated lockup
  2015-05-27 12:52 ` Peter Zijlstra
@ 2015-05-27 13:58   ` Vince Weaver
  0 siblings, 0 replies; 5+ messages in thread
From: Vince Weaver @ 2015-05-27 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, LKML, Stephane Eranian, Arnaldo Carvalho de Melo,
	Jiri Olsa, Ingo Molnar, Paul Mackerras

On Wed, 27 May 2015, Peter Zijlstra wrote:

> On Fri, May 08, 2015 at 12:40:48AM -0400, Vince Weaver wrote:
> > 
> > This is a new one, I think; I hit it on the Haswell machine running
> > 4.1-rc2.
> > 
> 
> Vince, should I still be running fast_repro99.sh?  Or is there another
> preferred way to start the beast?

Yes, that's still probably the best way to start it.

I've recently been overhauling the fuzzer, including a massive rework of 
the mmap() code (to try to exercise the AUX codepath), so if it starts 
segfaulting on you, that's probably why.

Vince

