* perf: fuzzer gets CPU stuck in perf_callchain()
From: Vince Weaver @ 2015-02-19 16:54 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt


Another bug found by the perf_fuzzer().  I think this one is different
from the one I sent the other day; it looks like something is going
very wrong in perf_callchain().

This one is reasonably reproducible, so let me know if there's any extra
debugging I can add.  This is on a Haswell machine running git from a few
days ago.

There are two traces below; they were generated by the same random seed
across different reboots, and the triggers are slightly different (although
both look like they hit the same bug in the callchain code).

[ 7938.495396] ------------[ cut here ]------------
[ 7938.500373] WARNING: CPU: 2 PID: 5692 at kernel/watchdog.c:290 watchdog_overflow_callback+0x92/0xc0()
[ 7938.510337] Watchdog detected hard LOCKUP on cpu 2
[ 7938.515301] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_controller snd_hda_codec i915 aes_x86_64 lrw gf128mul ppdev iTCO_wdt xhci_pci snd_hwdep snd_pcm tpm_tis evdev glue_helper iTCO_vendor_support xhci_hcd snd_timer mei_me drm_kms_helper ablk_helper snd tpm mei drm psmouse serio_raw pcspkr parport_pc cryptd soundcore lpc_ich parport wmi mfd_core battery i2c_i801 i2c_algo_bit button processor video sg sr_mod sd_mod cdrom ahci libahci ehci_pci libata ehci_hcd e1000e ptp crc32c_intel scsi_mod usbcore usb_common pps_core fan thermal thermal_sys
[ 7938.586970] CPU: 2 PID: 5692 Comm: perf_fuzzer Not tainted 3.19.0+ #127
[ 7938.594053] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 7938.601990]  ffffffff81a41afe ffff88011ea85aa0 ffffffff816bfed5 0000000000000000
[ 7938.610056]  ffff88011ea85af0 ffff88011ea85ae0 ffffffff8106ddfa 0000000000000000
[ 7938.618102]  ffff8801195a1800 0000000000000000 ffff88011ea85c40 ffff88011ea85ef8
[ 7938.626105] Call Trace:
[ 7938.628735]  <NMI>  [<ffffffff816bfed5>] dump_stack+0x45/0x57
[ 7938.634940]  [<ffffffff8106ddfa>] warn_slowpath_common+0x8a/0xc0
[ 7938.641394]  [<ffffffff8106de76>] warn_slowpath_fmt+0x46/0x50
[ 7938.647566]  [<ffffffff8111cde2>] watchdog_overflow_callback+0x92/0xc0
[ 7938.654617]  [<ffffffff8115dbe1>] __perf_event_overflow+0x91/0x270
[ 7938.661233]  [<ffffffff8102b56a>] ? x86_perf_event_set_period+0xca/0x170
[ 7938.668419]  [<ffffffff8115e6c9>] perf_event_overflow+0x19/0x20
[ 7938.674764]  [<ffffffff81032a7a>] intel_pmu_handle_irq+0x1ba/0x3a0
[ 7938.681432]  [<ffffffff8102a45b>] perf_event_nmi_handler+0x2b/0x50
[ 7938.688055]  [<ffffffff81018760>] nmi_handle+0xa0/0x150
[ 7938.693681]  [<ffffffff810186c5>] ? nmi_handle+0x5/0x150
[ 7938.699405]  [<ffffffff81018a0a>] default_do_nmi+0x4a/0x140
[ 7938.705429]  [<ffffffff81018b98>] do_nmi+0x98/0xe0
[ 7938.710573]  [<ffffffff816ca3d1>] end_repeat_nmi+0x1e/0x2e
[ 7938.716470]  [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.722929]  [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.729386]  [<ffffffff8140b300>] ? copy_from_user_nmi+0x20/0x70
[ 7938.735862]  <<EOE>>  <IRQ>  [<ffffffff8102c210>] ? perf_callchain_user+0xd0/0x240
[ 7938.744067]  [<ffffffff81161508>] perf_callchain+0x1f8/0x220
[ 7938.750152]  [<ffffffff8115da21>] perf_prepare_sample+0x291/0x3c0
[ 7938.756700]  [<ffffffff8115dc5e>] __perf_event_overflow+0x10e/0x270
[ 7938.763412]  [<ffffffff8115dc29>] ? __perf_event_overflow+0xd9/0x270
[ 7938.770254]  [<ffffffff8115e114>] ? perf_tp_event+0xc4/0x210
[ 7938.776302]  [<ffffffff8115da7c>] ? perf_prepare_sample+0x2ec/0x3c0
[ 7938.783021]  [<ffffffff816ca018>] ? trace_page_fault+0x28/0x30
[ 7938.789319]  [<ffffffff8115df89>] perf_swevent_overflow+0xa9/0xe0
[ 7938.795845]  [<ffffffff8115e027>] perf_swevent_event+0x67/0x90
[ 7938.802139]  [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
[ 7938.808025]  [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[ 7938.814132]  [<ffffffff810b643f>] ? __lock_acquire.isra.31+0x3af/0xfe0
[ 7938.821124]  [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7938.827926]  [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[ 7938.834291]  [<ffffffff81079c05>] ? __sigqueue_alloc+0x5/0x150
[ 7938.840541]  [<ffffffff81079d34>] ? __sigqueue_alloc+0x134/0x150
[ 7938.847009]  [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7938.853835]  [<ffffffff8107ba4e>] ? __lock_task_sighand+0x6e/0x110
[ 7938.861174]  [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
[ 7938.867972]  [<ffffffff8107ba58>] ? __lock_task_sighand+0x78/0x110
[ 7938.875256]  [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[ 7938.882197]  [<ffffffff810b7f10>] lock_release+0x130/0x260
[ 7938.888754]  [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
[ 7938.896510]  [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
[ 7938.903240]  [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
[ 7938.910365]  [<ffffffff811f68b5>] ? send_sigio_to_task+0x5/0x1a0
[ 7938.917455]  [<ffffffff811f7176>] ? send_sigio+0x56/0x100
[ 7938.923894]  [<ffffffff811f71ce>] send_sigio+0xae/0x100
[ 7938.930132]  [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
[ 7938.936359]  [<ffffffff811f722f>] ? kill_fasync+0xf/0xf0
[ 7938.942707]  [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
[ 7938.949522]  [<ffffffff8115cfe5>] ? perf_event_wakeup+0x5/0xf0
[ 7938.956367]  [<ffffffff8115d103>] perf_pending_event+0x33/0x60
[ 7938.963199]  [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
[ 7938.969932]  [<ffffffff8114e448>] irq_work_run+0x18/0x40
[ 7938.976212]  [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
[ 7938.983957]  [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
[ 7938.991362]  <EOI>  [<ffffffff816c743b>] ? _raw_spin_unlock+0x2b/0x40
[ 7938.998883]  [<ffffffff810f0993>] ? generic_exec_single+0x133/0x190
[ 7939.006127]  [<ffffffff81156100>] ? task_clock_event_add+0x40/0x40
[ 7939.013297]  [<ffffffff810f0a57>] smp_call_function_single+0x67/0xa0
[ 7939.020568]  [<ffffffff81154d69>] task_function_call+0x49/0x60
[ 7939.027277]  [<ffffffff8115b400>] ? perf_event_sched_in+0xa0/0xa0
[ 7939.034270]  [<ffffffff81154e10>] _perf_event_enable+0x90/0xf0
[ 7939.040963]  [<ffffffff81154d80>] ? task_function_call+0x60/0x60
[ 7939.047863]  [<ffffffff81154baf>] perf_event_for_each_child+0x3f/0xb0
[ 7939.055259]  [<ffffffff8115c4ac>] perf_event_task_enable+0x7c/0xd0
[ 7939.062321]  [<ffffffff81082d4d>] SyS_prctl+0x26d/0x4d0
[ 7939.068359]  [<ffffffff816c7ead>] system_call_fastpath+0x16/0x1b
[ 7939.075261] ---[ end trace 8e72f55d0f5cbae4 ]---
[ 7939.080732] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 585.271 msecs
[ 7939.090016] perf interrupt took too long (4572449 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 7940.065563] INFO: rcu_sched detected stalls on CPUs/tasks: { 2} (detected by 3, t=5252 jiffies, g=10488, c=10487, q=196)
[ 7940.077820] Task dump for CPU 2:
[ 7940.081779] perf_fuzzer     R  running task        0  5692   2549 0x0000000c
[ 7940.089901]  00007ffff86e99b8 000000000000002b 0000000000000003 ffff8800a4103cd8
[ 7940.098442]  ffffffff810b643f ffff88011897a4d0 0000000000000003 ffff88011897a4d0
[ 7940.106955]  ffff8800ced65020 ffff8800a4103c98 ffffffff810b52b8 ffff8800a4103cb8
[ 7940.115473] Call Trace:
[ 7940.118593]  [<ffffffff810b643f>] ? __lock_acquire.isra.31+0x3af/0xfe0
[ 7940.126120]  [<ffffffff810b52b8>] ? __lock_is_held+0x58/0x80
[ 7940.132660]  [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[ 7940.139989]  [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[ 7940.146848]  [<ffffffff8115b485>] ? __perf_event_enable+0x85/0x2a0
[ 7940.153971]  [<ffffffff816c743b>] ? _raw_spin_unlock+0x2b/0x40
[ 7940.160738]  [<ffffffff810f0993>] ? generic_exec_single+0x133/0x190
[ 7940.167964]  [<ffffffff81156100>] ? task_clock_event_add+0x40/0x40
[ 7940.175029]  [<ffffffff810f0a57>] ? smp_call_function_single+0x67/0xa0
[ 7940.182518]  [<ffffffff81154d69>] ? task_function_call+0x49/0x60
[ 7940.189447]  [<ffffffff8115b400>] ? perf_event_sched_in+0xa0/0xa0
[ 7940.196485]  [<ffffffff81154e10>] ? _perf_event_enable+0x90/0xf0
[ 7940.203430]  [<ffffffff81154d80>] ? task_function_call+0x60/0x60
[ 7940.210386]  [<ffffffff81154baf>] ? perf_event_for_each_child+0x3f/0xb0
[ 7940.217945]  [<ffffffff8115c4ac>] ? perf_event_task_enable+0x7c/0xd0
[ 7940.225255]  [<ffffffff81082d4d>] ? SyS_prctl+0x26d/0x4d0
[ 7940.231557]  [<ffffffff816c7ead>] ? system_call_fastpath+0x16/0x1b
[ 7947.852948] perf interrupt took too long (4536733 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[ 7956.620503] perf interrupt took too long (4501296 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[ 7965.387462] perf interrupt took too long (4466134 > 38461), lowering kernel.perf_event_max_sample_rate to 3250


[145965.719358] Call Trace:
[145965.722845]  <NMI>  [<ffffffff816bfed5>] dump_stack+0x45/0x57
[145965.729907]  [<ffffffff8106ddfa>] warn_slowpath_common+0x8a/0xc0
[145965.737170]  [<ffffffff8106de76>] warn_slowpath_fmt+0x46/0x50
[145965.744103]  [<ffffffff8101db0a>] ? native_sched_clock+0x2a/0x90
[145965.751356]  [<ffffffff8111cde2>] watchdog_overflow_callback+0x92/0xc0
[145965.759125]  [<ffffffff8115dbe1>] __perf_event_overflow+0x91/0x270
[145965.766466]  [<ffffffff8102b56a>] ? x86_perf_event_set_period+0xca/0x170
[145965.774393]  [<ffffffff8115e6c9>] perf_event_overflow+0x19/0x20
[145965.781478]  [<ffffffff81032a7a>] intel_pmu_handle_irq+0x1ba/0x3a0
[145965.788793]  [<ffffffff816ca020>] ? trace_page_fault+0x30/0x30
[145965.795777]  [<ffffffff8102a45b>] perf_event_nmi_handler+0x2b/0x50
[145965.803049]  [<ffffffff81018760>] nmi_handle+0xa0/0x150
[145965.809335]  [<ffffffff810186c5>] ? nmi_handle+0x5/0x150
[145965.815730]  [<ffffffff81018a95>] default_do_nmi+0xd5/0x140
[145965.822428]  [<ffffffff81018b98>] do_nmi+0x98/0xe0
[145965.828254]  [<ffffffff816ca3d1>] end_repeat_nmi+0x1e/0x2e
[145965.834803]  [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.841912]  [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.849073]  [<ffffffff816c9ff0>] ? general_protection+0x30/0x30
[145965.856187]  <<EOE>>  <IRQ>  [<ffffffff8140b46b>] ? copy_user_handle_tail+0x3b/0x80
[145965.865154]  [<ffffffff8140b343>] copy_from_user_nmi+0x63/0x70
[145965.872114]  [<ffffffff8102c210>] perf_callchain_user+0xd0/0x240
[145965.879227]  [<ffffffff81161508>] perf_callchain+0x1f8/0x220
[145965.885923]  [<ffffffff8115da21>] perf_prepare_sample+0x291/0x3c0
[145965.893151]  [<ffffffff8115dc5e>] __perf_event_overflow+0x10e/0x270
[145965.900562]  [<ffffffff8115dc29>] ? __perf_event_overflow+0xd9/0x270
[145965.908078]  [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[145965.914842]  [<ffffffff8115df89>] perf_swevent_overflow+0xa9/0xe0
[145965.922072]  [<ffffffff8115e027>] perf_swevent_event+0x67/0x90
[145965.929025]  [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
[145965.935615]  [<ffffffff810b7f10>] ? lock_release+0x130/0x260
[145965.942395]  [<ffffffff8107ba4e>] ? __lock_task_sighand+0x6e/0x110
[145965.949648]  [<ffffffff810b4f8a>] ? perf_trace_lock+0x12a/0x160
[145965.956671]  [<ffffffff8107ba58>] ? __lock_task_sighand+0x78/0x110
[145965.963919]  [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[145965.970858]  [<ffffffff8107bb4d>] ? do_send_sig_info+0x5d/0x80
[145965.977849]  [<ffffffff811406d8>] ? perf_trace_buf_prepare+0x98/0xf0
[145965.985330]  [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
[145965.992195]  [<ffffffff811f7176>] ? send_sigio+0x56/0x100
[145965.998659]  [<ffffffff811f71f0>] ? send_sigio+0xd0/0x100
[145966.005169]  [<ffffffff810b7f10>] lock_release+0x130/0x260
[145966.011716]  [<ffffffff816c76e3>] _raw_read_unlock+0x23/0x40
[145966.018442]  [<ffffffff811f71f0>] send_sigio+0xd0/0x100
[145966.024700]  [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
[145966.030968]  [<ffffffff811f722f>] ? kill_fasync+0xf/0xf0
[145966.037365]  [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
[145966.044210]  [<ffffffff8115cfe5>] ? perf_event_wakeup+0x5/0xf0
[145966.051060]  [<ffffffff8115d103>] perf_pending_event+0x33/0x60
[145966.057950]  [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
[145966.064798]  [<ffffffff8114e448>] irq_work_run+0x18/0x40
[145966.071127]  [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
[145966.078929]  [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
[145966.086359]  <EOI>  [<ffffffff816c74c2>] ? _raw_spin_unlock_irq+0x32/0x40
[145966.094337]  [<ffffffff816c74bb>] ? _raw_spin_unlock_irq+0x2b/0x40
[145966.101634]  [<ffffffff81092f7c>] finish_task_switch+0x8c/0x140
[145966.108630]  [<ffffffff81092f3d>] ? finish_task_switch+0x4d/0x140
[145966.115793]  [<ffffffff816c1c06>] __schedule+0x2d6/0x9f0
[145966.122115]  [<ffffffff810b4e89>] ? perf_trace_lock+0x29/0x160
[145966.128909]  [<ffffffff816c2357>] schedule+0x37/0x90
[145966.134802]  [<ffffffff816c67ac>] schedule_timeout+0x1bc/0x250
[145966.141578]  [<ffffffff816c37f8>] ? wait_for_completion+0x28/0x100
[145966.148660]  [<ffffffff816c3894>] ? wait_for_completion+0xc4/0x100
[145966.155795]  [<ffffffff816c389c>] wait_for_completion+0xcc/0x100
[145966.162731]  [<ffffffff8109ae80>] ? wake_up_state+0x20/0x20
[145966.169205]  [<ffffffff810d4830>] ? call_rcu_bh+0x20/0x20
[145966.175464]  [<ffffffff810d12fd>] wait_rcu_gp+0x5d/0x80
[145966.181586]  [<ffffffff810d1240>] ? ftrace_raw_output_rcu_utilization+0x70/0x70
[145966.189890]  [<ffffffff810d352a>] synchronize_sched+0x3a/0x50
[145966.196573]  [<ffffffff8114083b>] perf_trace_event_unreg.isra.1+0x3b/0x90
[145966.204345]  [<ffffffff81140bac>] perf_trace_destroy+0x3c/0x50
[145966.211075]  [<ffffffff81155aee>] tp_perf_event_destroy+0xe/0x10
[145966.217976]  [<ffffffff8115721c>] __free_event+0x2c/0x80
[145966.224102]  [<ffffffff8115f5f9>] SYSC_perf_event_open+0x629/0xc80
[145966.231246]  [<ffffffff8116009e>] SyS_perf_event_open+0xe/0x10
[145966.237982]  [<ffffffff816c7ead>] system_call_fastpath+0x16/0x1b
[145966.244932] ---[ end trace 7088940f25646d60 ]---


* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Peter Zijlstra @ 2015-02-19 16:57 UTC
  To: Vince Weaver
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
> 
> Another bug found by the perf_fuzzer().  I think this one is different
> from the one I sent the other day; it looks like something is going
> very wrong in perf_callchain().
> 
> This one is reasonably reproducible, so let me know if there's any extra
> debugging I can add.  This is on a Haswell machine running git from a few
> days ago.

Aw man, and I had such high hopes when you said it took 4 days to find
problems :/

I'll try and get some time to stare at both.


* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Peter Zijlstra @ 2015-02-19 17:03 UTC
  To: Vince Weaver
  Cc: linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:

> [ 7938.802139]  [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
> [ 7938.861174]  [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
> [ 7938.882197]  [<ffffffff810b7f10>] lock_release+0x130/0x260
> [ 7938.888754]  [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
> [ 7938.896510]  [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
> [ 7938.903240]  [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
> [ 7938.923894]  [<ffffffff811f71ce>] send_sigio+0xae/0x100
> [ 7938.930132]  [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
> [ 7938.942707]  [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
> [ 7938.956367]  [<ffffffff8115d103>] perf_pending_event+0x33/0x60
> [ 7938.963199]  [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
> [ 7938.969932]  [<ffffffff8114e448>] irq_work_run+0x18/0x40
> [ 7938.976212]  [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
> [ 7938.983957]  [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80

Ah, it looks like the never-ending irq_work...

Could you try this?

---
 kernel/events/core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f04daabfd1cf..088e81ae43cb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4574,6 +4574,13 @@ static void perf_pending_event(struct irq_work *entry)
 {
 	struct perf_event *event = container_of(entry,
 			struct perf_event, pending);
+	int rctx;
+
+	rctx = perf_swevent_get_recursion_context();
+	/*
+	 * If we 'fail' here, that's OK, it means recursion is already disabled
+	 * and we won't recurse 'further'.
+	 */
 
 	if (event->pending_disable) {
 		event->pending_disable = 0;
@@ -4584,6 +4591,9 @@ static void perf_pending_event(struct irq_work *entry)
 		event->pending_wakeup = 0;
 		perf_event_wakeup(event);
 	}
+
+	if (rctx >= 0)
+		perf_swevent_put_recursion_context(rctx);
 }
 
 /*
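
For reference, the recursion guard this leans on is a per-CPU counter with
one slot per context level; roughly (a simplified sketch of the helpers
behind perf_swevent_get_recursion_context(), adapted from
kernel/events/internal.h, not the exact source):

	/*
	 * Slots: 0 = task, 1 = softirq, 2 = hardirq, 3 = NMI.  If the
	 * slot for the current context is already taken we are
	 * recursing: return -1 and let the caller skip the nested
	 * event instead of generating more work.
	 */
	static inline int get_recursion_context(int *recursion)
	{
		int rctx;

		if (in_nmi())
			rctx = 3;
		else if (in_irq())
			rctx = 2;
		else if (in_softirq())
			rctx = 1;
		else
			rctx = 0;

		if (recursion[rctx])
			return -1;

		recursion[rctx]++;
		barrier();

		return rctx;
	}

	static inline void put_recursion_context(int *recursion, int rctx)
	{
		barrier();
		recursion[rctx]--;
	}

With the slot held across perf_pending_event(), any software event
(tracepoints included) that fires while the pending work runs sees the
counter set and is dropped, so no new irq_work gets queued.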


* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Vince Weaver @ 2015-02-19 17:05 UTC
  To: Peter Zijlstra
  Cc: Vince Weaver, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, 19 Feb 2015, Peter Zijlstra wrote:

> On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
> > 
> > Another bug found by the perf_fuzzer().  I think this one is different
> > from the one I sent the other day; it looks like something is going
> > very wrong in perf_callchain().
> > 
> > This one is reasonably reproducible, so let me know if there's any extra
> > debugging I can add.  This is on a Haswell machine running git from a few
> > days ago.
> 
> Aw man, and I had such high hopes when you said it took 4 days to find
> problems :/
> 
> I'll try and get some time to stare at both.

Well, if it's any comfort, it is definitely on the order of days now before
issues crop up (rather than minutes).  And the Haswell machine is the 
problem-causing one.  I also have a core2 and a Cortex-A9 machine fuzzing 
away and they've been at it a week now without turning up anything.

Vince



* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Vince Weaver @ 2015-02-19 17:24 UTC
  To: Vince Weaver
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, 19 Feb 2015, Vince Weaver wrote:

> I also have a core2 and a Cortex-A9 machine fuzzing 
> away and they've been at it a week now without turning up anything.

I have to take that back: it turns out the Cortex-A9 machine was wedged
for over a day; I just hadn't noticed because it hadn't dumped any kind
of message about the problem.  Hmm.

Vince


* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Vince Weaver @ 2015-02-19 17:32 UTC
  To: Vince Weaver
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, 19 Feb 2015, Vince Weaver wrote:

> I have to take that back: it turns out the Cortex-A9 machine was wedged 
> for over a day; I just hadn't noticed because it hadn't dumped any kind
> of message about the problem.  Hmm.

This turns out to be the find_get_context() bug that has been fixed in
current git; the kernel I was running wasn't new enough to have the fix in
it.  So, false alarm.

Vince


* Re: perf: fuzzer gets CPU stuck in perf_callchain()
From: Vince Weaver @ 2015-02-24  3:33 UTC
  To: Peter Zijlstra
  Cc: Vince Weaver, linux-kernel, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Jiri Olsa, Steven Rostedt

On Thu, 19 Feb 2015, Peter Zijlstra wrote:

> On Thu, Feb 19, 2015 at 11:54:40AM -0500, Vince Weaver wrote:
> 
> > [ 7938.802139]  [<ffffffff8115e114>] perf_tp_event+0xc4/0x210
> > [ 7938.861174]  [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
> > [ 7938.882197]  [<ffffffff810b7f10>] lock_release+0x130/0x260
> > [ 7938.888754]  [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
> > [ 7938.896510]  [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
> > [ 7938.903240]  [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
> > [ 7938.923894]  [<ffffffff811f71ce>] send_sigio+0xae/0x100
> > [ 7938.930132]  [<ffffffff811f72b7>] kill_fasync+0x97/0xf0
> > [ 7938.942707]  [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
> > [ 7938.956367]  [<ffffffff8115d103>] perf_pending_event+0x33/0x60
> > [ 7938.963199]  [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
> > [ 7938.969932]  [<ffffffff8114e448>] irq_work_run+0x18/0x40
> > [ 7938.976212]  [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
> > [ 7938.983957]  [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
> 
> Ah, it looks like the never-ending irq_work...
> 
> Could you try this?

Sorry for the delay. I managed to reproduce this problem again on 4.0-rc1,
verified it was reproducible, then applied the patch below; after that I
could no longer trigger the issue.

Tested-by: Vince Weaver <vincent.weaver@maine.edu>



> 
> ---
>  kernel/events/core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f04daabfd1cf..088e81ae43cb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4574,6 +4574,13 @@ static void perf_pending_event(struct irq_work *entry)
>  {
>  	struct perf_event *event = container_of(entry,
>  			struct perf_event, pending);
> +	int rctx;
> +
> +	rctx = perf_swevent_get_recursion_context();
> +	/*
> +	 * If we 'fail' here, that's OK, it means recursion is already disabled
> +	 * and we won't recurse 'further'.
> +	 */
>  
>  	if (event->pending_disable) {
>  		event->pending_disable = 0;
> @@ -4584,6 +4591,9 @@ static void perf_pending_event(struct irq_work *entry)
>  		event->pending_wakeup = 0;
>  		perf_event_wakeup(event);
>  	}
> +
> +	if (rctx >= 0)
> +		perf_swevent_put_recursion_context(rctx);
>  }
>  
>  /*


* [tip:perf/urgent] perf: Fix irq_work 'tail' recursion
From: tip-bot for Peter Zijlstra @ 2015-03-23 12:23 UTC
  To: linux-tip-commits
  Cc: acme, tglx, paulus, peterz, linux-kernel, mingo, jolsa, hpa,
	stable, rostedt, vincent.weaver

Commit-ID:  d525211f9d1be8b523ec7633f080f2116f5ea536
Gitweb:     http://git.kernel.org/tip/d525211f9d1be8b523ec7633f080f2116f5ea536
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 19 Feb 2015 18:03:11 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 23 Mar 2015 10:46:32 +0100

perf: Fix irq_work 'tail' recursion

Vince reported a watchdog lockup like:

	[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
	[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
	[<ffffffff810b7f10>] lock_release+0x130/0x260
	[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
	[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
	[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
	[<ffffffff811f71ce>] send_sigio+0xae/0x100
	[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
	[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
	[<ffffffff8115d103>] perf_pending_event+0x33/0x60
	[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
	[<ffffffff8114e448>] irq_work_run+0x18/0x40
	[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
	[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80

Which is caused by an irq_work generating new irq_work and therefore
not allowing forward progress.

This happens because processing the perf irq_work triggers another
perf event (tracepoint stuff) which in turn generates an irq_work ad
infinitum.
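
Schematically, the feedback loop (reconstructed from the trace above; a
sketch, not literal code):

	irq_work_run()
	  perf_pending_event()
	    perf_event_wakeup()
	      kill_fasync() -> send_sigio()     /* takes and releases locks */
	        lock_release()                  /* lock tracepoint fires */
	          perf_tp_event()
	            __perf_event_overflow()
	              irq_work_queue()          /* queues the next irq_work */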

Avoid this by raising the recursion counter in the irq_work -- which
effectively disables all software events (including tracepoints) from
actually triggering again.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 453ef61..2fabc06 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4574,6 +4574,13 @@ static void perf_pending_event(struct irq_work *entry)
 {
 	struct perf_event *event = container_of(entry,
 			struct perf_event, pending);
+	int rctx;
+
+	rctx = perf_swevent_get_recursion_context();
+	/*
+	 * If we 'fail' here, that's OK, it means recursion is already disabled
+	 * and we won't recurse 'further'.
+	 */
 
 	if (event->pending_disable) {
 		event->pending_disable = 0;
@@ -4584,6 +4591,9 @@ static void perf_pending_event(struct irq_work *entry)
 		event->pending_wakeup = 0;
 		perf_event_wakeup(event);
 	}
+
+	if (rctx >= 0)
+		perf_swevent_put_recursion_context(rctx);
 }
 
 /*

