Re: perf: fuzzer triggered warning in intel_pmu_drain_pebs_nhm()

From: Vince Weaver <vincent.weaver@maine.edu>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Stephane Eranian <eranian@gmail.com>,
	kan.liang@intel.com
Subject: Re: perf: fuzzer triggered warning in intel_pmu_drain_pebs_nhm()
Date: Fri, 3 Jul 2015 15:03:05 -0400 (EDT)
Message-ID: <alpine.DEB.2.20.1507031459340.18107@vincent-weaver-1.umelst.maine.edu>
In-Reply-To: <20150703131336.GI19282@twins.programming.kicks-ass.net>

On Fri, 3 Jul 2015, Peter Zijlstra wrote:

> On Thu, Jul 02, 2015 at 11:18:10AM -0400, Vince Weaver wrote:
> > 
> > So sad to say the lack of fuzzer reports was because I was out of town for 
> > a bit, not due to the kernel suddenly getting amazingly better.
> > 
> > In any case I am running against current git and getting a lot of 
> > warnings, but most of them seem to be old ones.  This following one looks 
> > new though.
> > 
> > This is current linus-git on a Haswell machine with peterz's patch to fix 
> > the aux buffer spinlock recursion (I can still crash the kernel if that 
> > patch is not applied).
> > 
> > It corresponds to:
> > 
> > 	WARN_ON_ONCE(!event->attr.precise_ip);
> > 
> > [  584.352324] WARNING: CPU: 2 PID: 18924 at arch/x86/kernel/cpu/perf_event_intel_ds.c:1198 intel_pmu_drain_pebs_nhm+0x283/0x2e0()
> 
> I've not yet tried to reproduce, but the below could explain things.
> 
> On disabling an event we first clear our cpuc->pebs_enabled bits, only
> to then check them to see if there are any set, and if so, drain the
> buffer.
> 
> If we just cleared the last bit, we'll fail to drain the buffer.
> 
> If we then program another event on that counter and another PEBS event,
> we can hit the above WARN with the 'stale' entries left over from the
> previous event.
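
The ordering problem described in the quoted text can be sketched as a
small userspace C model. This is only an illustration under stated
assumptions, not the kernel code: struct cpu_model, drain(),
disable_clear_first(), disable_drain_first() and stale_after() are
hypothetical names, and the "buffer" is just a record counter.

```c
#include <stdint.h>

/* Hypothetical model; none of these names are kernel APIs. */
struct cpu_model {
	uint64_t pebs_enabled;	/* one bit per counter with PEBS active */
	int buffered;		/* records still sitting in the buffer */
};

/* Stand-in for draining the PEBS buffer. */
static void drain(struct cpu_model *c)
{
	c->buffered = 0;
}

/* Order described as problematic: clear the bit, then test the mask. */
static void disable_clear_first(struct cpu_model *c, int idx)
{
	c->pebs_enabled &= ~(1ULL << idx);
	if (c->pebs_enabled)	/* false if idx was the last bit set */
		drain(c);
}

/* Alternative order (an assumption): test and drain, then clear. */
static void disable_drain_first(struct cpu_model *c, int idx)
{
	if (c->pebs_enabled)
		drain(c);
	c->pebs_enabled &= ~(1ULL << idx);
}

/* One PEBS event on counter 0 with 3 buffered records; disable it and
 * report how many records are left behind. */
static int stale_after(void (*disable)(struct cpu_model *, int))
{
	struct cpu_model c = { .pebs_enabled = 1ULL << 0, .buffered = 3 };

	disable(&c, 0);
	return c.buffered;
}
```

In this model stale_after(disable_clear_first) returns 3, i.e. the
records stay behind for whatever event uses the counter next, while
stale_after(disable_drain_first) returns 0.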

With that patch applied I still managed to hit the same warning:

	WARN_ON_ONCE(!event->attr.precise_ip);

I'll let it run some more and see if the watchdog still gets triggered.

[ 2217.544901] ------------[ cut here ]------------
[ 2217.550351] WARNING: CPU: 2 PID: 9136 at arch/x86/kernel/cpu/perf_event_intel_ds.c:1198 intel_pmu_drain_pebs_nhm+0x283/0x2e0()
[ 2217.563534] Modules linked in: fuse snd_hda_codec_hdmi i915 x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel psmouse hmac drbg evdev serio_raw ansi_cprng snd_hda_codec_realtek drm_kms_helper snd_hda_codec_generic ppdev iTCO_wdt iTCO_vendor_support pcspkr drm i2c_algo_bit aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper snd_hda_intel cryptd mei_me mei snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer tpm_tis tpm wmi button processor video battery i2c_i801 parport_pc parport snd lpc_ich mfd_core soundcore sg sr_mod sd_mod cdrom ehci_pci ehci_hcd ahci libahci xhci_pci xhci_hcd e1000e libata ptp crc32c_intel scsi_mod pps_core usbcore usb_common fan thermal thermal_sys
[ 2217.640998] CPU: 2 PID: 9136 Comm: perf_fuzzer Tainted: G        W       4.1.0+ #163
[ 2217.649810] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[ 2217.658281]  ffffffff81a105a0 ffff88011ea85b10 ffffffff8169f823 0000000000000000
[ 2217.666818]  0000000000000000 ffff88011ea85b50 ffffffff8106ec8a ffff88011ea85ba0
[ 2217.675329]  0000000000000002 0000000000000001 ffff88011ea8bd80 ffff8801190400c0
[ 2217.683821] Call Trace:
[ 2217.686960]  <NMI>  [<ffffffff8169f823>] dump_stack+0x45/0x57
[ 2217.693638]  [<ffffffff8106ec8a>] warn_slowpath_common+0x8a/0xc0
[ 2217.700549]  [<ffffffff8106ed7a>] warn_slowpath_null+0x1a/0x20
[ 2217.707296]  [<ffffffff8102f783>] intel_pmu_drain_pebs_nhm+0x283/0x2e0
[ 2217.714775]  [<ffffffff81031834>] ? intel_pmu_disable_event+0xa4/0x130
[ 2217.722216]  [<ffffffff81032235>] intel_pmu_handle_irq+0x255/0x440
[ 2217.729339]  [<ffffffff8115413e>] ? perf_event_ctx_lock_nested+0x5e/0xf0
[ 2217.737026]  [<ffffffff81028e76>] perf_event_nmi_handler+0x26/0x40
[ 2217.744070]  [<ffffffff810181ad>] nmi_handle+0x9d/0x140
[ 2217.750160]  [<ffffffff81018115>] ? nmi_handle+0x5/0x140
[ 2217.756290]  [<ffffffff8101843a>] default_do_nmi+0x4a/0x120
[ 2217.762688]  [<ffffffff8101859d>] do_nmi+0x8d/0xc0
[ 2217.768280]  [<ffffffff816a979f>] end_repeat_nmi+0x1e/0x2e
[ 2217.774627]  [<ffffffff810309ba>] ? __intel_pmu_enable_all+0x5a/0xc0
[ 2217.781894]  [<ffffffff810309ba>] ? __intel_pmu_enable_all+0x5a/0xc0
[ 2217.789153]  [<ffffffff810309ba>] ? __intel_pmu_enable_all+0x5a/0xc0
[ 2217.796415]  <<EOE>>  <IRQ>  [<ffffffff81030a30>] intel_pmu_enable_all+0x10/0x20
[ 2217.804847]  [<ffffffff8102a95c>] x86_pmu_enable+0x25c/0x2e0
[ 2217.811383]  [<ffffffff811560e2>] perf_pmu_enable+0x22/0x30
[ 2217.817837]  [<ffffffff81157a80>] perf_mux_hrtimer_handler+0x120/0x1f0
[ 2217.825316]  [<ffffffff81157960>] ? perf_event_context_sched_in+0x150/0x150
[ 2217.833239]  [<ffffffff810dcf43>] __hrtimer_run_queues+0xd3/0x260
[ 2217.840239]  [<ffffffff810dd4bb>] hrtimer_interrupt+0xab/0x1b0
[ 2217.846930]  [<ffffffff8104b32c>] local_apic_timer_interrupt+0x3c/0x70
[ 2217.854367]  [<ffffffff816aa1a1>] smp_apic_timer_interrupt+0x41/0x60
[ 2217.861630]  [<ffffffff816a83eb>] apic_timer_interrupt+0x6b/0x70
[ 2217.868540]  <EOI> 
[ 2217.870633] ---[ end trace 3a31b4d07b4f3450 ]---
[ 2353.824071] Uhhuh. NMI received for unknown reason 31 on CPU 1.
[ 2353.831238] Do you have a strange power saving mode enabled?
[ 2353.838120] Dazed and confused, but trying to continue