Re: Crashes in perf_event_ctx_lock_nested

From: Don Zickus <dzickus@redhat.com>
To: Guenter Roeck <linux@roeck-us.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@kernel.org>
Subject: Re: Crashes in perf_event_ctx_lock_nested
Date: Tue, 31 Oct 2017 14:48:53 -0400	[thread overview]
Message-ID: <20171031184853.y4bliffza6hdhjkk@redhat.com> (raw)
In-Reply-To: <20171030224512.GA13592@roeck-us.net>

On Mon, Oct 30, 2017 at 03:45:12PM -0700, Guenter Roeck wrote:
> Hi Thomas,
> 
> we are seeing the following crash in v4.14-rc5/rc7 if CONFIG_HARDLOCKUP_DETECTOR
> is enabled.
> 
> [    5.908021] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [    5.915836]
> ==================================================================
> [    5.917325] Unsafe core_pattern used with fs.suid_dumpable=2.
> [    5.917325] Pipe handler or fully qualified core dump path required.
> [    5.917325] Set kernel.core_pattern before fs.suid_dumpable.
> [    5.924046] udevd[147]: starting version 225
> [    5.948520] BUG: KASAN: null-ptr-deref in perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    5.957380] Read of size 8 at addr 0000000000000208 by task watchdog/2/21
> [    5.964973]
> [    5.966646] CPU: 2 PID: 21 Comm: watchdog/2 Not tainted 4.14.0-rc7 #30
> [    5.973947] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
> [    5.982128] Call Trace:
> [    5.984874]  dump_stack+0x4d/0x63
> [    5.988585]  kasan_report+0x24b/0x295
> [    5.992691]  ? watchdog_nmi_enable+0x12/0x12
> [    5.997477]  __asan_load8+0x81/0x83
> [    6.001388]  perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.007437]  perf_event_enable+0xf/0x27
> [    6.011737]  hardlockup_detector_perf_enable+0x3e/0x40
> [    6.017493]  watchdog_nmi_enable+0xe/0x12
> [    6.021990]  watchdog_enable+0x8c/0xc5
> [    6.026195]  smpboot_thread_fn+0x27a/0x3c7
> [    6.030788]  ? sort_range+0x22/0x22
> [    6.034701]  kthread+0x221/0x231
> [    6.038321]  ? kthread_flush_work+0x120/0x120
> [    6.043204]  ret_from_fork+0x22/0x30
> [    6.047207]
> ==================================================================
> ...
> [    6.134561] BUG: unable to handle kernel NULL pointer dereference at 0000000000000208
> [    6.143316] IP: perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.149645] PGD 0 P4D 0 
> [    6.152478] Oops: 0000 [#1] PREEMPT SMP KASAN
> [    6.157350] Modules linked in:
> [    6.160766] CPU: 2 PID: 21 Comm: watchdog/2 Tainted: G    B 4.14.0-rc7 #30
> [    6.169422] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
> [    6.177583] task: ffff8803eacd1100 task.stack: ffff8803eacf8000
> [    6.184206] RIP: 0010:perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.191118] RSP: 0018:ffff8803eacffe10 EFLAGS: 00010246
> [    6.196962] RAX: 0000000000000296 RBX: 0000000000000000 RCX: ffffffffa52ee8a5
> [    6.204941] RDX: d8ecf37b519af800 RSI: 0000000000000003 RDI: ffffffffa6274610
> [    6.212911] RBP: ffff8803eacffe30 R08: dffffc0000000000 R09: 0000000000000000
> [    6.220888] R10: ffffed007d59ffa9 R11: ffffc9000044c1a1 R12: 0000000000000000
> [    6.228867] R13: ffff8803eacd1100 R14: 0000000000000208 R15: ffffffffa535ce54
> [    6.231476] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode.  Opts: commit=600
> [    6.237449] EXT4-fs (mmcblk0p8): mounted filesystem with ordered data mode.  Opts: (null)
> [    6.255332] FS:  0000000000000000(0000) GS:ffff8803ed500000(0000) knlGS:0000000000000000
> [    6.264384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.270812] CR2: 0000000000000208 CR3: 0000000430615001 CR4: 00000000003606e0
> [    6.278789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    6.286761] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    6.294741] Call Trace:
> [    6.297480]  perf_event_enable+0xf/0x27
> [    6.301771]  hardlockup_detector_perf_enable+0x3e/0x40
> [    6.307515]  watchdog_nmi_enable+0xe/0x12 
> [    6.311990]  watchdog_enable+0x8c/0xc5
> [    6.316176]  smpboot_thread_fn+0x27a/0x3c7
> [    6.320757]  ? sort_range+0x22/0x22
> [    6.324650]  kthread+0x221/0x231
> [    6.328251]  ? kthread_flush_work+0x120/0x120
> [    6.333114]  ret_from_fork+0x22/0x30
> [    6.337107] Code: a5 e8 70 58 f6 ff 5b 5d c3 55 48 89 e5 41 56 4c 8d b7 08 02
> 00 00 41 55 41 54 49 89 fc 53 e8 1a e9 f5 ff 4c 89 f7 e8 8d d3 07 00 <49> 8b 9c
> 24 08 02 00 00 31 d2 be 01 00 00 00 48 8d bb b0 00 00
> [    6.358230] RIP: perf_event_ctx_lock_nested.isra.71+0x22/0x89 RSP: ffff8803eacffe10
> [    6.366779] CR2: 0000000000000208
> [    6.370477] ---[ end trace ff68e1917f0a2044 ]---
> [    6.383531] Kernel panic - not syncing: Fatal exception
> [    6.389640] Kernel Offset: 0x24200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> The problem is a heisenbug - slight changes in the code, such as added logging,
> can make it disappear.
> 
> I added some logging and a long msleep() in hardlockup_detector_perf_cleanup().
> Here is the result:
> 
> [    0.274361] NMI watchdog: ############ hardlockup_detector_perf_init
> [    0.274915] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    0.277049] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    0.277593] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    0.278027] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    1.312044] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> [    1.385122] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    1.386028] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    1.466102] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    1.475536] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    1.535099] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    1.535101] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [    7.222816] NMI watchdog: ############ hardlockup_detector_perf_disable(0)
> [    7.230567] NMI watchdog: ############ hardlockup_detector_perf_disable(1)
> [    7.243138] NMI watchdog: ############ hardlockup_detector_perf_disable(2)
> [    7.250966] NMI watchdog: ############ hardlockup_detector_perf_disable(3)
> [    7.258826] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    7.258827] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    7.258831] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    7.258833] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    7.258834] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    7.258835] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    7.260169] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    7.260170] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [    7.494251] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    8.287135] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> 
> Looks like there are a number of problems: hardlockup_detector_event_create()
> creates the event data structure even if it is already created, and
> hardlockup_detector_perf_cleanup() runs unprotected and in parallel to
> the enable/create functions.

Hi,

Thomas created a deferred cleanup mechanism to help with hotplugging, so yes
an event tries to be re-created if the deferred cleanup wasn't run yet.  There
is probably a bug there to remove the re-created event off of the deferred
'dead_events_mask'.  And maybe some locking there too, though I don't know
if adding locking around that queue re-introduces the deadlock to the
hotplug code.  See commit 941154bd6937a710ae9193a3c733c0029e5ae7b8 for
details.


> 
> ALso, the following message is seen twice.
> 
> [    0.278758] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [    7.258838] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> 
> I don't offer a proposed patch since I have no idea how to best solve the
> problem.
> 
> Also, is the repeated enable/disable/cleanup as part of the normal boot
> really necessary ?

Yes, part of it.  The first event_create(0) is a hardware check.  Does the
hardware support perf counters.  The event is removed after the check. If
the check passes, allow all cpus to create the event.  Hence the second
event_create(0).  Otherwise, disable the counters from being enabled on
other cpus.

What doesn't make sense is the perf_disable(0-3) at the 7 second mark.  Not
sure why that happens.  But it seems to cause the problem by exposing a race
of disabling the event and then re-creating it.

What is odd, is the panic.

	hardlockup_detector_perf_enable ->
		hardlockup_detector_event_create
		perf_event_enable ->
			panic()

due to the delay of hardlockup_detector_perf_cleanup(), I would have assumed
hardlockup_detector_event_create() would have warned and returned
had the cleanup not been invoked yet.

The only race I can see here:

CPU A						CPU B
						hardlockup_detector_perf_cleanup
hardlockup_detector_perf_enable				 perf_event_release_kernel
	hardlockup_detector_event_create
							per_cpu(watchdog_ev, cpu) = NULL;
	perf_event_enable(this_cpu_read(watchdog_ev)
					^^^NULL


I am guessing adding a check in hardlockup_detector_perf_enable() would
prove that, but as you said, it might may the problem go into hiding.


The below patch probably doesn't solve the problem, just makes the race
condition a lot smaller?  Though I don't know if this causes more problems
internally to perf with two attached events temporarily.

Cheers,
Don

diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 71a62ceacdc8..0b9bd1e0bf57 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -221,12 +221,16 @@ void hardlockup_detector_perf_cleanup(void)
 		struct perf_event *event = per_cpu(watchdog_ev, cpu);
 
 		/*
+ 		 * Clear immediately to avoid delay of releasing event.
+ 		 */
+		per_cpu(watchdog_ev, cpu) = NULL;
+
+		/*
 		 * Required because for_each_cpu() reports  unconditionally
 		 * CPU0 as set on UP kernels. Sigh.
 		 */
 		if (event)
 			perf_event_release_kernel(event);
-		per_cpu(watchdog_ev, cpu) = NULL;
 	}
 	cpumask_clear(&dead_events_mask);
 }