From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 31 Oct 2017 14:48:53 -0400
From: Don Zickus <dzickus@redhat.com>
To: Guenter Roeck
Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, Ingo Molnar
Subject: Re: Crashes in perf_event_ctx_lock_nested
Message-ID: <20171031184853.y4bliffza6hdhjkk@redhat.com>
References: <20171030224512.GA13592@roeck-us.net>
In-Reply-To: <20171030224512.GA13592@roeck-us.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: NeoMutt/20171013
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 30, 2017 at 03:45:12PM -0700, Guenter Roeck wrote:
> Hi Thomas,
>
> we are seeing the following crash in v4.14-rc5/rc7 if
> CONFIG_HARDLOCKUP_DETECTOR is enabled.
>
> [    5.908021] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [    5.915836]
> ==================================================================
> [    5.917325] Unsafe core_pattern used with fs.suid_dumpable=2.
> [    5.917325] Pipe handler or fully qualified core dump path required.
> [    5.917325] Set kernel.core_pattern before fs.suid_dumpable.
> [    5.924046] udevd[147]: starting version 225
> [    5.948520] BUG: KASAN: null-ptr-deref in perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    5.957380] Read of size 8 at addr 0000000000000208 by task watchdog/2/21
> [    5.964973]
> [    5.966646] CPU: 2 PID: 21 Comm: watchdog/2 Not tainted 4.14.0-rc7 #30
> [    5.973947] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
> [    5.982128] Call Trace:
> [    5.984874]  dump_stack+0x4d/0x63
> [    5.988585]  kasan_report+0x24b/0x295
> [    5.992691]  ? watchdog_nmi_enable+0x12/0x12
> [    5.997477]  __asan_load8+0x81/0x83
> [    6.001388]  perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.007437]  perf_event_enable+0xf/0x27
> [    6.011737]  hardlockup_detector_perf_enable+0x3e/0x40
> [    6.017493]  watchdog_nmi_enable+0xe/0x12
> [    6.021990]  watchdog_enable+0x8c/0xc5
> [    6.026195]  smpboot_thread_fn+0x27a/0x3c7
> [    6.030788]  ? sort_range+0x22/0x22
> [    6.034701]  kthread+0x221/0x231
> [    6.038321]  ? kthread_flush_work+0x120/0x120
> [    6.043204]  ret_from_fork+0x22/0x30
> [    6.047207]
> ==================================================================
> ...
> [    6.134561] BUG: unable to handle kernel NULL pointer dereference at 0000000000000208
> [    6.143316] IP: perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.149645] PGD 0 P4D 0
> [    6.152478] Oops: 0000 [#1] PREEMPT SMP KASAN
> [    6.157350] Modules linked in:
> [    6.160766] CPU: 2 PID: 21 Comm: watchdog/2 Tainted: G    B    4.14.0-rc7 #30
> [    6.169422] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
> [    6.177583] task: ffff8803eacd1100 task.stack: ffff8803eacf8000
> [    6.184206] RIP: 0010:perf_event_ctx_lock_nested.isra.71+0x22/0x89
> [    6.191118] RSP: 0018:ffff8803eacffe10 EFLAGS: 00010246
> [    6.196962] RAX: 0000000000000296 RBX: 0000000000000000 RCX: ffffffffa52ee8a5
> [    6.204941] RDX: d8ecf37b519af800 RSI: 0000000000000003 RDI: ffffffffa6274610
> [    6.212911] RBP: ffff8803eacffe30 R08: dffffc0000000000 R09: 0000000000000000
> [    6.220888] R10: ffffed007d59ffa9 R11: ffffc9000044c1a1 R12: 0000000000000000
> [    6.228867] R13: ffff8803eacd1100 R14: 0000000000000208 R15: ffffffffa535ce54
> [    6.231476] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: commit=600
> [    6.237449] EXT4-fs (mmcblk0p8): mounted filesystem with ordered data mode. Opts: (null)
> [    6.255332] FS:  0000000000000000(0000) GS:ffff8803ed500000(0000) knlGS:0000000000000000
> [    6.264384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.270812] CR2: 0000000000000208 CR3: 0000000430615001 CR4: 00000000003606e0
> [    6.278789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    6.286761] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    6.294741] Call Trace:
> [    6.297480]  perf_event_enable+0xf/0x27
> [    6.301771]  hardlockup_detector_perf_enable+0x3e/0x40
> [    6.307515]  watchdog_nmi_enable+0xe/0x12
> [    6.311990]  watchdog_enable+0x8c/0xc5
> [    6.316176]  smpboot_thread_fn+0x27a/0x3c7
> [    6.320757]  ? sort_range+0x22/0x22
> [    6.324650]  kthread+0x221/0x231
> [    6.328251]  ? kthread_flush_work+0x120/0x120
> [    6.333114]  ret_from_fork+0x22/0x30
> [    6.337107] Code: a5 e8 70 58 f6 ff 5b 5d c3 55 48 89 e5 41 56 4c 8d b7 08 02
> 00 00 41 55 41 54 49 89 fc 53 e8 1a e9 f5 ff 4c 89 f7 e8 8d d3 07 00 <49> 8b 9c
> 24 08 02 00 00 31 d2 be 01 00 00 00 48 8d bb b0 00 00
> [    6.358230] RIP: perf_event_ctx_lock_nested.isra.71+0x22/0x89 RSP: ffff8803eacffe10
> [    6.366779] CR2: 0000000000000208
> [    6.370477] ---[ end trace ff68e1917f0a2044 ]---
> [    6.383531] Kernel panic - not syncing: Fatal exception
> [    6.389640] Kernel Offset: 0x24200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> The problem is a heisenbug - slight changes in the code, such as added
> logging, can make it disappear.
>
> I added some logging and a long msleep() in hardlockup_detector_perf_cleanup().
> Here is the result:
>
> [    0.274361] NMI watchdog: ############ hardlockup_detector_perf_init
> [    0.274915] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    0.277049] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    0.277593] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    0.278027] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    1.312044] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> [    1.385122] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    1.386028] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    1.466102] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    1.475536] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    1.535099] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    1.535101] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [    7.222816] NMI watchdog: ############ hardlockup_detector_perf_disable(0)
> [    7.230567] NMI watchdog: ############ hardlockup_detector_perf_disable(1)
> [    7.243138] NMI watchdog: ############ hardlockup_detector_perf_disable(2)
> [    7.250966] NMI watchdog: ############ hardlockup_detector_perf_disable(3)
> [    7.258826] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    7.258827] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    7.258831] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    7.258833] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    7.258834] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    7.258835] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    7.260169] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    7.260170] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [    7.494251] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    8.287135] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
>
> Looks like there are a number of problems: hardlockup_detector_event_create()
> creates the event data structure even if it is already created, and
> hardlockup_detector_perf_cleanup() runs unprotected and in parallel to
> the enable/create functions.

Hi,

Thomas created a deferred cleanup mechanism to help with hotplugging, so
yes, an event can be re-created before the deferred cleanup has run.

There is probably a bug there: a re-created event needs to be taken back
off the deferred 'dead_events_mask'. Maybe some locking is needed there
too, though I don't know if adding locking around that queue re-introduces
the deadlock to the hotplug code.

See commit 941154bd6937a710ae9193a3c733c0029e5ae7b8 for details.
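Concretely, something like the sketch below is what I mean. It is
untested and paraphrases the current hardlockup_detector_event_create()
from memory, so take it as a sketch, not a patch. The
cpumask_test_and_clear_cpu() block is the new part, and releasing the
event from this context may well bring back the hotplug deadlock the
deferred cleanup was added to avoid:

static int hardlockup_detector_event_create(void)
{
	unsigned int cpu = smp_processor_id();
	struct perf_event_attr *wd_attr;
	struct perf_event *evt;

	/*
	 * If the deferred cleanup has not run yet, this CPU's old event
	 * is still pending release. Drop it here and take the CPU back
	 * off the dead list, so hardlockup_detector_perf_cleanup()
	 * cannot release the new event behind our back. Caveat: calling
	 * perf_event_release_kernel() from this context may re-introduce
	 * the hotplug deadlock that 941154bd6937 worked around.
	 */
	if (cpumask_test_and_clear_cpu(cpu, &dead_events_mask)) {
		perf_event_release_kernel(this_cpu_read(watchdog_ev));
		this_cpu_write(watchdog_ev, NULL);
	}

	wd_attr = &wd_hw_attr;
	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);

	/* Try to register using hardware perf events */
	evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
					       watchdog_overflow_callback, NULL);
	if (IS_ERR(evt))
		return PTR_ERR(evt);

	this_cpu_write(watchdog_ev, evt);
	return 0;
}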
>
> Also, the following message is seen twice.
>
> [    0.278758] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [    7.258838] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
>
> I don't offer a proposed patch since I have no idea how to best solve the
> problem.
>
> Also, is the repeated enable/disable/cleanup as part of the normal boot
> really necessary?

Yes, part of it. The first event_create(0) is a hardware check: does the
hardware support perf counters? The event is removed after the check.

If the check passes, all cpus are allowed to create the event, hence the
second event_create(0). Otherwise, the counters are prevented from being
enabled on the other cpus.

What doesn't make sense is the perf_disable(0-3) at the 7 second mark. Not
sure why that happens. But it seems to cause the problem by exposing a
race between disabling the event and re-creating it.

What is odd is the panic:

  hardlockup_detector_perf_enable
    -> hardlockup_detector_event_create
    -> perf_event_enable
         -> panic()

Due to the delay of hardlockup_detector_perf_cleanup(), I would have
assumed hardlockup_detector_event_create() would have warned and returned
had the cleanup not been invoked yet.

The only race I can see here:

  CPU A                                  CPU B
  hardlockup_detector_perf_cleanup       hardlockup_detector_perf_enable
    perf_event_release_kernel              hardlockup_detector_event_create
    per_cpu(watchdog_ev, cpu) = NULL;      perf_event_enable(this_cpu_read(watchdog_ev))
                                                                           ^^^ NULL

I am guessing adding a check in hardlockup_detector_perf_enable() would
prove that, but as you said, it might make the problem go into hiding.
(A sketch of such a check is at the end of this mail.)

The below patch probably doesn't solve the problem, it just makes the race
window a lot smaller. Though I don't know if having two attached events
temporarily causes more problems internally to perf.

Cheers,
Don

diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 71a62ceacdc8..0b9bd1e0bf57 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -221,12 +221,16 @@ void hardlockup_detector_perf_cleanup(void)
 		struct perf_event *event = per_cpu(watchdog_ev, cpu);
 
 		/*
+		 * Clear the pointer immediately, before the delayed release.
+		 */
+		per_cpu(watchdog_ev, cpu) = NULL;
+
+		/*
 		 * Required because for_each_cpu() reports unconditionally
 		 * CPU0 as set on UP kernels. Sigh.
 		 */
 		if (event)
 			perf_event_release_kernel(event);
-		per_cpu(watchdog_ev, cpu) = NULL;
 	}
 	cpumask_clear(&dead_events_mask);
 }
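For completeness, the check in hardlockup_detector_perf_enable() that I
mentioned above would look roughly like this. Again untested and
paraphrasing the v4.14 function from memory; the WARN_ON_ONCE() is just a
way to catch the race in the act, not a fix:

void hardlockup_detector_perf_enable(void)
{
	if (hardlockup_detector_event_create())
		return;

	if (!watchdog_cpus++)
		pr_info("Enabled. Permanently consumes one hw-PMU counter.\n");

	/*
	 * If hardlockup_detector_perf_cleanup() on another CPU released
	 * our event and NULLed the pointer between the create above and
	 * here, warn instead of letting perf dereference a NULL event.
	 */
	if (WARN_ON_ONCE(!this_cpu_read(watchdog_ev)))
		return;

	perf_event_enable(this_cpu_read(watchdog_ev));
}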