From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754927AbaAJSzP (ORCPT ); Fri, 10 Jan 2014 13:55:15 -0500 Received: from mail-vb0-f49.google.com ([209.85.212.49]:55107 "EHLO mail-vb0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751172AbaAJSzN (ORCPT ); Fri, 10 Jan 2014 13:55:13 -0500 MIME-Version: 1.0 In-Reply-To: <20140110174141.GE8224@laptop.programming.kicks-ass.net> References: <52D011C9.7000209@hp.com> <20140110165822.GI7572@laptop.programming.kicks-ass.net> <20140110170223.GD8224@laptop.programming.kicks-ass.net> <20140110174141.GE8224@laptop.programming.kicks-ass.net> From: Andy Lutomirski Date: Fri, 10 Jan 2014 10:54:52 -0800 Message-ID: Subject: Re: SIGSEGV when using "perf record -g" with 3.13-rc* kernel To: Peter Zijlstra Cc: Waiman Long , Ingo Molnar , Arnaldo Carvalho de Melo , Linux Kernel Mailing List , Aswin Chandramouleeswaran , Scott J Norton , Linus Torvalds Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 10, 2014 at 9:41 AM, Peter Zijlstra wrote: > On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote: >> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote: >> > On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote: >> > > Peter, >> > > >> > > Call Trace: >> > > [] dump_stack+0x49/0x62 >> > > [] warn_slowpath_common+0x8c/0xc0 >> > > [] warn_slowpath_null+0x1a/0x20 >> > > [] force_sig_info+0x131/0x140 >> > > [] force_sig_info_fault+0x5f/0x70 >> > > [] ? search_exception_tables+0x2a/0x50 >> > > [] ? fixup_exception+0x1d/0x70 >> > > [] no_context+0x159/0x1f0 >> > > [] __bad_area_nosemaphore+0x12d/0x230 >> > > [] ? __bad_area_nosemaphore+0x12d/0x230 >> > > [] bad_area_nosemaphore+0x13/0x20 >> > > [] __do_page_fault+0x362/0x480 >> > > [] ? __do_page_fault+0x362/0x480 >> > > [] do_page_fault+0xe/0x10 >> > > [] page_fault+0x22/0x30 >> > > [] ? bad_to_user+0x5e/0x66b >> > > [] copy_from_user_nmi+0x76/0x90 >> > > [] perf_callchain_user+0xd0/0x360 >> > > [] perf_callchain+0x1af/0x1f0 >> > > [] perf_prepare_sample+0x2f3/0x3a0 >> > > [] __perf_event_overflow+0x10f/0x220 >> > > [] perf_event_overflow+0x14/0x20 >> > > [] intel_pmu_handle_irq+0x1de/0x3c0 >> > > [] ? emulate_vsyscall+0x144/0x390 >> > > [] perf_event_nmi_handler+0x34/0x60 >> > > [] nmi_handle+0x8a/0x170 >> > > [] default_do_nmi+0x68/0x210 >> > > [] do_nmi+0x90/0xe0 >> > > [] end_repeat_nmi+0x1e/0x2e >> > > [] ? emulate_vsyscall+0x144/0x390 >> > > [] ? emulate_vsyscall+0x144/0x390 >> > > [] ? emulate_vsyscall+0x144/0x390 >> > > <> [] __bad_area_nosemaphore+0x21d/0x230 >> > > [] bad_area_nosemaphore+0x13/0x20 >> > > [] __do_page_fault+0x362/0x480 >> > > [] ? vm_mmap_pgoff+0xbc/0xe0 >> > > [] do_page_fault+0xe/0x10 >> > > [] page_fault+0x22/0x30 >> > > ---[ end trace 037bf09d279751ec ]--- >> > > >> > > So this is a double page faults. Looking at relevant changes in >> > > 3.13 kernel, I spotted the following one patch that modified the >> > > perf_callchain_user() function shown up in the stack trace above: >> > > >> > >> > Hurm, that's an expected double fault, not something we should take the >> > process down for. >> > >> > I'll have to look at how all that works for a bit. > > Andy, introduced all this in 4fc3490114bb ("x86-64: Set siginfo and > context on vsyscall emulation faults"). > > It looks like your initial userspace fault hit the magic button and ends > up in emulate_vsyscall. Right at that point we trigger a PMI, which > tries to do a stack-trace. That stack-trace also stumbles into unmapped > memory (might be the same) and faults again. > > Now at that point, we usually just give up on the callchain and proceed > like normal, however because of this double fault emulate-vsyscall > SIGSEGV magic you loose. > > So the below might well be a valid fix.. Anybody? Andy? Yuck -- when I wrote that thing, I hadn't imagined that an interrupt (there's nothing particularly special about NMIs here, I think) would try to access user memory. The fix below looks okay, but IMO it needs a big fat comment explaining what's going on. Is there a way to ask whether the previous entry into the kernel came from user space? The valid "sig_on_uaccess_error" case happens when the current fault was triggered by a fault from userspace. The invalid case (and any invalid case from, say, an int3 that a tracepoint stuck in there) would be a page fault triggered by a fault handler that in turn started in kernel space (in particular, in emulate_vsyscall). For that matter, why does current_thread_info() work from an NMI at all? Does the NMI vector not have its own stack? The call that installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK). In any case, this at least needs a comment. I don't see why this same bug couldn't be triggered by non-NMI based tracing mechanisms, though. Sigh, corner cases of corner cases... > >> How easily can you reproduce this? Could you test something like the >> below, which would allow us to take double faults from NMI context. >> >> --- >> arch/x86/mm/fault.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c >> index 9ff85bb8dd69..18c498d4274d 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code, >> >> /* Are we prepared to handle this kernel fault? */ >> if (fixup_exception(regs)) { >> - if (current_thread_info()->sig_on_uaccess_error && signal) { >> + if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) { >> tsk->thread.trap_nr = X86_TRAP_PF; >> tsk->thread.error_code = error_code | PF_USER; >> tsk->thread.cr2 = address; --Andy