From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758104AbaAJT4b (ORCPT ); Fri, 10 Jan 2014 14:56:31 -0500 Received: from mail-ve0-f181.google.com ([209.85.128.181]:54970 "EHLO mail-ve0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753397AbaAJT42 (ORCPT ); Fri, 10 Jan 2014 14:56:28 -0500 MIME-Version: 1.0 In-Reply-To: <52D04D6D.9010504@hp.com> References: <52D011C9.7000209@hp.com> <20140110165822.GI7572@laptop.programming.kicks-ass.net> <20140110170223.GD8224@laptop.programming.kicks-ass.net> <20140110174141.GE8224@laptop.programming.kicks-ass.net> <52D04D6D.9010504@hp.com> From: Andy Lutomirski Date: Fri, 10 Jan 2014 11:56:07 -0800 Message-ID: Subject: Re: SIGSEGV when using "perf record -g" with 3.13-rc* kernel To: Waiman Long Cc: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Linux Kernel Mailing List , Aswin Chandramouleeswaran , Scott J Norton , Linus Torvalds Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 10, 2014 at 11:43 AM, Waiman Long wrote: > On 01/10/2014 01:54 PM, Andy Lutomirski wrote: >> >> On Fri, Jan 10, 2014 at 9:41 AM, Peter Zijlstra >> wrote: >>> >>> On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote: >>>> >>>> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote: >>>>> >>>>> On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote: >>>>>> >>>>>> Peter, >>>>>> >>>>>> Call Trace: >>>>>> [] dump_stack+0x49/0x62 >>>>>> [] warn_slowpath_common+0x8c/0xc0 >>>>>> [] warn_slowpath_null+0x1a/0x20 >>>>>> [] force_sig_info+0x131/0x140 >>>>>> [] force_sig_info_fault+0x5f/0x70 >>>>>> [] ? search_exception_tables+0x2a/0x50 >>>>>> [] ? fixup_exception+0x1d/0x70 >>>>>> [] no_context+0x159/0x1f0 >>>>>> [] __bad_area_nosemaphore+0x12d/0x230 >>>>>> [] ? __bad_area_nosemaphore+0x12d/0x230 >>>>>> [] bad_area_nosemaphore+0x13/0x20 >>>>>> [] __do_page_fault+0x362/0x480 >>>>>> [] ? __do_page_fault+0x362/0x480 >>>>>> [] do_page_fault+0xe/0x10 >>>>>> [] page_fault+0x22/0x30 >>>>>> [] ? bad_to_user+0x5e/0x66b >>>>>> [] copy_from_user_nmi+0x76/0x90 >>>>>> [] perf_callchain_user+0xd0/0x360 >>>>>> [] perf_callchain+0x1af/0x1f0 >>>>>> [] perf_prepare_sample+0x2f3/0x3a0 >>>>>> [] __perf_event_overflow+0x10f/0x220 >>>>>> [] perf_event_overflow+0x14/0x20 >>>>>> [] intel_pmu_handle_irq+0x1de/0x3c0 >>>>>> [] ? emulate_vsyscall+0x144/0x390 >>>>>> [] perf_event_nmi_handler+0x34/0x60 >>>>>> [] nmi_handle+0x8a/0x170 >>>>>> [] default_do_nmi+0x68/0x210 >>>>>> [] do_nmi+0x90/0xe0 >>>>>> [] end_repeat_nmi+0x1e/0x2e >>>>>> [] ? emulate_vsyscall+0x144/0x390 >>>>>> [] ? emulate_vsyscall+0x144/0x390 >>>>>> [] ? emulate_vsyscall+0x144/0x390 >>>>>> <> [] __bad_area_nosemaphore+0x21d/0x230 >>>>>> [] bad_area_nosemaphore+0x13/0x20 >>>>>> [] __do_page_fault+0x362/0x480 >>>>>> [] ? vm_mmap_pgoff+0xbc/0xe0 >>>>>> [] do_page_fault+0xe/0x10 >>>>>> [] page_fault+0x22/0x30 >>>>>> ---[ end trace 037bf09d279751ec ]--- >>>>>> >>>>>> So this is a double page faults. Looking at relevant changes in >>>>>> 3.13 kernel, I spotted the following one patch that modified the >>>>>> perf_callchain_user() function shown up in the stack trace above: >>>>>> >>>>> Hurm, that's an expected double fault, not something we should take the >>>>> process down for. >>>>> >>>>> I'll have to look at how all that works for a bit. >>> >>> Andy, introduced all this in 4fc3490114bb ("x86-64: Set siginfo and >>> context on vsyscall emulation faults"). >>> >>> It looks like your initial userspace fault hit the magic button and ends >>> up in emulate_vsyscall. Right at that point we trigger a PMI, which >>> tries to do a stack-trace. That stack-trace also stumbles into unmapped >>> memory (might be the same) and faults again. >>> >>> Now at that point, we usually just give up on the callchain and proceed >>> like normal, however because of this double fault emulate-vsyscall >>> SIGSEGV magic you loose. >>> >>> So the below might well be a valid fix.. Anybody? Andy? >> >> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt >> (there's nothing particularly special about NMIs here, I think) would >> try to access user memory. The fix below looks okay, but IMO it needs >> a big fat comment explaining what's going on. >> >> Is there a way to ask whether the previous entry into the kernel came >> from user space? The valid "sig_on_uaccess_error" case happens when >> the current fault was triggered by a fault from userspace. The >> invalid case (and any invalid case from, say, an int3 that a >> tracepoint stuck in there) would be a page fault triggered by a fault >> handler that in turn started in kernel space (in particular, in >> emulate_vsyscall). > > > The processes that got the SIGSEGV were all running shell scripts. I am not > totally sure that they were running in user space when getting the PMIs, but > are likely the case. The error you're seeing is: - Userspace code calls into the vsyscall page (e.g. old code calling gettimeofday). - Calls to the vsyscall page trap, so you end up executing in kernel space in emulate_vsyscall. - perf is running, so you end up in an NMI handler that triggers while sig_on_uaccess_fault is true. - The perf callchain code tries to read from a bad userspace pointer (not sure why -- the ip value in the vsyscall page *is* readable). That traps (as expected), but the trap handler injects SIGSEGV due to sig_on_uaccess_fault==1. The cleanest fix I can think of is to change the condition to sig_on_uaccess_fault==1 && (the get_user caller was a single kernel entry away from userspace). In the failure you're seeing, there was an NMI in there. --Andy