From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4E2D8DE1.6030604@gmail.com>
Date: Mon, 25 Jul 2011 09:38:09 -0600
From: David Ahern
To: Benjamin Herrenschmidt, Anton Blanchard
Cc: Kumar Gala, Becky Bruce, Paul Mackerras, linux-perf-users@vger.kernel.org, LKML, linuxppc-dev@lists.ozlabs.org
Subject: Re: perf PPC: kernel panic with callchains and context switch events
References: <4E274F5F.7000604@gmail.com> <4E2C53E0.3020400@gmail.com> <1311558949.25044.614.camel@pasglop>
In-Reply-To: <1311558949.25044.614.camel@pasglop>

Hi Ben:

On 07/24/2011 07:55 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2011-07-24 at 11:18 -0600, David Ahern wrote:
>> On 07/20/2011 03:57 PM, David Ahern wrote:
>>> I am hoping someone familiar with PPC can help understand a panic
>>> that is generated when capturing callchains with context switch
>>> events.
>>>
>>> The call trace is below. The short of it is that walking the
>>> callchain generates a page fault. Handling the page fault requires
>>> the mmap_sem, but it is currently held by setup_arg_pages.
>>> setup_arg_pages calls shift_arg_pages with the mmap_sem held;
>>> shift_arg_pages then calls move_page_tables, which has a
>>> cond_resched() at the top of its for loop. If the cond_resched() is
>>> removed from move_page_tables, everything works beautifully - no
>>> panics.
>>>
>>> So, the question: is it normal for walking the stack to trigger a
>>> page fault on PPC? The panic is not seen on x86-based systems.
>>
>> Can anyone confirm whether page faults while walking the stack are
>> normal for PPC? We really want to use the context switch event with
>> callchains and need to understand whether this behavior is normal.
>> Of course, if it is normal, a way to address the problem without a
>> panic will be needed.
>
> Now that leads to interesting discoveries :-) Becky, can you read all
> the way through and let me know what you think?
>
> So, trying to walk the user stack directly will potentially cause
> page faults if it's done by direct access. So if you're going to do
> it in a spot where you can't afford a fault, you need
> pagefault_disable(), I suppose. I think the problem with our existing
> code is that it's missing those calls around __get_user_inatomic().
>
> In fact, arguably, we don't want the hash code to modify the hash
> either (or even to hash things in). Our 64-bit code handles it today
> in perf_callchain.c in a way that involves pretty much duplicating
> the functionality of __get_user_pages_fast() as used by x86 (see
> below), but as a fallback from a direct access which misses the
> pagefault_disable() as well.
>
> I think it comes from an old assumption that this would always be
> called from an NMI, and the explicit tracepoints broke that
> assumption.
>
> In fact, we probably want to bump the NMI count, not just the IRQ
> count as pagefault_disable() does, to make sure we prevent hashing.
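> Something like this completely untested sketch, around the user-stack
> read in perf_callchain.c (I'm quoting read_user_stack_32() from
> memory, so treat the surroundings as illustrative):
>
> static int read_user_stack_32(unsigned int __user *ptr,
> 			      unsigned int *ret)
> {
> 	int rc;
>
> 	if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> 	    ((unsigned long)ptr & 3))
> 		return -EFAULT;
>
> 	/* A miss here must not take a page fault (or hash a page in)
> 	 * while the sampled context may hold locks such as mmap_sem;
> 	 * ideally we'd bump the NMI count too, not just the IRQ count,
> 	 * so the hash code stays out entirely. */
> 	pagefault_disable();
> 	rc = __get_user_inatomic(*ret, ptr);
> 	pagefault_enable();
>
> 	return rc;
> }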
> x86 does things differently, using __get_user_pages_fast() (a
> variant of get_user_pages_fast() that doesn't fall back to the normal
> get_user_pages()).
>
> Now, we could do the same (use __gup_fast too), but I can see a
> potential issue with ppc 32-bit platforms that have 64-bit PTEs,
> since we could end up GUP'ing in the middle of the two accesses.
>
> Becky: I think gup_fast is generally broken on 32-bit with 64-bit
> PTEs because of that; the problem isn't specific to perf backtraces.
> I'll propose a solution further down.
>
> Now, on x86, there is a similar problem with PAE, which is handled
> by:
>
>  - having gup disable IRQs
>  - relying on the fact that, to change from one valid value to
>    another valid value, the PTE will first get invalidated, which
>    requires an IPI and thus will be blocked by our interrupts being
>    off
>
> We do the first part, but the second part will break if we use HW TLB
> invalidation broadcast (yet another reason why those are bad; I think
> I will write a blog entry about it one of these days).
>
> I think we can work around this while keeping our broadcast TLB
> invalidations by having the invalidation code also increment a global
> generation count (using the existing lock used by the invalidation
> code; all 32-bit platforms have such a lock).
>
> From there, gup_fast can be changed to, with proper ordering, check
> the generation count around the loading of the PTE and loop if it has
> changed - kind of like a seqlock.
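> In rough kernel C (untested; tlb_inval_gen and the helper names are
> made up for the sketch):
>
> static atomic_t tlb_inval_gen = ATOMIC_INIT(0);
>
> /* Called by the TLB invalidation path, under the existing global
>  * invalidation lock, before the broadcast invalidate goes out. */
> static inline void tlb_inval_bump_gen(void)
> {
> 	atomic_inc(&tlb_inval_gen);
> }
>
> /* In gup_pte_range(): load a 64-bit PTE on a 32-bit platform without
>  * racing a broadcast invalidation changing it between the two 32-bit
>  * halves of the load. */
> static pte_t gup_get_pte(pte_t *ptep)
> {
> 	pte_t pte;
> 	int gen;
>
> 	do {
> 		gen = atomic_read(&tlb_inval_gen);
> 		smp_rmb();	/* order gen read vs. PTE read */
> 		pte = *ptep;	/* two 32-bit loads on 32-bit */
> 		smp_rmb();	/* order PTE read vs. gen re-read */
> 	} while (gen != atomic_read(&tlb_inval_gen));
>
> 	return pte;
> }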
> We also need the NMI count bump if we are going to try to keep the
> attempt at doing a direct access first for perf.
>
> Becky, do you feel like giving that a shot, or should I find another
> victim? (Or even do it myself ...) :-)

Did you have something in mind besides the patch Anton sent? We'll
give that one a try and see how it works. (Thanks, Anton!)

David

> Cheers,
> Ben.
>
>> Thanks,
>> David
>>
>>> []rb_erase+0x1b4/0x3e8
>>> []__dequeue_entity+0x50/0xe8
>>> []set_next_entity+0x178/0x1bc
>>> []pick_next_task_fair+0xb0/0x118
>>> []schedule+0x500/0x614
>>> []rwsem_down_failed_common+0xf0/0x264
>>> []rwsem_down_read_failed+0x34/0x54
>>> []down_read+0x3c/0x54
>>> []do_page_fault+0x114/0x5e8
>>> []handle_page_fault+0xc/0x80
>>> []perf_callchain+0x224/0x31c
>>> []perf_prepare_sample+0x240/0x2fc
>>> []__perf_event_overflow+0x280/0x398
>>> []perf_swevent_overflow+0x9c/0x10c
>>> []perf_swevent_ctx_event+0x1d0/0x230
>>> []do_perf_sw_event+0x84/0xe4
>>> []perf_sw_event_context_switch+0x150/0x1b4
>>> []perf_event_task_sched_out+0x44/0x2d4
>>> []schedule+0x2c0/0x614
>>> []__cond_resched+0x34/0x90
>>> []_cond_resched+0x4c/0x68
>>> []move_page_tables+0xb0/0x418
>>> []setup_arg_pages+0x184/0x2a0
>>> []load_elf_binary+0x394/0x1208
>>> []search_binary_handler+0xe0/0x2c4
>>> []do_execve+0x1bc/0x268
>>> []sys_execve+0x84/0xc8
>>> []ret_from_syscall+0x0/0x3c
>>>
>>> Thanks,
>>> David
>> _______________________________________________
>> Linuxppc-dev mailing list
>> Linuxppc-dev@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/linuxppc-dev