From: Andy Lutomirski
Date: Wed, 20 Apr 2016 08:40:23 -0700
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
To: Peter Zijlstra
Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar, Dave Hansen,
    Borislav Petkov, khorenko@virtuozzo.com, X86 ML, Andrew Morton,
    xemul@virtuozzo.com, linux-kselftest@vger.kernel.org, Cyrill Gorcunov,
    Dmitry Safonov <0x7f454c46@gmail.com>, "linux-kernel@vger.kernel.org",
    "H. Peter Anvin"
In-Reply-To: <20160420110402.GY3408@twins.programming.kicks-ass.net>
References: <1459960170-4454-2-git-send-email-dsafonov@virtuozzo.com>
    <57064E6C.2030202@virtuozzo.com> <5707B70F.9080402@virtuozzo.com>
    <5707D9F1.3090102@virtuozzo.com> <570E79EF.7030408@virtuozzo.com>
    <20160420110402.GY3408@twins.programming.kicks-ass.net>

On Wed, Apr 20, 2016 at 4:04 AM, Peter Zijlstra wrote:
> On Thu, Apr 14, 2016 at 11:27:35AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 13, 2016 at 9:55 AM, Dmitry Safonov wrote:
>> > On 04/08/2016 11:44 PM, Andy Lutomirski wrote:
>> >>
>> >> Feel free to ask for help on some of these details. user_64bit_mode
>> >> will be helpful too.
>> >
>> > Hello again,
>> >
>> > here are some questions on TIF_IA32 removal:
>> > - in function intel_pmu_pebs_fixup_ip: there is a need to
>> > know if the process was in native/compat mode for the instruction
>> > interpreter for IP + one instruction fixup.
>> > There are registers, but they are from PEBS, which does not contain
>> > segment descriptors (even for PEBSv3). Other values
>> > are from interrupt regs (look at setup_pebs_sample_data).
>> > So, I guess, we may use user_64bit_mode on interrupt
>> > register set, which will be racy with changing task's mode,
>> > but quite ok?
>>
>> Here's my understanding:
>>
>> We don't actually know the mode, and there's no way we could get it
>> exactly. User code could have changed the mode between when the PEBS
>> event was written and when we got the interrupt, and there's no way
>> for us to tell.
>>
>> The regs passed to the interrupt aren't particularly helpful -- if we
>> get the overflow event from kernel mode, the regs will be kernel regs,
>> not user regs.
>>
>> What we can do is to use the regs returned by perf_get_regs_user,
>> which I imagine perf is already doing. Peter, is this the case?
>
> *confused*, how is perf_get_regs_user() connected to the PEBS fixup?
>
> Ah, you want to use perf_get_regs_user() instead of task_pt_regs()
> because of how an NMI during interrupt entry would mess up the
> task_pt_regs() contents.
>
> At that point you can use regs_user->abi, right?

Yes, exactly.

Do LBR, PEBS, and similar report user regs or do they merely want to
know the instruction format? If the latter, I could whip up a tiny
function to do just that (like perf_get_regs_user but just for ABI --
it would be simpler).

[merging some emails]

>> Peter, I got lost in the code that calls this. Are regs coming from
>> the overflow interrupt's regs, current_pt_regs(), or
>> perf_get_regs_user?
>
> So get_perf_callchain() will get regs from:
>
> - interrupt/NMI regs
> - perf_arch_fetch_caller_regs()
>
> And when user && !user_mode(), we'll use:
>
> - task_pt_regs() (which arguably should maybe be perf_get_regs_user())

Could you point me to this bit of the code?
>
> to call perf_callchain_user(), which then ends up calling
> perf_callchain_user32() which is expected to NO-OP for 64bit userspace.
>
>> If it's the perf_get_regs_user, then this should be okay, but passing
>> in the ABI field directly would be even nicer. If they're coming from
>> the overflow interrupt's regs or current_pt_regs(), could we change
>> that?
>>
>> It might also be nice to make sure that we call perf_get_regs_user
>> exactly once per overflow interrupt -- i.e. we could push it into the
>> main code rather than the regs sampling code.
>
> The risk there is that we might not need the user regs at all to handle
> the overflow thingy, so doing it unconditionally would be unwanted.

One call to perf_get_regs_user per interrupt shouldn't be too bad --
certainly much better than one per PEBS record. One call to get user
ABI per overflow would be even less bad, but at that point, folding it
into the PEBS code wouldn't be so bad either.

If I'm understanding this right (a big, big if), if we get a PEBS
overflow while running in user mode, we'll dump out the user regs (and
call perf_get_regs_user) and all the PEBS entries (subject to
exclude_kernel and with all the decoding magic). So, in that case, we
call perf_get_regs_user.

If we get a PEBS overflow while running in kernel mode, we'll report
the kernel regs (if !exclude_kernel) and report the PEBS data as well.
If any of those records are in user mode, then, ideally, we'd invoke
perf_get_regs_user or similar *once* to get the ABI. Although, if we
can get the user ABI efficiently enough, then maybe we don't care if
we call it once per PEBS record.

On x86, the only weird cases are NMIs or MCEs that land in the
syscall, syscall32, and sysenter prologues (easy to handle fully
correctly if we care because the IP that we interrupted tells us the
ABI) and the bullshit SYSENTER+TF thing. Even the latter isn't so
hard to get right.

--Andy
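
For illustration only, here is a rough standalone sketch of the "get the
user ABI at most once per overflow, and only if some record needs it"
idea from the discussion above. It is plain userspace C, not kernel code
and not a proposed patch. Every sketch_* name is hypothetical, the enum
values are local stand-ins chosen to mirror PERF_SAMPLE_REGS_ABI_*, the
task_abi field stands in for whatever perf_get_regs_user (or a lighter
ABI-only helper) would report, and treating "no ABI known" as 64-bit is
an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; values chosen to mirror PERF_SAMPLE_REGS_ABI_*. */
enum sketch_regs_abi {
	SKETCH_ABI_NONE = 0,	/* no user regs available */
	SKETCH_ABI_32   = 1,
	SKETCH_ABI_64   = 2,
};

/* Hypothetical per-overflow state: caches the user ABI lazily. */
struct sketch_overflow_ctx {
	enum sketch_regs_abi task_abi;	/* what the real lookup would return */
	enum sketch_regs_abi cached;	/* valid only if cached_valid */
	bool cached_valid;
	unsigned int lookups;		/* how often the "expensive" path ran */
};

/* Assumption: on x86-64, kernel addresses have the top bit set. */
static bool sketch_kernel_ip(uint64_t ip)
{
	return (int64_t)ip < 0;
}

/* Do the (simulated) expensive lookup at most once per overflow. */
static enum sketch_regs_abi sketch_user_abi(struct sketch_overflow_ctx *ctx)
{
	if (!ctx->cached_valid) {
		ctx->cached = ctx->task_abi;	/* stand-in for the real lookup */
		ctx->cached_valid = true;
		ctx->lookups++;
	}
	return ctx->cached;
}

/*
 * Decide how to decode the instruction at a PEBS record's IP.
 * Kernel IPs never touch the user ABI, so an overflow that contains
 * only kernel records pays nothing for the lookup.
 */
static bool sketch_record_is_64bit(struct sketch_overflow_ctx *ctx, uint64_t ip)
{
	if (sketch_kernel_ip(ip))
		return true;
	return sketch_user_abi(ctx) != SKETCH_ABI_32;
}
```

With a zero-initialized context per overflow, any number of user-mode
records costs a single lookup and kernel-only batches cost none, which
is one way to reconcile "call it once" with "don't call it if we don't
need it". It does nothing about the inherent race noted earlier: the
task may have switched modes after the PEBS record was written.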