From: Andy Lutomirski
Date: Mon, 25 Apr 2016 09:50:15 -0700
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
To: Peter Zijlstra
Cc: Dmitry Safonov, Thomas Gleixner, Shuah Khan, Ingo Molnar, Dave Hansen,
    Borislav Petkov, khorenko@virtuozzo.com, X86 ML, Andrew Morton,
    xemul@virtuozzo.com, linux-kselftest@vger.kernel.org, Cyrill Gorcunov,
    Dmitry Safonov <0x7f454c46@gmail.com>, linux-kernel@vger.kernel.org,
    "H. Peter Anvin"
In-Reply-To: <20160425151638.GL3430@twins.programming.kicks-ass.net>

On Mon, Apr 25, 2016 at 8:16 AM, Peter Zijlstra wrote:
> On Thu, Apr 21, 2016 at 04:27:19PM -0700, Andy Lutomirski wrote:
>> > Did that help? Or did I confuse you moar?
>> >
>>
>> I think I'm starting to get it. What if we rearrange slightly, like this:
>>
>> perf_sample_data already has a struct perf_regs in it. We could add a
>> flags field to the first chunk of perf_sample_data:
>>
>> u64 sample_flags;
>
> I actually considered that for another problem. Didn't like it then, but
> seeing how I still haven't figured out a better way and you're now
> proposing this too, maybe...
>
> Part of the problem is that this will completely exhaust that first
> cacheline :/

What do you mean? You have a whole 63 bits left :)

Another option would be to initialize regs_user.regs to
PERF_REGS_NOT_YET_FILLED (#defined to ~0 or whatever). That would involve
a *write* to an otherwise possibly unused cacheline, which is less than
ideal but is probably considerably less bad than reading the cacheline.

>
>> perf_sample_data_init sets sample_flags to zero.
>
> And while we're on struct perf_sample_data, that thing has gotten
> insanely large. We carry it on-stack!
>
> It should be fairly easy to take regs_user_copy out and use a per-cpu
> array of them things for this I think, see below.
>
>> Now we rename perf_sample_regs_user to __perf_sample_regs_user and
>> make it non-static. We also teach it to set data->sample_flags |=
>> PERF_SAMPLE_FLAGS_HAS_REGS_USER. We add:
>>
>> static void perf_fetch_regs_user(struct perf_sample_data *data,
>>                                  struct pt_regs *interrupt_regs)
>> {
>>         if (data->sample_flags & PERF_SAMPLE_FLAGS_HAS_REGS_USER)
>>                 return;
>>
>>         __perf_sample_regs_user(&data->regs_user, interrupt_regs,
>>                                 &data->regs_user_copy);
>> }
>
> I meant to change perf_prepare_sample() to do:
>
>         u64 sample_type = event->attr.sample_type & ~data.sample_type;
>
> or something similar, such that we can override/avoid some of the work
> there.

I'm not sure I follow, but that's okay.
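(Going back to the sample_flags idea quoted above: the pieces that aren't
in that snippet would be roughly the sketch below. This is purely
illustrative -- the flag value, the sentinel spelling, and the helper name
are all made up, and in practice you'd pick only one of the two variants.)

/* Sketch only: the flag bit and the sentinel value are placeholders. */
#define PERF_SAMPLE_FLAGS_HAS_REGS_USER	(1ULL << 0)
#define PERF_REGS_NOT_YET_FILLED	((struct pt_regs *)~0UL)

/* Hypothetical helper, called from perf_sample_data_init(): */
static inline void perf_sample_data_init_lazy(struct perf_sample_data *data)
{
	/* Flag variant: no lazily-filled member is valid yet. */
	data->sample_flags = 0;

	/*
	 * Sentinel variant: a single *write* to the (possibly cold)
	 * regs_user cacheline instead of a flags word in the first one.
	 */
	data->regs_user.regs = PERF_REGS_NOT_YET_FILLED;
}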
>
>> (Hmm. This only really works well if we can guarantee that
>> interrupt_regs remains valid for the life of the perf_sample_data
>> object. Could we perhaps move the interrupt_regs pointer *into*
>> perf_sample_data and stop passing it all over the place?)
>
> So the problem with that is that we'll now overflow the one cacheline,
> and the last time I really looked at this that made samples that much
> slower.
>
> It might be time to re-evaluate this stuff, since pretty much everything
> will eventually write into perf_sample_data::ip etc.. which is the
> second line anyway.
>
> Also, looking at it, we actually have a pointer in there for this,
> perf_sample_data::regs_intr::regs, but it's at the very tail of this
> monster, 4 cachelines off the normal path.
>
>> We change all the callers of perf_sample_regs_user to use
>> perf_fetch_regs_user instead.
>
> There's only the one site currently, but yeah.
>
>> What do you think? If you like it, I can probably find some time to
>> give it a shot, but I don't guarantee that I won't miss some subtlety
>> in its interaction with the rest of the event output code.
>
> Sure, give it a go; I'll stomp on it to fix the pebs-time issue (we need
> to skip perf_prepare_sample's PERF_SAMPLE_TIME branch for that).

Will do. No promises about the time frame -- my queue overfloweth right
now. But I do have a draft patch or two that I should be able to dust off
a bit over the next few days.

>
>> On a vaguely related note, why is the big PEBS-to-pt_regs copy
>> conditional on (sample_type & PERF_SAMPLE_REGS_INTR)? I bet it would
>> be faster to make it unconditional, because you could avoid copying
>> over the entire pt_regs struct if PERF_SAMPLE_REGS_INTR isn't set.
>
> Hmm, yes.. that code did move about a bit, not sure what it looked like
> originally.
>
> In any case, that full copy is overkill in the simple case as well; I
> think that could get away with only copying cs,flags.

I'd be more comfortable with it if we always either populated all or none
of it, or otherwise made sure that unpopulated regs never leaked out into
a sample.
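(Roughly, I mean the shape below. This is just an illustration of the
"populate everything, then overlay" discipline, not the actual ds.c code;
iregs/pebs and the exact fields are stand-ins.)

	/*
	 * Illustration only: start from a fully defined pt_regs, then
	 * overlay whatever PEBS provides, so that no uninitialized
	 * register values can ever leak into a sample.
	 */
	*regs = *iregs;			/* every field is well defined now */
	regs->ip    = pebs->ip;		/* overlay the PEBS-provided bits   */
	regs->sp    = pebs->sp;
	regs->flags = pebs->flags;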
>
> Compile tested only..
>
> ---
> Subject: perf: Replace perf_sample_data::regs_user_copy with per-cpu storage
>
> struct perf_sample_data is immense, and we carry it on stack; shrink it
> some.
>
> struct perf_sample_data {
> 	/* size: 384, cachelines: 6, members: 19 */
> }
>
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  include/linux/perf_event.h |  2 --
>  kernel/events/core.c       | 23 +++++++++++++++++------
>  2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 85749ae8cb5f..dd2cab6c5bbb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -795,8 +795,6 @@ struct perf_sample_data {
>  	 * on arch details.
>  	 */
>  	struct perf_regs		regs_user;
> -	struct pt_regs			regs_user_copy;
> -
>  	struct perf_regs		regs_intr;
>  	u64				stack_user_size;
>  } ____cacheline_aligned;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index eabeb2aec00f..72754607d2cd 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5146,15 +5146,27 @@ perf_output_sample_regs(struct perf_output_handle *handle,
>  	}
>  }
>
> -static void perf_sample_regs_user(struct perf_regs *regs_user,
> -				  struct pt_regs *regs,
> -				  struct pt_regs *regs_user_copy)
> +static DEFINE_PER_CPU(struct pt_regs, __regs_user[4]);
> +
> +static struct pt_regs *regs_user_ptr(void)
> +{
> +	if (in_nmi())
> +		return this_cpu_ptr(&__regs_user[0]);
> +	if (in_interrupt())
> +		return this_cpu_ptr(&__regs_user[1]);
> +	if (in_serving_softirq())
> +		return this_cpu_ptr(&__regs_user[2]);
> +	return this_cpu_ptr(&__regs_user[3]);
> +}
> +

There's already something very similar in
kernel/trace/trace_event_perf.c and core.c
(perf_swevent_get_recursion_context()) that explicitly counts recursion.
Could they maybe be merged? I.e. there could just be a per-cpu pile of
pt_regs structs and a simple allocator for them? E.g.
perf_sample_data_init could increment some counter and
perf_sample_data_free (which doesn't currently exist) could decrement
the counter.

I don't personally mind keeping one of these on the stack -- it's not
*that* big. But maybe there's a much better solution.

There is only ever one set of user regs at a time. If perf events nest,
then the user regs are exactly the same. I wonder if this means that
there could be a single percpu copy of this mess. It might not be quite
that simple, though, because an NMI could hit in the middle of
populating the thing. Grumble.

> +static void
> +perf_sample_regs_user(struct perf_regs *regs_user, struct pt_regs *regs)
>  {
>  	if (user_mode(regs)) {
>  		regs_user->abi = perf_reg_abi(current);
>  		regs_user->regs = regs;
>  	} else if (current->mm) {
> -		perf_get_regs_user(regs_user, regs, regs_user_copy);
> +		perf_get_regs_user(regs_user, regs, regs_user_ptr());
>  	} else {
>  		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
>  		regs_user->regs = NULL;
> @@ -5638,8 +5650,7 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	}
>
>  	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
> -		perf_sample_regs_user(&data->regs_user, regs,
> -				      &data->regs_user_copy);
> +		perf_sample_regs_user(&data->regs_user, regs);

The rest looks reasonable.

--Andy
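P.S. By "a per-cpu pile of pt_regs structs and a simple allocator" above
I mean roughly the sketch below, piggybacking on the same nesting idea as
perf_swevent_get_recursion_context(). None of this exists today; every
name here is hypothetical and it's only meant to show the shape.

/* Sketch only: one scratch pt_regs slot per nesting level. */
#define PERF_REGS_NEST_MAX	4

static DEFINE_PER_CPU(struct pt_regs, perf_regs_scratch[PERF_REGS_NEST_MAX]);
static DEFINE_PER_CPU(int, perf_regs_nest);

static struct pt_regs *perf_regs_scratch_get(void)
{
	int idx = this_cpu_inc_return(perf_regs_nest) - 1;

	/* Nested deeper than we sized for; bail rather than corrupt a slot. */
	if (WARN_ON_ONCE(idx >= PERF_REGS_NEST_MAX)) {
		this_cpu_dec(perf_regs_nest);
		return NULL;
	}
	return this_cpu_ptr(&perf_regs_scratch[idx]);
}

static void perf_regs_scratch_put(void)
{
	/*
	 * Pairs with perf_regs_scratch_get(), e.g. from the hypothetical
	 * perf_sample_data_free() mentioned above.
	 */
	this_cpu_dec(perf_regs_nest);
}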