Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event

From: Peter Zijlstra <peterz@infradead.org>
To: Will Deacon <will@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>, Mark Rutland <mark.rutland@arm.com>,
	Mike Leach <mike.leach@linaro.org>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H . Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Mathieu Poirier <mathieu.poirier@linaro.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Jiri Olsa <jolsa@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event
Date: Mon, 11 Nov 2019 21:32:43 +0100
Message-ID: <20191111203243.GT4131@hirez.programming.kicks-ass.net>
In-Reply-To: <20191111172935.GA11972@willie-the-truck>

On Mon, Nov 11, 2019 at 05:29:35PM +0000, Will Deacon wrote:
> On Mon, Nov 11, 2019 at 05:05:05PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:
> > 
> > > Backing up though, I think I'm missing details about what this thread is
> > > trying to achieve. You're adding perf events so that coresight trace can
> > > take into account modifications of the kernel text, right?
> > 
> > Yes, because ARM-CS / Intel-PT need to know the exact text at any one
> > time in order to correctly decode their traces.
> > 
> > > If so:
> > >
> > >   * Does this need to take into account more than just jump_label()?
> > 
> > jump_label seems to be the initial target Adrian set, but yes, it needs
> > to cover 'everything'.
> 
> Including alternatives, which are what get me worried since the potential
> for recursion is pretty high there (on arm64, at least).

So I had not considered alternatives because they're typically run once
at boot (and module load) and never seen again. That would make them
just part of loading new text.

But you mentioned wanting to run them at hotplug time... which is more
'interesting'.

> > That is, all single instruction patching crud around:
> > 
> >  - optimized kprobes
> >    (regular kprobes are exempt because they rely on exceptions)
> >  - jump_labels
> >  - static_call (still pending but still)
> >  - ftrace fentry
> > 
> > We also need a solution for whole new chunks of text:
> > 
> >  - modules
> >  - ftrace trampolines
> >  - optprobe trampolines
> >  - JIT stuff
> > 
> > but that is so far not included; I had some ideas about a /dev/mem based
> > interface that would report new ranges and wait for acks (from open
> > file-desc) before freeing them.
> 
> I think that it would be nice not to end up with two very different
> interfaces for this. But I see this is still just an RFC, so maybe the
> full picture will emerge once we solve more of these use-cases.

The general distinction is between new text mappings and changing them
once they exist.

In general a text mapping is large and doesn't change (much). Once you
get an event saying it exists you can copy it out at your convenience;
all you really need to make sure of is that it doesn't disappear before
you've completed your copy.

OTOH dynamic text like jump_labels can happen quite frequently and we
cannot wait for all observers to have observed/copied the new state
before we allow changing it again -- ie. we need a buffered event.

So we don't want to stick whole new text things into a buffer (a module
might be larger than the buffer) but we cannot be lazy with text
updates.

That is, yes it sucks, but these are two different cases.
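
To make the distinction concrete, here is a rough userspace sketch (not
kernel code). All names (struct text_record, struct ring, ring_emit,
RING_SLOTS, PAYLOAD_MAX) are made up for illustration. A small text
update is copied into a buffered record so the text can change again
right away; a large mapping only gets a range notice, and the observer
has to copy the bytes out before the mapping goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS  4   /* hypothetical: number of buffered records */
#define PAYLOAD_MAX 16  /* hypothetical: largest update stored by value */

/* Hypothetical record; not the layout of any real perf event. */
struct text_record {
	uintptr_t addr;
	size_t    len;
	int       has_payload;           /* 1: bytes copied, 0: range notice only */
	uint8_t   payload[PAYLOAD_MAX];
};

struct ring {
	struct text_record slot[RING_SLOTS];
	unsigned int       head;         /* total records written so far */
};

/*
 * Emit one record. Small updates carry their bytes, so the writer never
 * waits for observers. Anything larger than PAYLOAD_MAX is reported as a
 * range only; keeping that range alive until it has been copied is a
 * separate problem that this sketch does not solve.
 */
static const struct text_record *ring_emit(struct ring *r, uintptr_t addr,
					   const uint8_t *bytes, size_t len)
{
	struct text_record *rec;

	assert(r != NULL && bytes != NULL);
	rec = &r->slot[r->head % RING_SLOTS];  /* oldest slot is overwritten */
	rec->addr = addr;
	rec->len = len;
	rec->has_payload = (len <= PAYLOAD_MAX);
	if (rec->has_payload)
		memcpy(rec->payload, bytes, len);
	r->head++;
	return rec;
}
```

A 5-byte jump update would land in the ring with its bytes; a 4096-byte
module would produce has_payload == 0 and just the (addr, len) pair.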

> > Instead we rely on exceptions; exceptions are differently encoded in the
> > CS/PT data streams.
> > 
> > The scheme used is:
> > 
> >  - overwrite target instruction with an exception (INT3 on x86, BRK on arm)
> >  - sync (IPI broadcast CPUID or I-FLUSH completion)
> 
> Hmm. Wouldn't this sync also need to drain the trace buffers for all other
> CPUs so that we make sure that the upcoming TEXT_POKE event occurs after
> all prior trace data, which could've been from before the breakpoint was
> installed?

All we need to ensure is that the breakpoint is visible before the
event. That way we know that before the event we have the old
instruction, after the event we have the new instruction, and any
ambiguity must be resolved with exception packets.

That is, if there is concurrency, the trace buffer will be 'flushed' by
the exception. If there is no concurrency, we don't care and can assume
old/new depending on timestamps relative to the event.
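
For the no-concurrency case, the decoder side could look roughly like
the sketch below. struct poke_record and insn_at() are hypothetical
names, and the assumption is that trace timestamps and event timestamps
share a clock. Anything that actually hit the breakpoint shows up as an
exception packet and is not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder-side view of one text poke event. */
struct poke_record {
	uint64_t       ts;        /* timestamp of the event */
	const uint8_t *old_insn;  /* instruction bytes before the poke */
	const uint8_t *new_insn;  /* instruction bytes after the poke */
};

/*
 * Pick the instruction bytes to decode with for trace data stamped
 * sample_ts: old text before the event, new text at or after it.
 */
static const uint8_t *insn_at(const struct poke_record *rec,
			      uint64_t sample_ts)
{
	assert(rec != NULL);
	return sample_ts < rec->ts ? rec->old_insn : rec->new_insn;
}
```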

> > at this point we know the instruction _will_ trap and CS/PT can observe
> > this alternate flow. That is, the exception handler will emulate the
> > instruction.
> > 
> >  - emit the TEXT_POKE event with both the old and new instruction
> >    included
> > 
> >  - overwrite the target instruction with the new instruction
> >  - sync
> > 
> > at this point the new instruction should be valid.
> > 
> > Using this scheme we can at all times follow the instruction flow.
> > Either it is an exception and the exception encoding helps us navigate,
> > or, on either side, we'll know the old/new instruction.
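
The ordering in the quoted scheme can be sketched in plain userspace C
as below. This is only an illustration of the step order, not the
kernel implementation: poke_text(), struct poke_event, sync_all_cpus()
and TRAP_BYTE are made-up names, the "sync" is a counter standing in
for the IPI broadcast / I-FLUSH completion, and "emitting" the event
just means filling in a caller-provided struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSN_MAX  16
#define TRAP_BYTE 0xCC  /* x86 INT3 opcode, used here purely as a marker */

/* Hypothetical record; not the layout of the real perf event. */
struct poke_event {
	size_t  offset;
	size_t  len;
	uint8_t old_insn[INSN_MAX];
	uint8_t new_insn[INSN_MAX];
};

/* Stand-in for the cross-CPU sync; it only counts how often it ran. */
static int sync_calls;
static void sync_all_cpus(void) { sync_calls++; }

static int poke_text(uint8_t *text, size_t offset, const uint8_t *new_insn,
		     size_t len, struct poke_event *ev)
{
	uint8_t old_insn[INSN_MAX];

	if (len == 0 || len > INSN_MAX)
		return -1;
	memcpy(old_insn, text + offset, len);

	/* 1. Overwrite the target with a trap and make it visible. */
	text[offset] = TRAP_BYTE;
	sync_all_cpus();

	/* 2. Emit the event with both old and new instruction. */
	ev->offset = offset;
	ev->len = len;
	memcpy(ev->old_insn, old_insn, len);
	memcpy(ev->new_insn, new_insn, len);

	/* 3. Write the new instruction and sync again. */
	memcpy(text + offset, new_insn, len);
	sync_all_cpus();
	return 0;
}
```

The point of the ordering is that the event is only emitted once the
trap is visible, so any execution racing with the update goes through
the exception path rather than through half-written text.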