Re: Additional debug info to aid cacheline analysis

From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-toolchains@vger.kernel.org,
	Stephane Eranian <eranian@google.com>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@kernel.org>,
	Jiri Olsa <jolsa@kernel.org>,
	namhyung@kernel.org, irogers@google.com, kim.phillips@amd.com,
	Mark Rutland <mark.rutland@arm.com>,
	andrii@kernel.org
Subject: Re: Additional debug info to aid cacheline analysis
Date: Tue, 6 Oct 2020 16:00:54 -0300
Message-ID: <20201006190054.GA187024@kernel.org>
In-Reply-To: <20201006131703.GR2628@hirez.programming.kicks-ass.net>

On Tue, Oct 06, 2020 at 03:17:03PM +0200, Peter Zijlstra wrote:
> Hi all,

> I've been trying to float this idea for a fair number of years, and I
> think at least Stephane has been talking to tools people about it, but
> I'm not sure what, if anything, ever happened with it, so let me post it
> here :-)

> Basically, what I want is a (perf) tool for cacheline optimizations.
> Something very much like the excellent pahole tool, but with hit/miss
> information added.

> Now, some PMUs provide the data address for various relevant events, but
> that gets us the problem of mapping a 'random' address to a type and
> offset. And esp. for dynamic objects, that's a difficult problem.

> However, the compiler actually knows what type and offset (most) memory
> references are, so if perf can get us the exact IP (Intel PEBS / AMD
> IBS, as opposed to one with skid on) we could get the type from debug
> info.

> And therein lies the rub, existing debug info (DWARF) does contain type
> information, but in a way that is (I've been told) _very_ hard to use
> for this purpose.

> So could the compiler emit extra debug info for every instruction with a
> memory reference on to facilitate this?

I guess this is what is done to enable CO-RE: there you have to mark
areas of interest, i.e. in your BPF program you enclose accesses to
fields of kernel data structures so that, when loading it, libbpf can
look at the fields used in your program and at the same fields in the
kernel (/sys/kernel/btf/vmlinux), figure out if those fields moved, and
then fix up the offsets from the start of the struct.
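
To make that fixup step concrete, here is a rough sketch in plain C. It
is not libbpf code: struct member_info, struct field_reloc, the two
layouts and the function names are all made up for illustration, and
real CO-RE matches on BTF type information rather than on a bare member
name. It only shows the idea of "find the member by name in the target
layout, then patch the recorded offset".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, much simplified stand-in for a BTF member record. */
struct member_info {
	const char *name;
	size_t offset;		/* bytes from the start of the struct */
};

/*
 * Two made-up layouts of the same struct: the one the program was built
 * against, and the one on the running kernel, where an extra field
 * pushed "pid" further down.
 */
static const struct member_info build_layout[] = {
	{ "state", 0 }, { "pid", 8 }, { "comm", 16 },
};
static const struct member_info target_layout[] = {
	{ "state", 0 }, { "flags", 8 }, { "pid", 16 }, { "comm", 24 },
};

/* Hypothetical relocation record: instruction insn_idx accesses 'name'. */
struct field_reloc {
	size_t insn_idx;
	const char *name;
	size_t offset;		/* offset assumed at build time */
};

/* Return the offset of 'name' in 'layout', or -1 if it is not there. */
long find_offset(const struct member_info *layout, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(layout[i].name, name) == 0)
			return (long)layout[i].offset;
	return -1;
}

/* Patch the recorded offset to match the target layout; -1 on failure. */
int relocate_field(struct field_reloc *r, const struct member_info *target,
		   size_t n)
{
	long off;

	assert(r && r->name);
	off = find_offset(target, n, r->name);
	if (off < 0)
		return -1;	/* member does not exist on this kernel */
	r->offset = (size_t)off;
	return 0;
}
```

With these made-up layouts, a record built as { 0, "pid", 8 } ends up
with offset 16 after relocate_field() against target_layout, and a
record naming a member that does not exist is rejected.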

You want those relocation records for all types in the kernel, not to
fix things up, but to figure out which type and struct member a given
load or store refers to.
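
A rough sketch of how a tool could consume such records, in plain C.
Everything here is an assumption made for illustration: no compiler
emits struct access_record today, the addresses and offsets in the
table are invented, and the cacheline index is computed relative to the
start of the struct, which only matches real cachelines if the object
is cacheline aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define CACHELINE_SIZE 64	/* assumption: 64-byte cachelines */

/* Hypothetical per-instruction record: "the insn at ip touches type.member". */
struct access_record {
	uint64_t ip;
	const char *type;
	const char *member;
	size_t offset;		/* member offset within the type */
};

/* Made-up table, sorted by ip so that it can be binary searched. */
static const struct access_record records[] = {
	{ 0x1000, "task_struct", "state", 0 },
	{ 0x1008, "task_struct", "pid", 72 },
	{ 0x1010, "task_struct", "comm", 136 },
};

/* Find the record for an exact (skid-free) sampled ip, or NULL. */
const struct access_record *lookup_ip(const struct access_record *tab,
				      size_t n, uint64_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].ip == ip)
			return &tab[mid];
		if (tab[mid].ip < ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Account one sample to the struct-relative cacheline it touched.
 * Returns the cacheline index, or -1 if the ip is unknown or the
 * member lies beyond the nr_lines counters provided.
 */
int account_sample(uint64_t ip, unsigned int *hits, size_t nr_lines)
{
	const struct access_record *r;
	size_t line;

	assert(hits != NULL);
	r = lookup_ip(records, ARRAY_SIZE(records), ip);
	if (!r)
		return -1;
	line = r->offset / CACHELINE_SIZE;
	if (line >= nr_lines)
		return -1;
	hits[line]++;
	return (int)line;
}
```

Feeding it the sampled IPs of a miss event would give per-cacheline
counters that a pahole-like tool could print next to each member.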

https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html

<quote>
Compiler support

To enable BPF CO-RE and let BPF loader (i.e., libbpf) to adjust BPF
program to a particular kernel running on target host, Clang was
extended with few built-ins. They emit BTF relocations which capture a
high-level description of what pieces of information BPF program code
intended to read. If you were going to access task_struct->pid field,
Clang would record that it was exactly a field named "pid" of type
“pid_t” residing within a struct task_struct. This is done so that even
if target kernel has a task_struct layout in which “pid” field got moved
to a different offset within a task_struct structure (e.g., due to extra
field added before “pid” field), or even if it was moved into some
nested anonymous struct or union (and this is completely transparent in
C code, so no one ever pays attention to details like that), we’ll still
be able to find it just by its name and type information. This is called
a field offset relocation.

It is possible to capture (and subsequently relocate) not just a field
offset, but other field aspects, like field existence or size. Even for
bitfields (which are notoriously "uncooperative" kinds of data in the C
language, resisting efforts to make them relocatable) it is still
possible to capture enough information to make them relocatable, all
transparently to BPF program developer.
</quote>

<quote>
High-level BPF CO-RE mechanics

BPF CO-RE brings together necessary pieces of functionality and data at
all levels of the software stack: kernel, user-space BPF loader library
(libbpf), and compiler (Clang) – to make it possible and easy to write
BPF programs in a portable manner, handling discrepancies between
different kernels within the same pre-compiled BPF program. BPF CO-RE
requires a careful integration and cooperation of the following
components:

BTF type information, which allows to capture crucial pieces of
information about kernel and BPF program types and code, enabling all
the other parts of BPF CO-RE puzzle;

compiler (Clang) provides means for BPF program C code to express the
intent and record relocation information;

BPF loader (libbpf) ties BTFs from kernel and BPF program together to
adjust compiled BPF code to specific kernel on target hosts;

kernel, while staying completely BPF CO-RE-agnostic, provides advanced
BPF features to enable some of the more advanced scenarios.

Working in ensemble, these components enable unprecedented ability to
develop portable BPF programs with ease, adaptability, and expressivity,
previously achievable only through compiling BPF program’s C code in
runtime through BCC, but without paying a high price of the BCC way.
</quote>

- Arnaldo