Re: Question: missing vmlinux BTF variable declarations

From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>,
	Yonghong Song <yhs@fb.com>, Shung-Hsi Yu <shung-hsi.yu@suse.com>,
	bpf <bpf@vger.kernel.org>, Omar Sandoval <osandov@osandov.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>
Subject: Re: Question: missing vmlinux BTF variable declarations
Date: Wed, 27 Apr 2022 11:24:42 -0700
Message-ID: <87r15iv0yd.fsf@stepbren-lnx.us.oracle.com>
In-Reply-To: <CAEf4BzbiFNnsu9pji5ifzj4nVEyAYYdqP=QVZ3XFwzL48prP3A@mail.gmail.com>

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Wed, Mar 16, 2022 at 11:11 PM Stephen Brennan <stephen@brennan.io> wrote:
>>
>> Arnaldo Carvalho de Melo <acme@kernel.org> writes:
>> [...]
>> >> I think that kallsyms, BTF, and ORC together will be enough to provide a
>> >> lite debugging experience. Some things will be missing:
>> >
>> >> - mapping backtrace addresses to source code lines
>> >
>> > So, BTF has provisions for that, and it's present in the eBPF programs,
>> > perf annotate uses it, see tools/perf/util/annotate.c,
>> > symbol__disassemble_bpf(), it goes like:
>> >
>> >         struct bpf_prog_linfo *prog_linfo = NULL;
>> >
>> >         info_node = perf_env__find_bpf_prog_info(dso->bpf_prog.env,
>> >                                                  dso->bpf_prog.id);
>> >         if (!info_node) {
>> >                 ret = SYMBOL_ANNOTATE_ERRNO__BPF_MISSING_BTF;
>> >                 goto out;
>> >         }
>> >         info_linear = info_node->info_linear;
>> >         sub_id = dso->bpf_prog.sub_id;
>> >
>> >         info.buffer = (void *)(uintptr_t)(info_linear->info.jited_prog_insns);
>> >         info.buffer_length = info_linear->info.jited_prog_len;
>> >
>> >         if (info_linear->info.nr_line_info)
>> >                 prog_linfo = bpf_prog_linfo__new(&info_linear->info);
>> >
>> >                 addr = pc + ((u64 *)(uintptr_t)(info_linear->info.jited_ksyms))[sub_id];
>> >                 count = disassemble(pc, &info);
>> >
>> >                 if (prog_linfo)
>> >                         linfo = bpf_prog_linfo__lfind_addr_func(prog_linfo,
>> >                                                                 addr, sub_id,
>> >                                                                 nr_skip);
>> >                 if (linfo && btf) {
>> >                         srcline = btf__name_by_offset(btf, linfo->line_off);
>> >                         nr_skip++;
>> >                 } else
>> >                         srcline = NULL;
>> >
>> > etc.
>> >
>> > Having this for the kernel proper is thus doable, but then we go on
>> > making BTF info grow.
>> >
>> > Perhaps having this as optional, distros or appliances wanting to have a
>> > kernel with this extra info would add it and then tools would use it if
>> > available?
>>
>> I didn't know about the source code mapping support! And I certainly see
>> the utility of it for BPF programs. However, I'm not sure that a "lite"
>> kernel debugging experience *needs* source line mapping. I suppose I
>> should have made it more clear, but I don't think of that list of
>> "missing" features as a checklist of things we'd want feature parity
>> for.
>>
>> The advantage of BTF for debugging would be that it is small, and that
>> it is part of the kernel image without referencing any other file,
>> build-id, or kernel version. Ideally, a debugger could load a crash dump
>> with no additional information, and support a reasonable level of
>> debugging. I think looking up typed data structure values via global
>> symbols is part of that level, as well as simple backtraces and other
>> memory access.
>>
>> I wouldn't want to try to re-implement DWARF for debuginfo. If you have
>> the DWARF debuginfo, then your experience should be much better.
>>
>> >> - intelligent stack frame information from DWARF CFI (e.g.
>> >>   register/variable values)
>> >> - probably other things, I'm not a DWARF expert.
>> [...]
>> >> > Currently on my local machine, the vmlinux BTF's size is 4.2MB and
>> >> > adding 1MB would be a big increase. CONFIG_DEBUG_INFO_BTF_ALL is a good
>> >> > idea. But we might be able to just add global variables without this
>> >> > new config if we have strong use case.
>> >
>> >> And unfortunately 1MiB is really just a shot in the dark, guessing
>> >> around 70k variables with no string data.
>> >
>> > Maybe we can have a separate BTF file with all this extra info that
>> > could be fetched from somewhere, keyed by build-id, like is now possible
>> > with debuginfod and DWARF?
>>
>> For me, this ranges into the territory of duplicating DWARF. If you lose
>> the one key advantage of "debuginfoless debugging", then you might as
>> well use the build-id to lookup DWARF debuginfo as we can today.
>>
>> This is why I'm trying to propose the means of combining the kallsyms
>> string data with BTF. Anything that can make the overall size increase
>> manageable so that all the necessary data can stay in the kernel image.
>
> I think this quirk of using kallsyms strings is a no-go. But we should
> experiment and see how much bigger BTF becomes when including all the
> variables. Can you try to prototype pahole's support for this?

Hi Andrii,

Sorry for such a delay here. I tried to prototype this last month but
encountered some issues I couldn't resolve. But recently I picked it up
and I've created a prototype [1] which outputs all variables. (It's
quite a rough prototype: it strips out some useful logic regarding the
BTF_VAR_DATASEC for percpu variables. But I think it's good enough.)

On my 5.4-based kernel I saw an increase in BTF section size from about
3.7 MiB all the way to 6.1 MiB, or more precisely:

BTF section before: 3905938 bytes
BTF section after:  6391989 bytes (+2486051, +63.6%)
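
For anyone who wants to re-check the arithmetic, here is a small sketch
that recomputes the delta and percentage from the two byte counts above.
It also revisits the earlier "70k variables, no string data" guess,
under the assumption (mine, not measured) that each variable costs one
BTF_KIND_VAR record: a 12-byte struct btf_type plus a 4-byte struct
btf_var.

```python
MIB = 1024 * 1024

before = 3905938  # BTF section size before the change, in bytes
after = 6391989   # BTF section size with all variables emitted, in bytes

delta = after - before
pct = 100.0 * delta / before

print(f"before: {before / MIB:.2f} MiB")
print(f"after:  {after / MIB:.2f} MiB")
print(f"delta:  {delta} bytes = {delta / MIB:.2f} MiB (+{pct:.1f}%)")

# Earlier rough estimate from the thread: ~70k variables, no string data.
# Assumption: one BTF_KIND_VAR record per variable, i.e. a
# struct btf_type (12 bytes) followed by a struct btf_var (4 bytes).
BTF_VAR_RECORD = 12 + 4
est_vars = 70_000
est_bytes = est_vars * BTF_VAR_RECORD
print(f"estimate without strings: {est_bytes} bytes = {est_bytes / MIB:.2f} MiB")
```

The gap between that estimate and the measured delta would be
consistent with name strings and the extra variables the prototype
emits, though I haven't broken the increase down.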

So almost a 2.5 MiB increase. My prototype no longer outputs the
btf_var_secinfo structs for percpu variables, which probably breaks
some BPF programs and makes the BTF slightly smaller. On the other
hand, it also outputs a few thousand "dwarf variables" which were
correctly filtered out before. I think the two effects roughly cancel
out, so this is still a pretty good comparison.

Clearly it can't be added without a configuration option, as 2.5 MiB is
pretty huge for a kernel memory addition. But I don't think it's so huge
that nobody would enable it. I know I would :)

[1]: https://github.com/brenns10/dwarves/tree/remove_percpu_restriction_1

> As you
> said, we can guard this extra information with KConfig and pahole
> flags, so distros can always opt-out of bigger BTF if that's too
> prohibitive. As it is right now, without firm understanding how big
> the final BTF is it's hard to make a good decision about go or no-go
> for this.

Hopefully this comparison sheds some light on that now!

>
> As for including source code itself, it's going to be prohibitively
> huge, so it's probably out of the question for now as well.

Yeah, I wouldn't advocate for that.

Now, to share some of the cool possibilities that this enables. I have:
- prototype pahole [1] used for the kernel build,
- a prototype drgn with BTF+kallsyms support [2],
- some small kernel patches which add symbols to vmcoreinfo, so that
  drgn can find the kallsyms section. I'm happy to share these, I just
  haven't sent them anywhere yet.

[2]: https://github.com/brenns10/drgn/tree/kallsyms_plus_btf

Combining these three things, I've got a debugger which can open up a
vmcore _without DWARF debuginfo_ and allow you to print out typed
variable values. It just relies on BTF + kallsyms.
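
To make the idea concrete, here is a toy sketch of the lookup, not the
actual drgn code. Everything in it is hypothetical: the symbol names,
addresses, type table, and memory image are made-up stand-ins for
kallsyms, BTF, and the vmcore respectively. The point is only that
kallsyms supplies the address, BTF supplies the type, and the two
together are enough to decode the bytes.

```python
import struct

# Hypothetical kallsyms: symbol name -> address (values are made up).
KALLSYMS = {"demo_counter": 0x1000, "demo_pair": 0x2000}

# Hypothetical BTF-like info.
# Variable name -> type name:
BTF_VARS = {"demo_counter": "u64", "demo_pair": "struct pair"}
# Type name -> list of (member name, byte offset, struct format).
# A scalar type has a single unnamed member at offset 0.
BTF_TYPES = {
    "u64": [(None, 0, "<Q")],
    "struct pair": [("first", 0, "<i"), ("second", 8, "<Q")],
}

# Hypothetical memory image standing in for the vmcore.
MEMORY = bytearray(0x3000)
struct.pack_into("<Q", MEMORY, 0x1000, 42)
struct.pack_into("<i", MEMORY, 0x2000, -7)
struct.pack_into("<Q", MEMORY, 0x2008, 99)


def read_variable(name):
    """Return (type name, decoded value) for a global variable."""
    addr = KALLSYMS[name]        # where the variable lives
    type_name = BTF_VARS[name]   # what type it has
    fields = {}
    for member, offset, fmt in BTF_TYPES[type_name]:
        (value,) = struct.unpack_from(fmt, MEMORY, addr + offset)
        if member is None:       # scalar: return the bare value
            return type_name, value
        fields[member] = value
    return type_name, fields


print(read_variable("demo_counter"))
print(read_variable("demo_pair"))
```

A real implementation also has to handle pointers, arrays, nested
structs, bitfields, and address translation, none of which is shown
here.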

So the concept is proven, and I'm quite excited about it!

Stephen