From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Yonghong Song <yhs@fb.com>, Andrii Nakryiko <andrii@kernel.org>,
Jiri Olsa <jolsa@kernel.org>,
dwarves@vger.kernel.org, bpf <bpf@vger.kernel.org>,
Kernel Team <kernel-team@fb.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22
Date: Tue, 15 Jun 2021 16:38:12 -0300 [thread overview]
Message-ID: <YMkBpBfqW+AcpyNN@kernel.org> (raw)
In-Reply-To: <CAEf4BzaDNim+kFQx64i9EZogctGZNFigQBsog7eC6DjrfjTbEA@mail.gmail.com>
Em Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko escreveu:
> On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > Em Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko escreveu:
> > > > I think it's very fragile and it will be easy to get
> > > > broken/invalid/incomplete BTF. Yonghong already brought up the case
> > > I thought about that as it would be almost like the compiler generating
> > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > probably it is best to go with the second approach I outlined and you
> > > agreed to be less fragile, so I'll go with that, thanks for your
> > > comments.
> > So, just to write some notes here from what I saw so far:
> > 1. In the LTO cases there are inter-CU references, so the current code
> > combines all CUs into one and we end up not being able to parallelize
> > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > think the current algorithm is ideal, can be improved.
> Yeah, let's worry about LTO later.
> > 2. The case where there's no inter CU refs, which so far is the most
> > common, seems easier, we create N threads, all sharing the dwarf_loader
> > state and the btf_encoder, as-is now. we can process one CU per thread,
> > and as soon as we finish it, just grab a lock and call
> > btf_encoder__encode_cu() with the just produced CU data structures
> > (tags, types, functions, variables, etc), consume them and delete the
> > CU.
> >
> > So each thread will consume one CU, push it to the 'struct btf' class
> > as-is now and then ask for the next CU, using the dwarf_loader state,
> > still under that lock, then go back to processing dwarf tags, then
> > lock, btf add types, rinse, repeat.
>
> Hmm... wouldn't it be simpler to keep a "local" per-thread struct btf
> and just keep appending to it for each processed CU until we run out
> of CUs?
I thought about this as a logical next step; I would love to have a
'btf__merge_argv(struct btf *btf[])', is there one?

But from what I've read after this first paragraph of yours, lemme try
to rephrase:
1. pahole calls btf_encoder__new(...), which creates a single struct
btf.

2. dwarf_loader creates N threads; each calls a dwarf_get_next_cu() that
is locked and returns a CU to process. When a thread finishes its CU, it
calls btf_encoder__encode_cu() under an all-threads lock. Rinse, repeat,
until all the threads have consumed all CUs (see the sketch below).

Then btf_encoder__encode(), which should probably be renamed to
btf_encoder__finish(), calls btf__dedup(encoder->btf) and writes the ELF
section or the raw file.
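
A minimal sketch of that worker loop, using plain pthreads and the names
above; cu__process() is just a stand-in for the existing per-CU DWARF
loading code, error handling is omitted, and none of this is final API:

#include <pthread.h>

static pthread_mutex_t cus_lock = PTHREAD_MUTEX_INITIALIZER; /* dwarf_loader state */
static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER; /* shared btf_encoder   */

static void *dwarf_loader__worker(void *arg)
{
	struct btf_encoder *encoder = arg;
	struct cu *cu;

	for (;;) {
		pthread_mutex_lock(&cus_lock);
		cu = dwarf_get_next_cu();	/* next CU to process, NULL when done */
		pthread_mutex_unlock(&cus_lock);
		if (cu == NULL)
			break;

		cu__process(cu);		/* DWARF -> pahole core data structures, no locks held */

		pthread_mutex_lock(&btf_lock);
		btf_encoder__encode_cu(encoder, cu); /* feed the single struct btf */
		pthread_mutex_unlock(&btf_lock);

		cu__delete(cu);			/* done with this CU, free it */
	}

	return NULL;
}
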
My first reaction to your first paragraph was:

Yeah, we can have multiple 'struct btf' instances, one per thread, each
containing a subset of the DWARF CUs encoded as BTF, and then I have to
merge the per-thread BTFs and then dedup. I think my rephrase above is
better, no?
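
If we do end up going per-thread, the merge helper I asked about could
be as simple as this sketch; it assumes libbpf's btf__add_btf(), which
appends all types from one btf into another while remapping type IDs
and string offsets (if the libbpf in use doesn't have it, a manual
per-type copy loop would be needed instead):

#include <bpf/btf.h>

static struct btf *btf__merge_argv(struct btf *btfs[], int nr_btfs)
{
	struct btf *merged = btf__new_empty();
	int i;

	if (merged == NULL)
		return NULL;

	for (i = 0; i < nr_btfs; i++) {
		/* append all types from btfs[i], remapping ids/strings */
		if (btf__add_btf(merged, btfs[i]) < 0) {
			btf__free(merged);
			return NULL;
		}
	}

	return merged;	/* the caller then runs btf__dedup() on the result */
}
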
> So each thread does as much as possible locally without any
> locks. And only at the very end we merge everything together and then
> dedup. Or we can even dedup inside each worker before merging final
> btf, that probably would give quite a lot of speed up and some memory
> saving. Would be interesting to experiment with that.
>
> So I like the idea of a fixed pool of threads (can be customized, and
> I'd default to num_workers == num_cpus), but I think we can and should
> keep them independent for as long as possible.
Sure, this should map to whatever the user passes to -j on the kernel
make command line; if nothing is passed as an argument, then default to
the number of online CPUs (getconf _NPROCESSORS_ONLN).
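
I.e. something as trivial as this, using the C-level equivalent of that
getconf query:

#include <unistd.h>

/* Default the worker pool size to the number of online CPUs when no -j N is given. */
static int default_nr_jobs(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 0 ? (int)n : 1;
}
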
There is a nice coincidence here: we probably don't care about -J
anymore and want to deal only with -j (detached BTF), which is the same
letter 'make' uses to state how many "jobs" (thread pool size) the user
wants 8-)
> Another disadvantage of generating small struct btfs and then lock +
> merge is that we don't get as efficient string re-use; we'll churn
> more on string memory allocation. Keeping bigger local struct btfs
> allows for more efficient memory re-use (and probably a tiny bit of
> CPU savings).
I think we're on the same page: the contention for adding a CU to the
single 'struct btf' (shared amongst all DWARF loading threads) right
after we produced it should be minimal, so we get all the advantages:
locality of reference and little lock contention, since reading DWARF
and creating pahole's internal, neutral data structures should cost far
more than adding types/functions/variables via the libbpf BTF API.

I.e. we can leave parallelizing the BTF _encoding_ for later; what we're
trying to do now is to parallelize the DWARF _loading_, right?
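
To make "the BTF encoding stays single threaded" concrete, the finishing
step remains roughly what we have today. A sketch for the raw (detached)
file case only, with the struct below standing in for pahole's internal
encoder; note that btf__dedup()'s signature differs between libbpf
versions (older ones take an extra btf_ext argument) and btf__raw_data()
used to be btf__get_raw_data(), so adjust to the libbpf in use:

#include <stdio.h>
#include <bpf/btf.h>

struct btf_encoder {	/* stand-in for pahole's internal encoder */
	struct btf *btf;
	/* ... other encoder state ... */
};

static int btf_encoder__finish(struct btf_encoder *encoder, const char *filename)
{
	const void *raw;
	__u32 size;
	FILE *fp;
	int err;

	err = btf__dedup(encoder->btf, NULL);		/* dedup the single shared btf */
	if (err)
		return err;

	raw = btf__raw_data(encoder->btf, &size);	/* serialized BTF blob */
	if (raw == NULL)
		return -1;

	fp = fopen(filename, "wb");			/* detached raw BTF; the ELF path uses libelf */
	if (fp == NULL)
		return -1;

	err = fwrite(raw, 1, size, fp) == size ? 0 : -1;
	fclose(fp);
	return err;
}
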
> So please consider that, it also seems simpler overall.
> > The ordering will be different than what we have now, as some smaller
> > CUs (object files with debug) will be processed faster so will get its
> > btf encoding slot faster, but that, at btf__dedup() time shouldn't make
> > a difference, right?
> Right, order doesn't matter.
> > I think I'm done with refactoring the btf_encoder code, which should be
> > by now a thin layer on top of the excellent libbpf BTF API, just getting
> > what the previous loader (DWARF) produced and feeding libbpf.
> Great.
> > I thought about fancy thread pools, etc, researching some pre-existing
> > thing or doing some kthread + workqueue lifting from the kernel but will
> > instead start with the most spartan code, we can improve later.
> Agree, simple is good. Really curious how much faster we can get. I
> think anything fancy will give a relatively small improvement. The
> biggest one will come from any parallelization.
And I think that is possible, modulo the elfutils libraries saying no,
which I hope will not be the case.
- Arnaldo