From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Yonghong Song <yhs@fb.com>, Andrii Nakryiko <andrii@kernel.org>,
	Jiri Olsa <jolsa@kernel.org>,
	dwarves@vger.kernel.org, bpf <bpf@vger.kernel.org>,
	Kernel Team <kernel-team@fb.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Parallelizing vmlinux BTF encoding. was Re: [RFT] Testing 1.22
Date: Tue, 15 Jun 2021 16:38:12 -0300	[thread overview]
Message-ID: <YMkBpBfqW+AcpyNN@kernel.org> (raw)
In-Reply-To: <CAEf4BzaDNim+kFQx64i9EZogctGZNFigQBsog7eC6DjrfjTbEA@mail.gmail.com>

On Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> > On Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko wrote:
> > > > I think it's very fragile and it will be easy to get
> > > > broken/invalid/incomplete BTF. Yonghong already brought up the case

> > > I thought about that, as it would be almost like the compiler
> > > generating BTF, but you are right: the vmlinux prep process is a
> > > complex beast, so it is probably best to go with the second approach
> > > I outlined, which you agreed is less fragile. I'll go with that,
> > > thanks for your comments.

> > So, just to write some notes here from what I saw so far:

> > 1. In the LTO cases there are inter-CU references, so the current code
> > combines all CUs into one and we end up not being able to parallelize
> > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > think the current algorithm is ideal; it can be improved.
 
> Yeah, let's worry about LTO later.
 
> > 2. The case where there are no inter-CU refs, which so far is the most
> > common, seems easier: we create N threads, all sharing the dwarf_loader
> > state and the btf_encoder, as-is now. We can process one CU per thread,
> > and as soon as one finishes, just grab a lock and call
> > btf_encoder__encode_cu() with the just-produced CU data structures
> > (tags, types, functions, variables, etc.), consume them and delete the
> > CU.
> >
> > So each thread will consume one CU, push it to the 'struct btf' class
> > as-is now, then ask for the next CU (using the dwarf_loader state,
> > still under that lock), go back to processing DWARF tags, then lock,
> > add the BTF types, rinse, repeat.
> 
> Hmm... wouldn't keeping a "local" per-thread struct btf and just
> appending to it for each processed CU until we run out of CUs be
> simpler?

I thought about this as a logical next step. I would love to have a
'btf__merge_argv(struct btf *btf[])', is there one?

But from what I've read after this first paragraph of yours, lemme try
to rephrase:

1. pahole calls btf_encoder__new(...)

   Creates a single struct btf.

2. dwarf_loader will create N threads; each will call a
dwarf_get_next_cu() that is locked and returns a CU to process. When a
thread finishes its CU, it calls btf_encoder__encode_cu() under an
all-threads lock. Rinse, repeat.

Until all the threads have consumed all CUs.

Then btf_encoder__encode(), which should probably be renamed to
btf_encoder__finish(), will call btf__dedup(encoder->btf) and write the
ELF or raw file.
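
In code, roughly, this is what each loader thread would do. Just a
sketch under my assumptions: dwarf_get_next_cu() is the name I used
above, while cu__process_dwarf() and cu__delete() are made-up names for
the shared dwarf_loader state bits; btf_encoder__encode_cu() and
btf__dedup() are the entry points we're actually discussing:

	#include <pthread.h>
	#include <stddef.h>

	/* Hypothetical declarations, for the sketch only */
	struct cu;
	struct btf_encoder;
	struct cu *dwarf_get_next_cu(void);    /* shared loader state */
	void cu__process_dwarf(struct cu *cu); /* DWARF tags -> pahole types */
	int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu);
	void cu__delete(struct cu *cu);

	static pthread_mutex_t dwarf_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;

	static void *dwarf_loader__worker(void *arg)
	{
		struct btf_encoder *encoder = arg;
		struct cu *cu;

		for (;;) {
			pthread_mutex_lock(&dwarf_lock);
			cu = dwarf_get_next_cu();  /* locked: shared state */
			pthread_mutex_unlock(&dwarf_lock);
			if (cu == NULL)
				break;             /* all CUs consumed */

			cu__process_dwarf(cu);     /* unlocked: the expensive part */

			pthread_mutex_lock(&btf_lock);
			btf_encoder__encode_cu(encoder, cu); /* single struct btf */
			pthread_mutex_unlock(&btf_lock);

			cu__delete(cu);
		}
		return NULL;
	}
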

My first reaction to your first paragraph was:

Yeah, we can have multiple 'struct btf' instances, one per thread, each
containing a subset of DWARF CUs encoded as BTF, and then I have to
merge the per-thread BTFs and then dedup. I think my rephrase above is
better, no?

> So each thread does as much as possible locally without any
> locks. And only at the very end do we merge everything together and
> then dedup. Or we can even dedup inside each worker before merging the
> final btf; that would probably give quite a lot of speed-up and some
> memory savings. Would be interesting to experiment with that.
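
Something like this, I guess? Again just a sketch: btf__merge_argv() is
the API I ask about above, which doesn't exist in libbpf today, and
encode_cu_into() is a made-up name for the local, lock-free encoding
step (re-using the hypothetical dwarf_get_next_cu()/cu__delete() from
the sketch above); btf__new_empty() and btf__dedup() are real libbpf:

	#include <bpf/btf.h>

	static void *local_btf_worker(void *arg)
	{
		struct btf *btf = btf__new_empty(); /* per-thread, no locks */
		struct cu *cu;

		while ((cu = dwarf_get_next_cu()) != NULL) { /* only this is locked */
			encode_cu_into(btf, cu); /* append CU types locally */
			cu__delete(cu);
		}
		btf__dedup(btf, NULL, NULL); /* optional per-worker dedup;
						(btf, btf_ext, opts) in
						today's libbpf */
		return btf; /* main thread: btf__merge_argv() + final dedup */
	}
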
> 
> So I like the idea of a fixed pool of threads (can be customized, and
> I'd default to num_workers == num_cpus), but I think we can and should
> keep them independent for as long as possible.

Sure, this should map to whatever the user passes to -j in the kernel
make command line; if nothing is passed as an argument, then default to
getconf(_NPROCESSORS_ONLN).
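
I.e., in code, something like this, with nr_jobs being a made-up name
for whatever -j ends up setting; sysconf() is what getconf(1) uses
underneath:

	#include <unistd.h>

	long nr_jobs = 0; /* whatever -j set, 0 if absent */

	if (nr_jobs <= 0)
		nr_jobs = sysconf(_SC_NPROCESSORS_ONLN);
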

There is a nice coincidence here: we probably don't care about -J
anymore and want to deal only with -j (detached BTF), which is the same
flag 'make' uses to state how many "jobs" (thread pool size) the user
wants 8-)
 
> Another disadvantage of generating small struct btfs and then doing
> lock + merge is that we don't get as efficient string re-use; we'll
> churn more on string memory allocation. Keeping bigger local struct
> btfs allows for more efficient memory re-use (and probably a tiny bit
> of CPU savings).

I think we're on the same page: the contention for adding a CU to a
single 'struct btf' (amongst all the DWARF loading threads) right after
we produce it should be minimal, so we grab all the advantages:
locality of reference and minimal contention, as the cost of reading
DWARF and creating pahole's internal, loader-neutral data structures
should be higher than that of adding types/functions/variables via the
libbpf BTF API.

I.e. we can leave parallelizing the BTF _encoding_ for later; what
we're trying to do now is to parallelize the DWARF _loading_, right?
 
> So please consider that, it also seems simpler overall.
 
> > The ordering will be different than what we have now, as some smaller
> > CUs (object files with debug info) will be processed faster and so will
> > get their BTF encoding slots sooner, but that shouldn't make a
> > difference at btf__dedup() time, right?
 
> Right, order doesn't matter.
 
> > I think I'm done with refactoring the btf_encoder code, which should
> > by now be a thin layer on top of the excellent libbpf BTF API, just
> > getting what the previous loader (DWARF) produced and feeding libbpf.
 
> Great.
 
> > I thought about fancy thread pools, etc., researching some pre-existing
> > thing or lifting some kthread + workqueue code from the kernel, but I
> > will instead start with the most spartan code; we can improve later.
 
> Agree, simple is good. Really curious how much faster we can get. I
> think anything fancy will give a relatively small improvement. The
> biggest one will come from any parallelization.

And I think that is possible, modulo the elfutils libraries saying no;
I hope that will not be the case.
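
The spartan version being just a fixed pool, something like the below,
with dwarf_loader__worker() being the sketch from earlier in this
message and nr_jobs the -j value:

	#include <pthread.h>

	pthread_t workers[nr_jobs];
	int i;

	for (i = 0; i < nr_jobs; i++)
		pthread_create(&workers[i], NULL, dwarf_loader__worker, encoder);
	for (i = 0; i < nr_jobs; i++)
		pthread_join(workers[i], NULL);

	/* i.e. the to-be-renamed btf_encoder__finish(): btf__dedup() + write */
	btf_encoder__encode(encoder);
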

- Arnaldo
