Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Alexei Starovoitov <ast@fb.com>
Cc: David Ahern <dsahern@gmail.com>,
	Song Liu <liu.song.a23@gmail.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Alexei Starovoitov <ast@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Networking <netdev@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Date: Wed, 17 Oct 2018 18:31:19 -0300	[thread overview]
Message-ID: <20181017213119.GC4589@kernel.org> (raw)
In-Reply-To: <c5331ed8-4a00-bc87-bdd9-c4ecfe91bf30@fb.com>

Em Wed, Oct 17, 2018 at 07:08:37PM +0000, Alexei Starovoitov escreveu:
> On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu:
> >> On 10/17/18 8:09 AM, David Ahern wrote:
> >>> On 10/16/18 11:43 PM, Song Liu wrote:
> >>>> I agree that processing events while recording has significant overhead.
> >>>> In this case, perf user space need to know details about the the jited BPF
> >>>> program. It is impossible to pass all these details to user space through
> >>>> the relatively stable ring_buffer API. Therefore, some processing of the
> >>>> data is necessary (get bpf prog_id from ring buffer, and then fetch program
> >>>> details via BPF_OBJ_GET_INFO_BY_FD.
> >>>>
> >>>> I have some idea on processing important data with relatively low overhead.
> >>>> Let me try implement it.
> >>>>
> >>>
> >>> As I understand it, you want this series:
> >>>
> >>>  kernel: add event to perf buffer on bpf prog load
> >>>
> >>>  userspace: perf reads the event and grabs information about the program
> >>> from the fd
> >>>
> >>> Is that correct?
> >>>
> >>> Userpsace is not awakened immediately when an event is added the the
> >>> ring. It is awakened once the number of events crosses a watermark. That
> >>> means there is an unknown - and potentially long - time window where the
> >>> program can be unloaded before perf reads the event.
> >
> >>> So no matter what you do expecting perf record to be able to process the
> >>> event quickly is an unreasonable expectation.
> >
> >> yes... unless we go with threaded model as Arnaldo suggested and use
> >> single event as a watermark to wakeup our perf thread.
> >> In such case there is still a race window between user space waking up
> >> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> >> and some other process trying to instantly unload the prog it
> >> just loaded.
> >> I think such race window is extremely tiny and if perf misses
> >> those load/unload events it's a good thing, since there is no chance
> >> that normal pmu event samples would be happening during prog execution.

> >> The alternative approach with no race window at all is to burden kernel
> >> RECORD_* events with _all_ information about bpf prog. Which is jited
> >> addresses, jited image itself, info about all subprogs, info about line
> >> info, all BTF data, etc. As I said earlier I'm strongly against such
> >> RECORD_* bloating.
> >> Instead we need to find a way to process new RECORD_BPF events with
> >> single prog_id field in perf user space with minimal race
> >> and threaded approach sounds like a win to me.

> > There is another alternative, I think: put just a content based hash,
> > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> > validator does the jit, etc, it stashes the content that
> > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> > the kernel right after getting stuff from sys_bpf() and preparing it for
> > use, then we know that in (start, end) we have blob foo with content id,
> > that we will use to retrieve information that augments what we know with
> > just (start, end, id) and allows annotation, etc.

> > That stash space for jitted stuff gets garbage collected from time to
> > time or is even completely disabled if the user is not interested in
> > such augmentation, just like one can do disabling perf's ~/.debug/
> > directory hashed by build-id.

> > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> > cookie and we solve the other race we already have with kernel modules,
> > DSOs, etc.

> > I have mentioned this before, there were objections, perhaps this time I
> > formulated in a different way that makes it more interesting?

> that 'content based hash' we already have. It's called program tag.

But that was calculated by whom? Userspace? It can't do that, its the
kernel that ultimately puts together, from what userspace gave it, what
we need to do performance analysis, line numbers, etc.

> and we already taught iovisor/bcc to stash that stuff into
> /var/tmp/bcc/bpf_prog_TAG/ directory.
> Unfortunately that approach didn't work.
> JITed image only exists in the kernel. It's there only when

That is why I said that _the_ kernel stashes that thing, not bcc/iovisor
or perf, the kernel calculates the hash, and it also puts that into the
PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place
the kernel stashed it.

> program is loaded and it's the one that matter the most for performance
> analysis, since sample IPs are pointing into it.

Agreed

> Also the line info mapping that user space knows is not helping much
> either, since verifier optimizes the instructions and then JIT
> does more. The debug_info <-> JIT relationship must be preserved
> by the kernel and returned to user space.

Violent agreement :-)

> The only other non-race way is to keep all that info in the kernel after
> program is unloaded, but that is even worse than bloating RECORD*s

Keep all that info in a file, as I described above. Or keep it for a
while, to give that thread in userspace time to get it and tell the
kernel that it can trow it away.

It may well be that most of the time the 'perf record' thread catching
those events picks that up and saves it in
/var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded,
no?

- Arnaldo