From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>,
	Ian Rogers <irogers@google.com>, Jiri Olsa <jolsa@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>, Hao Luo <haoluo@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	bpf@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, Martin KaFai Lau <kafai@fb.com>,
	Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@chromium.org>,
	Stanislav Fomichev <sdf@google.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Namhyung Kim <namhyung@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	kernel-team@meta.com
Subject: Re: [PATCHv3 bpf-next 0/9] mm/bpf/perf: Store build id in file object
Date: Fri, 17 Mar 2023 23:08:44 -0700	[thread overview]
Message-ID: <CAEf4BzYQ-bktO9s8yhBk7xUoz=2NFrgdGviWsN2=HWPBaGv6hA@mail.gmail.com> (raw)
In-Reply-To: <20230317212125.GA3390869@ZenIV>

On Fri, Mar 17, 2023 at 2:21 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> > On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> >
> > > > But build IDs are _generally_ available.  The only problem (AIUI)
> > > > is when you're trying to examine the contents of one container from
> > > > another container.  And to solve that problem, you're imposing a cost
> > > > on everybody else with (so far) pretty vague justifications.  I really
> > > > don't like to see you growing struct file for this (nor struct inode,
> > > > nor struct vm_area_struct).  It's all quite unsatisfactory and I don't
> > > > have a good suggestion.
> > >
> > > There is a lot of profiling, observability and debugging tooling built
> > > using BPF. And when capturing stack traces from BPF programs, if the
> > > build ID note is not physically present in memory, fetching it from
> > > the BPF program might fail in NMI (and other non-faultable contexts).
> > > This patch set is about making sure we always can fetch build ID, even
> > > from most restrictive environments. It's guarded by Kconfig to avoid
> > > adding 8 bytes of overhead to struct file for environment where this
> > > might be unacceptable, giving users and distros a choice.
> >
> > Lovely.  As an exercise you might want to collect the stats on the
> > number of struct file instances on the system vs. the number of files
> > that happen to be ELF objects and are currently mmapped anywhere.

That's a good suggestion. I wrote a simple script using the drgn
tool ([0]), which enables nice introspection of kernel memory state
on a running kernel. The script is at the bottom ([1]) for anyone to
sanity check. I didn't try to figure out which files are mmaped as
executable and which aren't, so let's take the worst case and assume
that none of the files are executable, and thus that the 8-byte
pointer is wasted for all of them.

On my devserver I got:

task_cnt=15984 uniq_file_cnt=56780

On randomly chosen production host I got:

task_cnt=6387 uniq_file_cnt=22514

So it seems like my devserver is "busier" than the production host. :)

The numbers above suggest that my devserver's kernel has about 57,000
*unique* `struct file *` instances. At 8 bytes each, that's roughly
450KB of overhead. That's not much by any modern standard.

But let's say I'm way off and we have 1 million struct file instances.
That's 8MB of overhead. I'd argue that 8MB is not a big deal even on a
normal laptop, and even less so on production servers. Especially so
if a system really does have 1 million active struct file instances,
because far more memory will already be in use for
application-specific needs.
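The back-of-envelope arithmetic above can be sketched as a toy
calculation (the 1 million figure is a deliberately pessimistic upper
bound, not a measurement):

```python
# Cost of one extra 8-byte pointer per struct file, on a 64-bit kernel.
POINTER_SIZE = 8  # bytes

def overhead_bytes(file_cnt: int) -> int:
    return file_cnt * POINTER_SIZE

# Measured on the devserver via the drgn script below:
print(f"devserver: {overhead_bytes(56780) / 1024:.0f} KiB")
# Deliberately pessimistic worst case:
print(f"worst case: {overhead_bytes(1_000_000) / 1024 / 1024:.0f} MiB")
```

This prints roughly 444 KiB for the devserver and 8 MiB for the
pessimistic 1-million-files case, matching the figures above.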


> > That does depend upon the load, obviously, but it's not hard to collect -
> > you already have more than enough hooks inserted in the relevant places.
> > That might give a better appreciation of the reactions...
>
> One possibility would be a bit stolen from inode flags + hash keyed by
> struct inode address (middle bits make for a decent hash function);
> inode eviction would check that bit and kick the corresponding thing
> from hash if the bit is set.
>
> Associating that thing with inode => hash lookup/insert + set the bit.

This is an interesting idea, but it runs into a few unnecessary
problems. We would need a global, dynamically sized hash map in the
system. If we fix the number of buckets, we risk either wasting memory
on an underutilized system (if we oversize) or hitting performance
problems due to collisions on a busy system with lots of executables
mapped in memory (if we undersize). If we don't pre-size, then we are
talking about reallocations, rehashing, and doing all of that under a
global lock or something similar. Further, we'd have to take
per-bucket locks, which causes problems for looking up a build ID from
this hash map in NMI context for perf events and BPF programs: locks
can't be safely taken under those conditions, so fetching the build ID
would still be unreliable (though less so than it is today, of
course).

All of this is solvable to some degree (though not perfectly, and not
with simple and elegant approaches), but it seems like unnecessary
overcomplication compared to the amount of memory we hope to save. A
Kconfig-guarded 8-byte field per struct file still feels like a
reasonable price for reliable build ID information for
profiling/tracing tools.


  [0] https://drgn.readthedocs.io/en/latest/index.html

  [1] Script I used:

from drgn.helpers.linux.pid import for_each_task
from drgn.helpers.linux.fs import for_each_file

task_cnt = 0
file_set = set()

for task in for_each_task(prog):
    task_cnt += 1
    try:
        for (fd, file) in for_each_file(task):
            file_set.add(file.value_())
    except Exception:
        # skip tasks whose file table can't be walked (e.g. kernel threads)
        pass

uniq_file_cnt = len(file_set)
print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")


Thread overview: 35+ messages
2023-03-16 17:01 [PATCHv3 bpf-next 0/9] mm/bpf/perf: Store build id in file object Jiri Olsa
2023-03-16 17:01 ` [PATCHv3 bpf-next 1/9] mm: " Jiri Olsa
2023-03-16 22:07   ` Andrii Nakryiko
2023-03-16 17:01 ` [PATCHv3 bpf-next 2/9] perf: Use file object build id in perf_event_mmap_event Jiri Olsa
2023-03-16 17:01 ` [PATCHv3 bpf-next 3/9] bpf: Use file object build id in stackmap Jiri Olsa
2023-03-16 22:07   ` Andrii Nakryiko
2023-03-16 17:01 ` [PATCHv3 bpf-next 4/9] bpf: Switch BUILD_ID_SIZE_MAX to enum Jiri Olsa
2023-03-16 22:07   ` Andrii Nakryiko
2023-03-16 17:01 ` [PATCHv3 bpf-next 5/9] selftests/bpf: Add read_buildid function Jiri Olsa
2023-03-16 22:23   ` Andrii Nakryiko
2023-03-30 22:05     ` Jiri Olsa
2023-03-16 17:01 ` [PATCHv3 bpf-next 6/9] selftests/bpf: Add err.h header Jiri Olsa
2023-03-16 22:24   ` Andrii Nakryiko
2023-03-16 17:01 ` [PATCHv3 bpf-next 7/9] selftests/bpf: Replace extract_build_id with read_build_id Jiri Olsa
2023-03-16 17:01 ` [PATCHv3 bpf-next 8/9] selftests/bpf: Add iter_task_vma_buildid test Jiri Olsa
2023-03-16 22:31   ` Andrii Nakryiko
2023-03-16 17:01 ` [PATCHv3 bpf-next 9/9] selftests/bpf: Add file_build_id test Jiri Olsa
2023-03-16 19:59   ` Daniel Borkmann
2023-03-16 22:36   ` Andrii Nakryiko
2023-03-16 17:34 ` [PATCHv3 bpf-next 0/9] mm/bpf/perf: Store build id in file object Matthew Wilcox
2023-03-16 17:50   ` Ian Rogers
2023-03-16 21:51     ` Andrii Nakryiko
2023-03-17  3:51       ` Matthew Wilcox
2023-03-17 16:33         ` Andrii Nakryiko
2023-03-17 21:14           ` Al Viro
2023-03-17 21:21             ` Al Viro
2023-03-18  6:08               ` Andrii Nakryiko [this message]
2023-03-18  8:34                 ` Jiri Olsa
2023-03-18  8:33   ` Jiri Olsa
2023-03-18 15:16     ` Matthew Wilcox
2023-03-18 17:40       ` Jiri Olsa
2023-03-22 15:45       ` Arnaldo Carvalho de Melo
2023-03-31 18:19         ` Andrii Nakryiko
2023-03-31 18:36           ` Matthew Wilcox
2023-03-31 20:27             ` Andrii Nakryiko
