linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Ian Rogers <irogers@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>,
	linux-fsdevel@vger.kernel.org, brauner@kernel.org,
	 viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
	 linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	gregkh@linuxfoundation.org,  linux-mm@kvack.org
Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
Date: Mon, 6 May 2024 11:58:32 -0700	[thread overview]
Message-ID: <CAEf4Bza+XwjaO0qs1a-5jN5xX11wP2jJRGoM4_HFphiF_NSD2w@mail.gmail.com> (raw)
In-Reply-To: <CAP-5=fV+K9-ggiaPR7xTVP2p=enLZf0ARBcjb4QG+7kv-00q7w@mail.gmail.com>

On Sat, May 4, 2024 at 10:26 PM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> >
> > This patch set was based on top of next-20240503 tag in linux-next tree.
> > Not sure what should be the target tree for this, I'd appreciate any guidance,
> > thank you!
> >
> > Andrii Nakryiko (5):
> >   fs/procfs: extract logic for getting VMA name constituents
> >   fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
> >   tools: sync uapi/linux/fs.h header into tools subdir
> >   selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
> >   selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
>
> I'd love to see improvements like this for the Linux perf command.
> Some thoughts:
>
>  - Could we do something scalability wise better than a file
> descriptor per pid? If a profiler is running in a container the cost
> of many file descriptors can be significant, and something that
> increases as machines get larger. Could we have a /proc/maps for all
> processes?

It's probably not a question to me, as it seems like an entirely
different set of APIs. But it also seems a bit convoluted to mix
together information about many address spaces.

As for the cost of FDs, I haven't run into this limitation, and it
seems like the trend in Linux in general is towards "everything is a
file". Just look at pidfd, for example.

Also, having a fd that can be queries has an extra nice property. For
example, opening /proc/self/maps (i.e., process' own maps file)
doesn't require any extra permissions, and then it can be transferred
to another trusted process that would do address
resolution/symbolization. In practice right now it's unavoidable to
add extra caps/root permissions to the profiling process even if the
only thing that it needs is contents of /proc/<pid>/maps (and the use
case is as benign as symbol resolution). Not having an FD for this API
would make this use case unworkable.

>
>  - Something that is broken in perf currently is that we can race
> between reading /proc and opening events on the pids it contains. For
> example, perf top supports a uid option that first scans to find all
> processes owned by a user then tries to open an event on each process.
> This fails if the process terminates between the scan and the open
> leading to a frequent:
> ```
> $ sudo perf top -u `id -u`
> The sys_perf_event_open() syscall returned with 3 (No such process)
> for event (cycles:P).
> ```
> It would be nice for the API to consider cgroups, uids and the like as
> ways to get a subset of things to scan.

This seems like putting too much into an API, tbh. It feels like
mapping cgroupos/uids to their processes is its own way and if we
don't have efficient APIs to do this, we should add it. But conflating
it into "get VMAs from this process" seems wrong to me.

>
>  - Some what related, the mmap perf events give data after the mmap
> call has happened. As VMAs get merged this can lead to mmap perf
> events looking like the memory overlaps (for jits using anonymous
> memory) and we lack munmap/mremap events.

Is this related to "VMA generation" that Arnaldo mentioned? I'd
happily add it to the new API, as it's easily extensible, if the
kernel already maintains it. If not, then it should be a separate work
to discuss whether kernel *should* track this information.

>
> Jiri Olsa has looked at improvements in this area in the past.
>
> Thanks,
> Ian
>
> >  fs/proc/task_mmu.c                            | 290 +++++++++++---
> >  include/uapi/linux/fs.h                       |  32 ++
> >  .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
> >  tools/testing/selftests/bpf/.gitignore        |   1 +
> >  tools/testing/selftests/bpf/Makefile          |   2 +-
> >  tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
> >  tools/testing/selftests/bpf/test_progs.c      |   3 +
> >  tools/testing/selftests/bpf/test_progs.h      |   2 +
> >  tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
> >  9 files changed, 763 insertions(+), 70 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/procfs_query.c
> >
> > --
> > 2.43.0
> >
> >

      reply	other threads:[~2024-05-06 18:58 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-04  0:30 [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 1/5] fs/procfs: extract logic for getting VMA name constituents Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
2024-05-04 15:28   ` Greg KH
2024-05-04 21:50     ` Andrii Nakryiko
2024-05-06 13:58       ` Arnaldo Carvalho de Melo
2024-05-06 18:05         ` Namhyung Kim
2024-05-06 18:51           ` Andrii Nakryiko
2024-05-06 18:53           ` Arnaldo Carvalho de Melo
2024-05-06 19:16             ` Arnaldo Carvalho de Melo
2024-05-07 21:55               ` Namhyung Kim
2024-05-06 18:41         ` Andrii Nakryiko
2024-05-06 20:35           ` Arnaldo Carvalho de Melo
2024-05-07 16:36             ` Andrii Nakryiko
2024-05-04 23:36   ` kernel test robot
2024-05-07 18:10   ` Liam R. Howlett
2024-05-07 18:52     ` Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 3/5] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 4/5] selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
2024-05-04 15:29   ` Greg KH
2024-05-04 21:57     ` Andrii Nakryiko
2024-05-05  5:09       ` Ian Rogers
2024-05-06 18:32         ` Andrii Nakryiko
2024-05-06 18:43           ` Ian Rogers
2024-05-07  5:06             ` Andrii Nakryiko
2024-05-07 17:29               ` Andrii Nakryiko
2024-05-07 22:27                 ` Namhyung Kim
2024-05-07 22:56                   ` Andrii Nakryiko
2024-05-08  0:36                     ` Arnaldo Carvalho de Melo
2024-05-04 15:32   ` Greg KH
2024-05-04 22:13     ` Andrii Nakryiko
2024-05-07 15:48       ` Liam R. Howlett
2024-05-07 16:10         ` Matthew Wilcox
2024-05-07 16:18           ` Liam R. Howlett
2024-05-07 16:27         ` Andrii Nakryiko
2024-05-07 18:06           ` Liam R. Howlett
2024-05-07 19:00             ` Andrii Nakryiko
2024-05-08  1:20               ` Liam R. Howlett
2024-05-04 11:24 ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
2024-05-04 15:33   ` Greg KH
2024-05-04 21:50     ` Andrii Nakryiko
2024-05-04 21:50   ` Andrii Nakryiko
2024-05-05  5:26 ` Ian Rogers
2024-05-06 18:58   ` Andrii Nakryiko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEf4Bza+XwjaO0qs1a-5jN5xX11wP2jJRGoM4_HFphiF_NSD2w@mail.gmail.com \
    --to=andrii.nakryiko@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brauner@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=irogers@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).