From: Kui-Feng Lee <kuifeng@fb.com>
To: "andrii.nakryiko@gmail.com" <andrii.nakryiko@gmail.com>
Cc: "daniel@iogearbox.net" <daniel@iogearbox.net>,
Kernel Team <Kernel-team@fb.com>, Yonghong Song <yhs@fb.com>,
"ast@kernel.org" <ast@kernel.org>,
"andrii@kernel.org" <andrii@kernel.org>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Parameterize task iterators.
Date: Thu, 4 Aug 2022 23:05:12 +0000
Message-ID: <1e8ef7d38e7c412f6ca14b4a429e87df982bde1a.camel@fb.com>
In-Reply-To: <CAEf4BzZjpb514dpzxfeE_OzV6jTajObxc3UNpX1Wyr+0ZM7pLQ@mail.gmail.com>
On Tue, 2022-08-02 at 14:17 -0700, Andrii Nakryiko wrote:
> On Tue, Aug 2, 2022 at 9:42 AM Kui-Feng Lee <kuifeng@fb.com> wrote:
> >
> > On Mon, 2022-08-01 at 20:30 -0700, Andrii Nakryiko wrote:
> > > On Mon, Aug 1, 2022 at 4:27 PM Kui-Feng Lee <kuifeng@fb.com>
> > > wrote:
> > > >
> > > > Allow creating an iterator that loops through resources of one
> > > > task/thread.
> > > >
> > > > Previously, people could only create iterators that loop through
> > > > all files, vmas, and tasks in the system, even when they were
> > > > interested only in the resources of a specific task or process.
> > > > By passing the additional parameters, people can now create an
> > > > iterator that goes through all resources or only the resources
> > > > of one task.
> > > >
> > > > Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
> > > > ---
> > > >  include/linux/bpf.h            |  4 ++
> > > >  include/uapi/linux/bpf.h       | 23 +++++++++
> > > >  kernel/bpf/task_iter.c         | 93 ++++++++++++++++++++++++++--------
> > > >  tools/include/uapi/linux/bpf.h | 23 +++++++++
> > > >  4 files changed, 121 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 11950029284f..3c26dbfc9cef 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1718,6 +1718,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> > > >
> > > >  struct bpf_iter_aux_info {
> > > >  	struct bpf_map *map;
> > > > +	struct {
> > > > +		u32 tid;
> > > > +		u8 type;
> > > > +	} task;
> > > >  };
> > > >
> > > > typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > > > diff --git a/include/uapi/linux/bpf.h
> > > > b/include/uapi/linux/bpf.h
> > > > index ffcbf79a556b..ed5ba501609f 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -87,10 +87,33 @@ struct bpf_cgroup_storage_key {
> > > >  	__u32 attach_type;	/* program attach type (enum bpf_attach_type) */
> > > >  };
> > > >
> > > > +enum bpf_task_iter_type {
> > > > +	BPF_TASK_ITER_ALL = 0,
> > > > +	BPF_TASK_ITER_TID,
> > > > +};
> > > > +
> > > >  union bpf_iter_link_info {
> > > >  	struct {
> > > >  		__u32 map_fd;
> > > >  	} map;
> > > > +	/*
> > > > +	 * Parameters of task iterators.
> > > > +	 */
> > > > +	struct {
> > > > +		__u32 pid_fd;
> > >
> > > I was a bit late to the discussion about pidfd vs plain pid. I
> > > think
> > > we should support both in this API. While pid_fd has some nice
> > > guarantees like avoiding the risk of accidental PID reuse, in a
> > > lot
> > > (if not all) cases where task/task_vma/task_file iterators are
> > > going
> > > to be used this is never a risk, because pid will usually come
> > > from
> > > some tracing BPF program (kprobe/tp/fentry/etc), like in case of
> > > profiling, and then will be used by user-space almost immediately
> > > to
> > > query some additional information (fetching relevant vma
> > > information
> > > for profiling use case). So main benefit of pidfd is not that
> > > relevant
> > > for BPF tracing use cases, because PIDs are not going to be
> > > reused so
> > > fast within such a short time frame.
> > >
> > > But pidfd does have downsides. It requires 2 syscalls (pidfd_open
> > > and
> > > close) for each PID, it creates struct file for each such active
> > > pidfd. So it will have non-trivial overhead for high-frequency
> > > BPF
> > > iterator use cases (imagine querying some simple stats for a big
> > > set
> > > of tasks, frequently: you'll spend more time in pidfd syscalls
> > > and
> > > more resources just keeping corresponding struct file open than
> > > actually doing useful BPF work). For simple BPF iter cases it
> > > will
> > > unnecessarily complicate program flow while giving no benefit in
> > > return.
> >
> > The extra syscalls are a good point.
> >
> > >
> > > So I propose we support both in UAPI. Internally either way we
> > > resolve
> > > to plain pid/tid, so this won't cause added maintenance burden.
> > > But
> > > simple cases will stay simple, while more long-lived and/or
> > > complicated ones will still be supported. We then can have
> > > BPF_TASK_ITER_PIDFD vs BPF_TASK_ITER_TID to differentiate whether
> > > the
> > > above __u32 pid_fd (which we should probably rename to something
> > > more
> > > generic like "target") is pid FD or TID/PID. See also below about
> > > TID
> > > vs PID.
> > >
> > > > +		/*
> > > > +		 * The type of the iterator.
> > > > +		 *
> > > > +		 * It can be one of enum bpf_task_iter_type.
> > > > +		 *
> > > > +		 * BPF_TASK_ITER_ALL (default)
> > > > +		 *	The iterator iterates over resources of every process.
> > > > +		 *
> > > > +		 * BPF_TASK_ITER_TID
> > > > +		 *	You should also set *pid_fd* to iterate over one task.
> > >
> > > naming nit: we should decide whether we use TID (thread) and PID
> > > (process) terminology (more usual for user-space) or PID (process
> > > ==
> > > task == user-space thread) and TGID (thread group, i.e. user-
> > > space
> > > process). I haven't investigated much what we use most
> > > consistently,
> > > but curious to hear what others think.
> > >
> > > Also I can see use cases where we want to iterate just the
> > > specified task (i.e., just the specified thread) vs all the tasks
> > > that belong to the same thread group (i.e., all threads within a
> > > process). Naming TBD, but we should have BPF_TASK_ITER_TID and
> > > BPF_TASK_ITER_TGID (or some other naming).
> >
> >
> > I discussed iterators over the resources of all tasks of a process
> > with Yonghong. User code should create an iterator for each thread of
> > the process if necessary. We may add tgid support if it is in high
> > demand.
> >
> > In a discussion about using pidfd, people mentioned extending pidfd to
> > threads if there is a good use case. The same applies to our case.
> > Most of the time, if not always, vmas and files are shared by all
> > threads of a process. So iterating over the resources of every thread
> > of a process has no obvious benefit. The same is true for an iterator
> > over the resources of a specific thread instead of a process.
> >
>
> Ok, so two different points here.
>
> First, TID (thread) vs TGID (process) modes. I'd define TGID mode as:
> a) user specifies some TID and we resolve that to thread group leader
> TID (that is we resolve thread to process), and then iterate all
> threads within that process. For TID (thread) mode, we accept
> specified TID as exactly the thread we iterate (even if it's thread
> group leader, we iterate only that specific thread, not all threads
> in
> a process).
>
> Second, about the point that all threads within a process share vma,
> file table, etc. That's true. But you are forgetting about iter/task
> that is iterating just tasks. TGID mode for such use case is very
> useful. For task_vma/task_file we can probably do the same logic we
> have today where if the thread has the same file table or mm_struct
> as
> thread group leader, we skip such thread when iterating vmas and
> files.
Yes, you are right. An iterator over all tasks in a process is useful.
As we discussed offline, it is worth supporting pidfd, tid, and tgid.
A pidfd would work just like a tgid: the kernel simply translates the
pidfd to a tgid.
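
To make the three target kinds concrete, here is a rough userspace
sketch of which tasks each mode would visit. It is only a toy model, not
kernel code: the TASK_ITER_* names, the task and pidfd tables, and
count_visited() are all made up for illustration, and the final UAPI
naming is still open in this thread. TGID mode takes a TID and resolves
it to the owning process first, as described above; PIDFD mode resolves
the fd to a tgid and then behaves the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode names; the thread leaves the final naming open. */
enum task_iter_mode {
	TASK_ITER_ALL = 0,	/* every task in the system */
	TASK_ITER_TID,		/* exactly one thread */
	TASK_ITER_TGID,		/* every thread of one process */
	TASK_ITER_PIDFD,	/* like TGID, but the target is a pidfd */
};

/* Toy stand-in for a task: thread id plus thread group (process) id. */
struct task_entry {
	uint32_t tid;
	uint32_t tgid;
};

/* Toy stand-in for an open pidfd that refers to a process. */
struct pidfd_entry {
	uint32_t fd;
	uint32_t tgid;
};

static const struct task_entry sample_tasks[] = {
	{ 100, 100 }, { 101, 100 }, { 102, 100 },	/* process 100, 3 threads */
	{ 200, 200 },					/* process 200, 1 thread */
};
static const size_t sample_ntasks = sizeof(sample_tasks) / sizeof(sample_tasks[0]);

static const struct pidfd_entry sample_pidfds[] = { { 7, 100 } };
static const size_t sample_npidfds = sizeof(sample_pidfds) / sizeof(sample_pidfds[0]);

/* Resolve a TID to the TGID of the process it belongs to. */
static bool tid_to_tgid(uint32_t tid, uint32_t *tgid)
{
	assert(tgid != NULL);
	for (size_t i = 0; i < sample_ntasks; i++) {
		if (sample_tasks[i].tid == tid) {
			*tgid = sample_tasks[i].tgid;
			return true;
		}
	}
	return false;
}

/* Resolve a (toy) pidfd to the TGID of the process it refers to. */
static bool pidfd_to_tgid(uint32_t fd, uint32_t *tgid)
{
	assert(tgid != NULL);
	for (size_t i = 0; i < sample_npidfds; i++) {
		if (sample_pidfds[i].fd == fd) {
			*tgid = sample_pidfds[i].tgid;
			return true;
		}
	}
	return false;
}

/* Count the tasks an iterator would visit for the given mode and target. */
static size_t count_visited(enum task_iter_mode mode, uint32_t target)
{
	uint32_t tgid = 0;
	size_t visited = 0;

	/* Both process-wide modes end up with a plain tgid. */
	if (mode == TASK_ITER_TGID && !tid_to_tgid(target, &tgid))
		return 0;
	if (mode == TASK_ITER_PIDFD && !pidfd_to_tgid(target, &tgid))
		return 0;

	for (size_t i = 0; i < sample_ntasks; i++) {
		const struct task_entry *t = &sample_tasks[i];

		switch (mode) {
		case TASK_ITER_ALL:
			visited++;
			break;
		case TASK_ITER_TID:
			visited += (t->tid == target);
			break;
		case TASK_ITER_TGID:
		case TASK_ITER_PIDFD:
			visited += (t->tgid == tgid);
			break;
		}
	}
	return visited;
}
```

With the sample table, count_visited(TASK_ITER_TID, 100) visits only the
group leader itself, while count_visited(TASK_ITER_TGID, 101) visits all
three threads of process 100.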
>
> Thoughts?
>
>
> > >
> > > One might ask why do we need single-task mode if we can always
> > > stop
> > > iteration from BPF program, but this is trivial only for
> > > iter/task,
> > > while for iter/task_vma and iter/task_file it becomes
> > > inconvenient to
> > > detect switch from one task to another. It costs us essentially
> > > nothing to support this mode, so I advocate to do that.
> > >
> > > I have similar thoughts about cgroup iteration modes and actually
> > > supporting cgroup_fd as target for task iterators (which will
> > > mean
> > > iterating tasks belonging to provided cgroup(s)), but I'll reply
> > > on
> > > cgroup iterator patch first, and we can just reuse the same
> > > cgroup
> > > target specification between iter/cgroup and iter/task
> > > afterwards.
> > >
> > >
> > > > +		 */
> > > > +		__u8 type;	/* BPF_TASK_ITER_* */
> > > > +	} task;
> > > >  };
> > > >
> > >
> > > [...]
> >
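
On the task_vma/task_file point quoted above (skipping a thread that
shares its mm_struct or file table with the thread group leader), the
following toy model shows the idea for the vma case. It is a sketch
under assumptions, not the kernel logic: struct toy_task, the integer
mm_id standing in for an mm_struct pointer, and count_vma_walks() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a task; mm_id plays the role of the mm_struct pointer. */
struct toy_task {
	uint32_t tid;
	uint32_t tgid;
	int mm_id;
};

static const struct toy_task toy_tasks[] = {
	{ 100, 100, 1 },	/* group leader of process 100 */
	{ 101, 100, 1 },	/* shares the leader's address space */
	{ 102, 100, 2 },	/* hypothetical thread with its own address space */
	{ 200, 200, 3 },	/* single-threaded process 200 */
};
static const size_t toy_ntasks = sizeof(toy_tasks) / sizeof(toy_tasks[0]);

/* Return the leader (tid == tgid) of the given thread group, or NULL. */
static const struct toy_task *find_leader(uint32_t tgid)
{
	for (size_t i = 0; i < toy_ntasks; i++) {
		if (toy_tasks[i].tgid == tgid && toy_tasks[i].tid == tgid)
			return &toy_tasks[i];
	}
	return NULL;
}

/*
 * Count how many tasks of a process would have their vmas walked when
 * threads that share the leader's address space are skipped.
 */
static size_t count_vma_walks(uint32_t tgid)
{
	const struct toy_task *leader = find_leader(tgid);
	size_t walks = 0;

	if (!leader)
		return 0;

	for (size_t i = 0; i < toy_ntasks; i++) {
		const struct toy_task *t = &toy_tasks[i];

		if (t->tgid != tgid)
			continue;
		/* Same mm as the leader: its vmas are already covered. */
		if (t != leader && t->mm_id == leader->mm_id)
			continue;
		walks++;
	}
	return walks;
}
```

For process 100 in the sample table, the leader and thread 102 are
walked, and thread 101 is skipped because it shares the leader's mm.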