From: Kui-Feng Lee <kuifeng@fb.com>
To: "andrii.nakryiko@gmail.com" <andrii.nakryiko@gmail.com>
Cc: "daniel@iogearbox.net" <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>, Yonghong Song <yhs@fb.com>,
	"ast@kernel.org" <ast@kernel.org>,
	"andrii@kernel.org" <andrii@kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/3] bpf: Parameterize task iterators.
Date: Tue, 2 Aug 2022 16:42:35 +0000	[thread overview]
Message-ID: <9ab00aa58259d9dd7b45fdf860423e86612b591d.camel@fb.com> (raw)
In-Reply-To: <CAEf4BzZqwoCecuUTe=LGBBrTWMp_bCttik1fkmRF1rBXxBYPAw@mail.gmail.com>

On Mon, 2022-08-01 at 20:30 -0700, Andrii Nakryiko wrote:
> On Mon, Aug 1, 2022 at 4:27 PM Kui-Feng Lee <kuifeng@fb.com> wrote:
> > 
> > Allow creating an iterator that loops through resources of one
> > task/thread.
> > 
> > Previously, people could only create iterators to loop through all
> > the files, vma, and tasks in the system, even if they were
> > interested only in the resources of a specific task or process.  By
> > passing the additional parameters, people can now create an
> > iterator to go through all resources or only the resources of one
> > task.
> > 
> > Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
> > ---
> >  include/linux/bpf.h            |  4 ++
> >  include/uapi/linux/bpf.h       | 23 +++++++++
> >  kernel/bpf/task_iter.c         | 93 ++++++++++++++++++++++++++--------
> >  tools/include/uapi/linux/bpf.h | 23 +++++++++
> >  4 files changed, 121 insertions(+), 22 deletions(-)
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 11950029284f..3c26dbfc9cef 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1718,6 +1718,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> > 
> >  struct bpf_iter_aux_info {
> >         struct bpf_map *map;
> > +       struct {
> > +               u32     tid;
> > +               u8      type;
> > +       } task;
> >  };
> > 
> >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index ffcbf79a556b..ed5ba501609f 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -87,10 +87,33 @@ struct bpf_cgroup_storage_key {
> >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> >  };
> > 
> > +enum bpf_task_iter_type {
> > +       BPF_TASK_ITER_ALL = 0,
> > +       BPF_TASK_ITER_TID,
> > +};
> > +
> >  union bpf_iter_link_info {
> >         struct {
> >                 __u32   map_fd;
> >         } map;
> > +       /*
> > +        * Parameters of task iterators.
> > +        */
> > +       struct {
> > +               __u32   pid_fd;
> 
> I was a bit late to the discussion about pidfd vs plain pid. I think
> we should support both in this API. While pid_fd has some nice
> guarantees like avoiding the risk of accidental PID reuse, in a lot
> (if not all) cases where task/task_vma/task_file iterators are going
> to be used this is never a risk, because pid will usually come from
> some tracing BPF program (kprobe/tp/fentry/etc), like in case of
> profiling, and then will be used by user-space almost immediately to
> query some additional information (fetching relevant vma information
> for the profiling use case). So the main benefit of pidfd is not
> that relevant for BPF tracing use cases, because PIDs are not going
> to be reused that quickly within such a short time frame.
> 
> But pidfd does have downsides. It requires 2 syscalls (pidfd_open and
> close) for each PID, it creates struct file for each such active
> pidfd. So it will have non-trivial overhead for high-frequency BPF
> iterator use cases (imagine querying some simple stats for a big set
> of tasks, frequently: you'll spend more time in pidfd syscalls and
> more resources just keeping corresponding struct file open than
> actually doing useful BPF work). For simple BPF iter cases it would
> unnecessarily complicate the program flow while providing no benefit
> in return.

That is a good point about the extra syscalls.

> 
> So I propose we support both in the UAPI. Internally, either way we
> resolve to a plain pid/tid, so this won't add a maintenance burden.
> But simple cases will stay simple, while more long-lived and/or
> complicated ones will still be supported. We can then have
> BPF_TASK_ITER_PIDFD vs BPF_TASK_ITER_TID to differentiate whether the
> above __u32 pid_fd (which we should probably rename to something more
> generic like "target") is pid FD or TID/PID. See also below about TID
> vs PID.
> 
> > +               /*
> > +                * The type of the iterator.
> > +                *
> > +                * It can be one of enum bpf_task_iter_type.
> > +                *
> > +                * BPF_TASK_ITER_ALL (default)
> > +                *      The iterator iterates over resources of
> > +                *      every process.
> > +                *
> > +                * BPF_TASK_ITER_TID
> > +                *      You should also set *pid_fd* to iterate over one task.
> 
> naming nit: we should decide whether we use TID (thread) and PID
> (process) terminology (more usual for user-space) or PID (process ==
> task == user-space thread) and TGID (thread group, i.e. user-space
> process). I haven't investigated which we use most consistently,
> but I'm curious to hear what others think.
> 
> Also I can see use cases where we want to iterate over just the
> specified task (i.e., just the specified thread) vs. all the tasks
> that belong to the same process group (i.e., threads within a
> process). Naming TBD, but we should have BPF_TASK_ITER_TID and
> BPF_TASK_ITER_TGID (or some other naming).


I discussed iterators over the resources of all tasks of a process
with Yonghong.  User code should create an iterator for each thread of
the process if necessary.  We may add support for tgid if it is highly
demanded.

In the discussion of using pidfd, people mentioned extending pidfd to
threads if there is a good use case.  That also applies to our case.
Most of the time, if not always, vma & files are shared by all threads
of a process, so iterating over all the resources of every thread of a
process brings no obvious benefit.  The same is true for an iterator
over the resources of a specific thread instead of a process.

> 
> One might ask why do we need single-task mode if we can always stop
> iteration from BPF program, but this is trivial only for iter/task,
> while for iter/task_vma and iter/task_file it becomes inconvenient to
> detect a switch from one task to another. It costs us essentially
> nothing to support this mode, so I advocate doing that.
> 
> I have similar thoughts about cgroup iteration modes and actually
> supporting cgroup_fd as target for task iterators (which will mean
> iterating tasks belonging to provided cgroup(s)), but I'll reply on
> cgroup iterator patch first, and we can just reuse the same cgroup
> target specification between iter/cgroup and iter/task afterwards.
> 
> 
> > +                */
> > +               __u8    type;   /* BPF_TASK_ITER_* */
> > +       } task;
> >  };
> > 
> 
> [...]


Thread overview: 12+ messages
2022-08-01 23:26 [PATCH bpf-next v2 0/3] Parameterize task iterators Kui-Feng Lee
2022-08-01 23:26 ` [PATCH bpf-next v2 1/3] bpf: " Kui-Feng Lee
2022-08-02  1:49   ` Alexei Starovoitov
2022-08-02 16:47     ` Kui-Feng Lee
2022-08-02 21:19       ` Andrii Nakryiko
2022-08-02  3:30   ` Andrii Nakryiko
2022-08-02 16:42     ` Kui-Feng Lee [this message]
2022-08-02 21:17       ` Andrii Nakryiko
2022-08-04 23:05         ` Kui-Feng Lee
2022-08-01 23:26 ` [PATCH bpf-next v2 2/3] bpf: Handle bpf_link_info for the parameterized task BPF iterators Kui-Feng Lee
2022-08-01 23:26 ` [PATCH bpf-next v2 3/3] selftests/bpf: Test " Kui-Feng Lee
2022-08-01 23:35 ` [PATCH bpf-next v2 0/3] Parameterize task iterators Kui-Feng Lee
