linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hao Luo <haoluo@google.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>, KP Singh <kpsingh@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Joe Burton <jevburton.kernel@gmail.com>,
	Stanislav Fomichev <sdf@google.com>, bpf <bpf@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
Date: Fri, 4 Feb 2022 10:26:59 -0800	[thread overview]
Message-ID: <CA+khW7i+TScwPZ6-rcFKiXtxMm8hiZYJGH-wYb=7jBvDWg8pJQ@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQLZZ3SM2CDxnzgOnDgRtGU7+6wT9u5v4oFas5MnZF6DsQ@mail.gmail.com>

On Thu, Feb 3, 2022 at 7:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 2:46 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > > > +
> > > > +SEC("iter/cgroup_view")
> > > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > > > +{
> > > > +     struct seq_file *seq = ctx->meta->seq;
> > > > +     struct cgroup *cgroup = ctx->cgroup;
> > > > +     struct wait_lat *lat;
> > > > +     u64 id;
> > > > +
> > > > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > > > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
> > >
> > > Looks like "id = cgroup->kn->id" assignment is missing here?
> > >
> >
> > Ah, yes. I'll fix it.
> >
> > > Thanks a lot for this test. It explains the motivation well.
> > >
> > > It seems that the patches 1-4 are there to automatically
> > > supply cgroup pointer into bpf_iter__cgroup_view.
> > >
> > > Since user space needs to track good part of cgroup dir opreations
> > > can we task it with the job of patches 1-4 as well?
> > > It can register notifier for cgroupfs operations and
> > > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> > > with corresponding cgroup_id.
> > > Ideally there is no new 'view' program and it's a subset of 'iter'
> > > bpf program. They're already parametrizable.
> > > When 'iter' is pinned the user space can tell it which object it should
> > > iterate on. The 'view' will be an interator of one element and
> > > argument to it can be cgroup_id.
> > > When user space pins the same 'view' program in a newly created bpffs
> > > directory it will parametrize it with a different cgroup_id.
> > > At the end the same 'view' program will be pinned in multiple directories
> > > with different cgroup_id arguments.
> > > This patch 5 will look very much the same, but patches 1-4 will not be
> > > necessary.
> > > Of course there are races between cgroup create/destroy and bpffs
> > > mkdir, prog pin operatiosn, but they will be there regardless.
> > > The patch 1-4 approach is not race free either.
> >
> > Right. I tried to minimize the races between cgroupfs and bpffs in
> > this patchset. The cgroup and kernfs APIs called in this patchset
> > guarantee that the cgroup and kernfs objects are alive once get. Some
> > states in the objects such as 'id' should be valid at least.
> >
> > > Will that work?
> >
> > Thanks Alexei for the idea.
> >
> > The parameterization part sounds good. By 'parametrize', do you mean a
> > variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
> > [1])? or some metadata of the program? I assume it's program's
> > metadata. Either parameterizing with cgroup_id or passing cgroup
> > object to the prog should work. The problem is at pinning.
>
> The bpf_iter_link_info is used to parametrize the iterator.
> The map iterator will iterate the given map_fd.
> iirc pinning is not parameterizable yet,
> but that's not difficult to add.
>

I can take a look at that. This will be useful in our use case.

>
> > In our use case, we can't ask the users who create cgroups to do the
> > pinning. Pinning requires root privilege. In our use case, we have
> > non-root users who can create cgroup directories and still want to
> > read bpf stats. They can't do pinning by themselves. This is why
> > inheritance is a requirement for us. With inheritance, they only need
> > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > operation is required. Patch 1-4 are needed to implement inheritance.
> >
> > It's also not a good idea in our use case to add a userspace
> > privileged process to monitor cgroupfs operations and perform the
> > pinning. It's more complex and has a higher maintenance cost and
> > runtime overhead, compared to the solution of asking whoever makes
> > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > the data center that don't have the userspace process deployed, the
> > stats will be unavailable, which is a no-no for some of our users.
>
> The commit log says that there will be a daemon that does that
> monitoring of cgroupfs. And that daemon needs to mkdir
> directories in bpffs when a new cgroup is created, no?
> The kernel is only doing inheritance of bpf progs into
> new dirs. I think that daemon can pin as well.
>
> The cgroup creation is typically managed by an agent like systemd.
> Sounds like you have your own agent that creates cgroups?
> If so it has to be privileged and it can mkdir in bpffs and pin too ?

Ah, yes, we have our own daemon to manage cgroups. That daemon creates
the top-level cgroup for each job to run inside. However, the job can
create its own cgroups inside the top-level cgroup, for fine grained
resource control. This doesn't go through the daemon. The job-created
cgroups don't have the pinned objects and this is a no-no for our
users.

  reply	other threads:[~2022-02-04 18:27 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-01 20:55 [PATCH RFC bpf-next v2 0/5] Extend cgroup interface with bpf Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view Hao Luo
2022-02-01 20:55 ` [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link Hao Luo
2022-02-03 18:04   ` Alexei Starovoitov
2022-02-03 22:46     ` Hao Luo
2022-02-04  3:33       ` Alexei Starovoitov
2022-02-04 18:26         ` Hao Luo [this message]
2022-02-06  4:29           ` Alexei Starovoitov
2022-02-08 20:07             ` Hao Luo
2022-02-08 21:20               ` Alexei Starovoitov
2022-02-08 21:34                 ` Hao Luo
2022-02-14 18:29                 ` Hao Luo
2022-02-14 19:24                   ` Alexei Starovoitov
2022-02-14 20:23                     ` Hao Luo
2022-02-14 20:27                       ` Alexei Starovoitov
2022-02-14 20:39                         ` Hao Luo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+khW7i+TScwPZ6-rcFKiXtxMm8hiZYJGH-wYb=7jBvDWg8pJQ@mail.gmail.com' \
    --to=haoluo@google.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=jevburton.kernel@gmail.com \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=shakeelb@google.com \
    --cc=songliubraving@fb.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).