linux-kernel.vger.kernel.org archive mirror
From: Hao Luo <haoluo@google.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>, KP Singh <kpsingh@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Joe Burton <jevburton.kernel@gmail.com>,
	Tejun Heo <tj@kernel.org>, Josh Don <joshdon@google.com>,
	Stanislav Fomichev <sdf@google.com>, bpf <bpf@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
Date: Tue, 8 Mar 2022 13:08:39 -0800
Message-ID: <CA+khW7iQ6w99pB+kodXheJDo5nAZ6wxZiaWtt08xKQETs=uJFg@mail.gmail.com>
In-Reply-To: <CAADnVQ+-9DAuqj3jLvnwPn0PwuRnfSZ4niDOPqOaF+SH-_+P8A@mail.gmail.com>

On Sat, Mar 5, 2022 at 3:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@google.com> wrote:
> >
> > I gave this question more thought. We don't need to bind mount the top
> > bpffs into the container, instead, we may be able to overlay a bpffs
> > directory into the container. Here is the workflow in my mind:
>
> I don't quite follow what you mean by 'overlay' here.
> Another bpffs mount or future overlayfs that supports bpffs?
>
> > For each job, let's say A, the container runtime can create a
> > directory in bpffs, for example
> >
> >   /sys/fs/bpf/jobs/A
> >
> > and then create the cgroup for A. The sleepable tracing prog will
> > create the file:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > 100 is the created cgroup's id. Then the container runtime overlays
> > the bpffs directory into container A in the same path:
>
> Why cgroup id ? Wouldn't it be easier to use the same cgroup name
> as in cgroupfs ?
>

A cgroup name isn't unique (cgroups under different parents can share
a name), and we don't need the cgroup hierarchy information here. We
can use a library function to translate a cgroup path to a cgroup id;
see get_cgroup_id() in patch 9/9. It works fine in the selftest.
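
As a rough illustration of the path-to-id translation (this is not the
get_cgroup_id() helper from patch 9/9, which is not reproduced here), the
sketch below assumes cgroup v2 on a 64-bit kernel, where the cgroup id
matches the inode number of the cgroup's directory. The function names
cgroup_id_of and stats_path_for are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: cgroup v2 on a 64-bit kernel, where the cgroup id
# equals the inode number of the cgroup directory. Both helper names below
# are hypothetical and not part of the patch series.

# Print the inode number of the given cgroup directory.
cgroup_id_of() {
    stat -c %i -- "$1"
}

# Build the per-cgroup stats path discussed in this thread:
#   /sys/fs/bpf/jobs/<job>/<cgroup id>/stats
stats_path_for() {
    job=$1
    cgroup_dir=$2
    id=$(cgroup_id_of "$cgroup_dir") || return 1
    printf '/sys/fs/bpf/jobs/%s/%s/stats\n' "$job" "$id"
}
```

For example, `stats_path_for A /sys/fs/cgroup/A` would print the stats path
for job A's cgroup, provided that cgroup directory exists.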

> >   [A's container path]/sys/fs/bpf/jobs/A.
> >
> > A can see the stats at the path within its mount ns:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > When A creates cgroup, it is able to write to the top layer of the
> > overlayed directory. So it is
> >
> >   /sys/fs/bpf/jobs/A/101/stats
> >
> > Some of my thoughts:
> >   1. Compared to bind mount top bpffs into container, overlaying a
> > directory avoids exposing other jobs' stats. This gives better
> > isolation. I already have a patch for supporting laying bpffs over
> > other fs, it's not too hard.
>
> So it's overlayfs combination of bpffs and something like ext4, right?
> I thought you found out that overlaryfs has to be upper fs
> and lower fs shouldn't be modified underneath.
> So if bpffs is a lower fs the writes into it should go
> through the upper overlayfs, right?
>

It's overlayfs combining bpffs and ext4. Bpffs is the upper layer. The
lower layer is an empty ext4 directory. The merged directory is a
directory in the container.
The upper layer contains bpf objects that we want to expose to the
container, for example, the sleepable tracing progs and the iter link
for reading stats. Only the merged directory is visible to the
container and all the updates go through the merged directory.

Here is an example of the workflow I have in mind:

Step 1: We first set up directories and bpf objects needed by containers.

[# ~] ls /sys/fs/bpf/container/upper
tracing_prog   iter_link
[# ~] ls /sys/fs/bpf/container/work
[# ~] ls /container
root   lower
[# ~] ls /container/root
bpf
[# ~] ls /container/root/bpf

Step 2: Use overlayfs to mount a directory from bpffs into the container's root.

[# ~] mkdir -p /container/lower
[# ~] mkdir -p /sys/fs/bpf/container/work
[# ~] mount -t overlay overlay -o \
 lowerdir=/container/lower,\
 upperdir=/sys/fs/bpf/container/upper,\
 workdir=/sys/fs/bpf/container/work \
  /container/root/bpf
[# ~] ls /container/root/bpf
tracing_prog    iter_link

Step 3: Pivot root for the container (shown here with chroot). We expect to
see the bpf objects mapped into the container:

[# ~] chroot /container/root
[# ~] ls /
bpf
[# ~] ls /bpf
tracing_prog   iter_link

Note:

- I haven't tested Step 3 yet, but Steps 1 and 2 seem to work as
expected. I am still testing how the bpf objects behave after we
enter the container.

- Only a directory in bpffs is mapped into the container, not the top
bpffs. The path is uniform in all containers, that is, /bpf. The
container should be able to mkdir in /bpf, etc.
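
To keep the paths in Step 2 consistent, the mount command can be assembled
in one place. The following is a minimal sketch with hypothetical helper
names (overlay_mount_cmd, is_overlay_mount). It only prints the command
rather than running it, since using bpffs as the upper layer depends on the
not-yet-posted patch mentioned earlier and may be rejected by a stock kernel.

```shell
#!/bin/sh
# Sketch of Step 2 as reusable helpers. All function names are hypothetical.

# Print (do not run) the overlay mount command for one container.
#   $1 lower dir   (the empty ext4 directory)
#   $2 upper dir   (in bpffs, holds the bpf objects)
#   $3 work dir    (must be on the same filesystem as the upper dir)
#   $4 mount point (inside the container's root)
overlay_mount_cmd() {
    printf 'mount -t overlay overlay -o lowerdir=%s,upperdir=%s,workdir=%s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Succeed if $1 is an overlay mount point according to a mounts table
# ($2, defaulting to /proc/mounts). Fields: device, mount point, fs type, ...
is_overlay_mount() {
    awk -v target="$1" '
        $2 == target && $3 == "overlay" { found = 1 }
        END { exit found ? 0 : 1 }
    ' "${2:-/proc/mounts}"
}
```

After running the printed command as root, `is_overlay_mount
/container/root/bpf` gives a quick check that the merged directory is in
place before entering the container.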

> >   2. Once the container runtime has overlayed directory into the
> > container, it has no need to create more cgroups for this job. It
> > doesn't need to track the stats of job-created cgroups, which are
> > mainly for inspection by the job itself. Even if it needs to collect
> > the stats from those cgroups, it can read from the path in the
> > container.
> >   3. The overlay path in container doesn't have to be exactly the same
> > as the path in root mount ns. In the sleepable tracing prog, we may
> > select paths based on current process's ns. If we choose to do this,
> > we can further avoid exposing cgroup id and job name to the container.
>
> The benefits make sense.

