bpf.vger.kernel.org archive mirror
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: David Ahern <dsahern@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrey Ignatov <rdna@fb.com>, Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
Date: Tue, 31 Mar 2020 19:04:04 -0700	[thread overview]
Message-ID: <20200401020404.wbad24dxkv7qr2os@ast-mbp>
In-Reply-To: <95bfd8e0-86b3-cb87-9f06-68a7c1ba7d7a@gmail.com>

On Tue, Mar 31, 2020 at 07:42:46PM -0600, David Ahern wrote:
> On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> >> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>>>
> >>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >>>> a mere 4-patch set - to go in without something as essential as the
> >>>> QUERY for observability.
> >>>
> >>> As I said, 'bpftool cgroup' covers it. Observability is not reduced in any way.
> >>
> >> You want a feature where a process can prevent another from installing a
> >> program on a cgroup. How do I learn which process is holding the
> >> bpf_link reference and preventing me from installing a program? Unless
> >> I have missed some recent change, that is not currently covered by
> >> 'bpftool cgroup', and reading the kernel code will not tell me.
> > 
> > No. That's not the case at all. You misunderstood the concept.
> 
> I don't think so ...
> 
> > 
> >> That is my point. You are restricting what root can do, and people will
> >> not want to resort to killing random processes trying to find the one
> >> holding a reference.
> > 
> > Not true either.
> > bpf_link = old attach with allow_multi (but with extra safety for owner)
> 
> cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
> That's a year for tools like 'ip vrf exec' to exist and be relied on.
> 'ip vrf exec' does not use MULTI.
> 
> I have not done a deep dive on the systemd code, but on an Ubuntu 18.04 system:
> 
> $ sudo ~/bin/bpftool cgroup tree
> CgroupPath
> ID       AttachType      AttachFlags     Name
> /sys/fs/cgroup/unified/system.slice/systemd-udevd.service
>     5        ingress
>     4        egress
> /sys/fs/cgroup/unified/system.slice/systemd-journald.service
>     3        ingress
>     2        egress
> /sys/fs/cgroup/unified/system.slice/systemd-logind.service
>     7        ingress
>     6        egress
> 
> suggests that MULTI has not always been used by systemd either at some
> point in its history, so 'ip vrf exec' is not alone in not using the flag.
> There are most likely many other tools.

Please take a look at the systemd source code:
src/core/bpf-devices.c
src/core/bpf-firewall.c
It prefers to use BPF_F_ALLOW_MULTI when possible, since it's the most
sensible flag. Because 'ip vrf exec' does not use allow_multi, it breaks
several systemd features (regardless of what bpf_link can and cannot do).
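
To make the difference concrete, here is a minimal sketch (not from the
patch set; prog_fd and cgroup_fd are assumed to be valid FDs obtained
elsewhere) of the two legacy attach modes via libbpf:

#include <stdbool.h>
#include <bpf/bpf.h>     /* libbpf's bpf_prog_attach() wrapper */
#include <linux/bpf.h>   /* BPF_F_ALLOW_MULTI, BPF_CGROUP_INET_EGRESS */

int attach_cgroup_prog(int prog_fd, int cgroup_fd, bool multi)
{
	/* flags = 0 (legacy): at most one prog per cgroup/attach-type,
	 * and a later flag-less attach by any cap_net_admin process
	 * (this is what 'ip vrf exec' does) replaces whatever was there.
	 *
	 * BPF_F_ALLOW_MULTI: the prog coexists with other MULTI progs
	 * on the same cgroup/attach-type; this is what systemd prefers.
	 * The kernel refuses to mix the two modes on one cgroup.
	 */
	return bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS,
			       multi ? BPF_F_ALLOW_MULTI : 0);
}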

> > The only thing bpf_link protects is the owner of the link from other
> > processes nuking that link.
> > It does _not_ prevent other processes from attaching their own cgroup-bpf
> > progs either via the old interface or via bpf_link.
> > 
> 
> It does when that older code does not use the MULTI flag. There is a
> history that is going to create conflicts, and being able to identify
> which program holds the bpf_link is essential.
> 
> And this is really just one use case. There are many other reasons for
> wanting to know what process is holding a reference to something.

I'm not disagreeing that it's useful to query what is attached where. My point,
once again, is that bpf_link for cgroups didn't change a single bit in this
logic. There are processes (like systemd) that already use allow_multi. When
they switch to bpf_link a few years from now, nothing will change for the
other processes in the system. Only systemd will be assured that its bpf-device
prog will not be accidentally removed by 'ip vrf'. Currently nothing protects
systemd's bpf progs: any cap_net_admin process can _accidentally_ nuke them.
It's even stranger that the bpf-cgroup-device hook systemd uses is gated by
cap_net_admin; there is nothing networking about it. But that's a separate
discussion.
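
For comparison, the link-based flow looks roughly like this (a sketch
assuming the libbpf API added in patch 3/4 of this series; 'obj' is an
already-loaded bpf_object and the "cgroup/dev" section name is just an
example):

#include <errno.h>
#include <bpf/libbpf.h>

int attach_with_link(struct bpf_object *obj, int cgroup_fd)
{
	struct bpf_program *prog;
	struct bpf_link *link;

	prog = bpf_object__find_program_by_title(obj, "cgroup/dev");
	if (!prog)
		return -ENOENT;

	/* The attachment lives as long as the link FD (or a bpffs pin
	 * of it). Other processes can still attach their own MULTI
	 * progs to the same cgroup, but cannot detach or replace this
	 * one via the legacy BPF_PROG_DETACH/BPF_PROG_ATTACH commands.
	 */
	link = bpf_program__attach_cgroup(prog, cgroup_fd);
	if (libbpf_get_error(link))
		return (int)libbpf_get_error(link);

	/* bpf_link__destroy(link) detaches the prog when done */
	return 0;
}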

Maybe you should fix 'ip vrf' first, before systemd folks start yelling, and
then we can continue arguing about the merits of observability?
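
For reference, the per-attach-type query behind 'bpftool cgroup' boils
down to BPF_PROG_QUERY. A minimal sketch using libbpf's wrapper,
assuming cgroup_fd is an open cgroup directory FD:

#include <stdio.h>
#include <bpf/bpf.h>
#include <linux/bpf.h>

int list_egress_progs(int cgroup_fd)
{
	__u32 prog_ids[64], prog_cnt = 64, attach_flags = 0;
	__u32 i;
	int err;

	/* fills prog_ids with the IDs of progs attached at this cgroup
	 * for the given attach type; prog_cnt is in/out */
	err = bpf_prog_query(cgroup_fd, BPF_CGROUP_INET_EGRESS, 0,
			     &attach_flags, prog_ids, &prog_cnt);
	if (err)
		return err;

	for (i = 0; i < prog_cnt; i++)
		printf("egress prog id %u, attach flags 0x%x\n",
		       prog_ids[i], attach_flags);
	return 0;
}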


Thread overview: 29+ messages
2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
2020-03-31  0:05   ` Andrey Ignatov
2020-03-31  0:38     ` Alexei Starovoitov
2020-03-31  1:06       ` Andrii Nakryiko
2020-03-30  2:59 ` [PATCH v3 bpf-next 2/4] bpf: implement bpf_prog replacement for an active bpf_cgroup_link Andrii Nakryiko
2020-03-30  3:00 ` [PATCH v3 bpf-next 3/4] libbpf: add support for bpf_link-based cgroup attachment Andrii Nakryiko
2020-03-30  3:00 ` [PATCH v3 bpf-next 4/4] selftests/bpf: test FD-based " Andrii Nakryiko
2020-03-30 14:49 ` [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link David Ahern
2020-03-30 20:20   ` Andrii Nakryiko
2020-03-30 20:45     ` David Ahern
2020-03-30 20:50       ` Alexei Starovoitov
2020-03-30 22:50       ` Alexei Starovoitov
2020-03-30 23:43         ` David Ahern
2020-03-31  0:32           ` Alexei Starovoitov
2020-03-31  0:57             ` David Ahern
2020-03-31  1:17               ` Alexei Starovoitov
2020-04-01  1:42                 ` David Ahern
2020-04-01  2:04                   ` Alexei Starovoitov [this message]
2020-03-31  3:54               ` Andrii Nakryiko
2020-03-31 16:54                 ` David Ahern
2020-03-31 17:03                   ` Andrii Nakryiko
2020-03-31 17:11                   ` Alexei Starovoitov
2020-03-31 21:51                 ` Edward Cree
2020-03-31 22:44                   ` David Ahern
2020-04-01  0:45                     ` Andrii Nakryiko
2020-04-01  0:22                   ` Andrii Nakryiko
2020-04-01 14:26                     ` Edward Cree
2020-04-01 17:39                       ` Andrii Nakryiko
