Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: "Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Alexei Starovoitov" <ast@fb.com>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii.nakryiko@gmail.com>,
	"Andrii Nakryiko" <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	"Kernel Team" <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
Date: Wed, 4 Mar 2020 17:07:08 -0800
Message-ID: <20200305010706.dk7zedpyj5pb5jcv@ast-mbp>
In-Reply-To: <20200304132439.6abadbe3@kicinski-fedora-PC1C0HJN>

On Wed, Mar 04, 2020 at 01:24:39PM -0800, Jakub Kicinski wrote:
> On Wed, 4 Mar 2020 12:45:07 -0800 Alexei Starovoitov wrote:
> > On Wed, Mar 04, 2020 at 11:41:58AM -0800, Jakub Kicinski wrote:
> > > On Tue, 3 Mar 2020 20:36:45 -0800 Alexei Starovoitov wrote:  
> > > > > > libxdp can choose to pin it in some libxdp specific location, so other
> > > > > > libxdp-enabled applications can find it in the same location, detach,
> > > > > > replace, modify, but random app that wants to hack an xdp prog won't
> > > > > > be able to mess with it.    
> > > > > 
> > > > > What if that "random app" comes first, and keeps holding on to the link
> > > > > fd? Then the admin essentially has to start killing processes until they
> > > > > find the one that has the device locked, no?    
> > > > 
> > > > Of course not. We have to provide an api to make it easy to discover
> > > > what process holds that link and where it's pinned.  
> > > 
> > > That API to discover ownership would be useful but it's on the BPF side.  
> > 
> > it's on bpf side because it's bpf specific.
> > 
> > > We have netlink notifications in networking world. The application
> > > which doesn't want its program replaced should simply listen to the
> > > netlink notifications and act if something goes wrong.  
> > 
> > instead of locking the bike let's setup a camera and monitor the bike
> > when somebody steals it.
> > and then what? chase the thief and bring the bike back?
> 
> :) Is the bike the BPF program? It's more like thief is stealing our
> parking spot, we still have the program :)

yeah. parking spot is a better analogy.

> Maybe also the thief should not have CAP_ADMIN in the first place?
> And ask a daemon to perform its actions..

a daemon idea keeps coming back in circles.
With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
but xdp, tc, cgroup still don't have the owner concept.
Some people argued that these three need three separate daemons.
Especially since cgroups are mainly managed by systemd plus a container
manager, it's quite different from networking (xdp, tc) where something
like 'networkd' might make sense.
But if you take this line of thought all the way, systemd should be that
single daemon to coordinate attaching to xdp, tc, cgroup because
in many cases cgroup and tc progs have to coordinate the work.
And that's where it's getting gloomy... unless the kernel can provide
a facility so that a central daemon is not necessary.

> > current xdp, tc, cgroup apis don't have the concept of the link
> > and owner of that link.
> 
> Why do the attachment points have to have a concept of an owner and 
> not the program itself?

bpf program is an object. That object has an owner or multiple owners.
A user process that holds a pointer to that object is a shared owner.
FD is such a pointer. FD == std::shared_ptr<bpf_prog>.
Holding that pointer guarantees that <bpf_prog> will not disappear,
but it says nothing about whether the program will keep running.
For [ku]probe,tp,fentry,fexit there was always <bpf_link> in the kernel.
It wasn't that formal in the past until Andrii's most recent patches,
but the concept existed for a long time. FD == std::shared_ptr<bpf_link>
connects a kernel object with <bpf_prog>. When that kernel object emits
an event the <bpf_link> guarantees that <bpf_prog> will be executed.
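
To make the shared_ptr analogy above concrete, here is a rough userspace
sketch in C++. It is only a model: BpfProg, LegacyHook, BpfLink, LinkedHook
and the two demo functions are names made up for illustration, not kernel
structures or libbpf APIs, and the "one owner per hook" rule is a
simplifying assumption.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical model of a loaded program; not a kernel structure.
struct BpfProg {
    int runs = 0;                 // how many times the program was executed
    void run() { ++runs; }
};

// Stand-in for a kernel object with a legacy attach API:
// a single slot that any sufficiently privileged process can overwrite.
struct LegacyHook {
    std::shared_ptr<BpfProg> attached;
    void attach(std::shared_ptr<BpfProg> p) { attached = std::move(p); }
    void emit_event() { if (attached) attached->run(); }
};

// Stand-in for bpf_link: the connection itself is an owned object.
struct BpfLink {
    std::shared_ptr<BpfProg> prog;
    explicit BpfLink(std::shared_ptr<BpfProg> p) : prog(std::move(p)) {}
};

// Hook whose attachment only goes away when the last link owner drops it.
struct LinkedHook {
    std::weak_ptr<BpfLink> link;
    std::shared_ptr<BpfLink> attach(std::shared_ptr<BpfProg> p) {
        if (!link.expired()) return nullptr;   // the spot is already owned
        auto l = std::make_shared<BpfLink>(std::move(p));
        link = l;
        return l;
    }
    void emit_event() { if (auto l = link.lock()) l->prog->run(); }
};

// Holding the prog pointer keeps the prog alive, but not running.
int legacy_runs_after_takeover() {
    LegacyHook hook;
    auto ours = std::make_shared<BpfProg>();
    hook.attach(ours);
    hook.emit_event();                          // ours runs
    hook.attach(std::make_shared<BpfProg>());   // another process takes the spot
    hook.emit_event();                          // ours is alive but not executed
    return ours->runs;
}

// Holding the link pointer keeps the prog attached and running.
int linked_runs_after_takeover_attempt() {
    LinkedHook hook;
    auto ours = std::make_shared<BpfProg>();
    auto link = hook.attach(ours);              // we own the link
    hook.emit_event();                          // ours runs
    auto theirs = hook.attach(std::make_shared<BpfProg>());  // refused
    hook.emit_event();                          // ours still runs
    return theirs ? -1 : ours->runs;
}
```

In this model the first function sees the program stop running after the
takeover even though the pointer to it is still valid, while the second one
keeps getting events for as long as the link pointer is held.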

For cgroups we don't have such a concept. We thought that the three attach
modes we introduced (default, allow-override, allow-multi) would cover all use
cases. But in practice it turned out that it only works when there is a central
daemon for _all_ cgroup-bpf progs in the system, otherwise different processes
step on each other. More so, there has to be a central diff-review human
authority, otherwise teams step on each other. That sort-of works within one
org, but doesn't scale.

To avoid making systemd a central place to coordinate attaching xdp, tc, cgroup
progs the kernel has to provide a mechanism for an application to connect a
kernel object with a prog and hold the ownership of that link so that no other
process in the system can break that connection. That kernel object is a cgroup,
qdisc, or netdev. An interesting question comes up when that object disappears.
What to do with the link? Two ways to solve it:
1. make the link hold the object, so it cannot be removed.
2. destroy the link when the object goes away.
Both have pros and cons as I mentioned earlier.
And that's what's to be decided.
I think the truth is somewhere in the middle. The link has to hold the object,
so it doesn't disappear from under it, but get notified on deletion, so the
link can be self-destroyed. From the user's point of view the execution
guarantee is still preserved. The kernel object was removed and the link has one
dangling side. Note this behavior is vastly different from existing xdp, tc,
cgroup behavior where both the object and the bpf prog can be alive, but the
connection is gone and the execution guarantee is broken.
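
The "middle" option above can be modeled in userspace C++ roughly like this.
It is a sketch under assumptions, not the proposed kernel implementation:
KernelObject, Link, Prog, Outcome and remove_object_under_link() are all
hypothetical names, and a plain vector of back-pointers stands in for
whatever notification mechanism the kernel would use.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Prog { int runs = 0; };   // hypothetical stand-in for a bpf prog
struct Link;

// Stand-in for a cgroup / netdev / qdisc.
struct KernelObject {
    std::vector<Link*> links;    // links to notify when the object is removed
    bool removed = false;
    void emit_event();
    void remove();               // user-visible deletion of the object
};

struct Link {
    std::shared_ptr<KernelObject> target;  // keeps the object's memory valid
    std::shared_ptr<Prog> prog;

    Link(std::shared_ptr<KernelObject> t, std::shared_ptr<Prog> p)
        : target(std::move(t)), prog(std::move(p)) {
        target->links.push_back(this);     // register for deletion notice
    }
    Link(const Link&) = delete;
    Link& operator=(const Link&) = delete;
    ~Link() {
        if (target) {                      // still attached: unregister
            auto& v = target->links;
            v.erase(std::remove(v.begin(), v.end(), this), v.end());
        }
    }
    bool dangling() const { return !target; }
};

inline void KernelObject::emit_event() {
    if (removed) return;                   // a removed object emits nothing
    for (Link* l : links) ++l->prog->runs;
}

inline void KernelObject::remove() {
    removed = true;
    std::vector<Link*> notified;
    notified.swap(links);
    // Each link drops its side of the connection. Members are not touched
    // after this loop, since the last reset() may free the object.
    for (Link* l : notified) l->target.reset();
}

struct Outcome { int runs; bool dangling; bool prog_alive; };

Outcome remove_object_under_link() {
    auto cgroup = std::make_shared<KernelObject>();
    auto prog = std::make_shared<Prog>();
    Link link(cgroup, prog);

    cgroup->emit_event();   // prog runs while the object exists
    cgroup->remove();       // object is deleted, link gets notified
    cgroup->emit_event();   // nothing to run: the object is gone
    return {prog->runs, link.dangling(), link.prog != nullptr};
}
```

In the model, the program ran for every event the object emitted while it
existed, and after removal the link is left with one dangling side while its
prog side is still valid.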