Re: [RFC PATCH bpf-next 4/8] bpf: support GET_FD_BY_ID and GET_NEXT_ID for bpf_link

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [RFC PATCH bpf-next 4/8] bpf: support GET_FD_BY_ID and GET_NEXT_ID for bpf_link
Date: Wed, 8 Apr 2020 13:23:09 -0700
Message-ID: <CAEf4BzaiRYMc4QMjz8bEn1bgiSXZvW_e2N48-kTR4Fqgog2fBg@mail.gmail.com>
In-Reply-To: <877dyq80x8.fsf@toke.dk>

On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andriin@fb.com> writes:
> >>
> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
> >> > done as the very last step in finalizing bpf_link, together with installing
> >> > FD. This approach allows users of bpf_link in kernel code to not worry about
> >> > races between user-space and kernel code that hasn't finished attaching and
> >> > initializing bpf_link.
> >> >
> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
> >> > querying bpf_link information (implemented later in the series) will be
> >> > allowed.
> >>
> >> I must admit I remain sceptical about this model of restricting access
> >> without any of the regular override mechanisms (for instance, enforcing
> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
> >> keep saying there would be 'some' override mechanism, I think it would
> >> be helpful if you could just include that so we can see the full
> >> mechanism in context.
> >
> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.
> >
> > One way to go about this is to allow creating writable bpf_link for
> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
> > operation on writable links, same as we do with LINK_UPDATE here.
> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
> > cgroup dying: it will detach bpf_link, but will leave it alive until
> > last FD is closed.
>
> Yup, I think this would be a reasonable way to implement the override
> mechanism - it would ensure 'full root' users (like a root shell) can
> remove attachments, while still preventing applications from doing so by
> limiting their capabilities.

So I did some experiments and I think I want to keep GET_FD_BY_ID for
bpf_link returning only read-only bpf_links. After that, one can pin
the bpf_link temporarily and re-open it as a writable one, provided the
CAP_DAC_OVERRIDE capability is present. All of that works already,
because a pinned bpf_link is just a file, so one can do fchmod on it,
and it all goes through the normal file access permission check code
path. Unfortunately, just re-opening the same FD as writable (which
would be possible if fcntl(fd, F_SETFL, S_IRUSR | S_IWUSR) was
supported on Linux) without pinning is not possible. Opening the link
from /proc/<pid>/fd/<link-fd> doesn't seem to work either, because the
backing inode is not a BPF FS inode. I'm not sure, but maybe we can
support the latter eventually. Either way, given this is meant for
manual troubleshooting, I think going through a few extra hoops to
force-detach a bpf_link is actually a good thing.
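
To illustrate the F_SETFL limitation mentioned above, here is a minimal
C sketch that uses an ordinary file rather than a bpf_link, so it needs
no BPF support at all. The helper names are made up for illustration,
and using O_RDWR as the flag is my assumption about what an in-place
upgrade request would look like. It assumes Linux with /proc mounted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the access mode (O_RDONLY, O_WRONLY or O_RDWR) of fd, or -1. */
static int access_mode(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	return flags < 0 ? -1 : (flags & O_ACCMODE);
}

/*
 * Ask F_SETFL to upgrade fd to read-write. Linux ignores the access mode
 * bits in the argument, so the call succeeds but the mode is unchanged.
 * Returns the access mode observed afterwards, or -1 on error.
 */
static int try_upgrade_in_place(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFL, (flags & ~O_ACCMODE) | O_RDWR) < 0)
		return -1;
	return access_mode(fd);
}

/*
 * Re-open fd through /proc/self/fd with new flags. This is a fresh open()
 * on the backing inode, so the usual file permission checks apply.
 * Returns a new FD, or -1 on error.
 */
static int reopen_via_proc(int fd, int flags)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, flags);
}
```

For a read-only FD on a regular file, try_upgrade_in_place() should
still report O_RDONLY, while reopen_via_proc(fd, O_RDWR) should return
a new FD with O_RDWR access, provided the file permissions allow it.
That second path is the one that doesn't seem to work for a bpf_link FD.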

>
> Extending on the concept of RO/RW bpf_link attachments, maybe it should
> even be possible for an application to choose which mode it wants to pin
> its fd in? With the same capability being able to override it of
> course...

Isn't that what patch #2 is doing?... There are a few bugs in the
implementation currently, but it will work in the final version.

>
> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can
> > be disabled for majority of real-life applications to prevent them
> > from doing this. If every realistic application has/needs
> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
> > get writable bpf_link and do anything with it.
>
> I poked around a bit, and looking at the sandboxing configurations
> shipped with various daemons in their systemd unit files, it appears
> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they
> have to be able to read /etc/shadow (which is installed as chmod 0). If
> this is really the case, that would indicate it's not a widely needed
> capability; but I wouldn't exactly say that I've done a comprehensive
> survey, so probably a good idea for you to check your users as well :)

Right, it might not be possible to drop it for all applications right
away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is the
one that's absolutely necessary to work with BPF.
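
For anyone wanting to check what a given process actually holds, a
rough C sketch along these lines reads the effective capability mask
from /proc/self/status and tests individual bits. The helper names are
hypothetical; the two bit numbers are the values from
<linux/capability.h>, repeated locally so the sketch stands alone.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit numbers as defined in <linux/capability.h>. */
#define CAP_DAC_OVERRIDE_BIT 1
#define CAP_SYS_ADMIN_BIT 21

/* Returns 1 if capability number cap is set in mask, 0 otherwise. */
static int mask_has_cap(uint64_t mask, int cap)
{
	return (int)((mask >> cap) & 1);
}

/*
 * Parse the "CapEff:" line of /proc/self/status into *mask.
 * Returns 0 on success, -1 if the file or the line is missing.
 */
static int read_effective_caps(uint64_t *mask)
{
	char line[256];
	unsigned long long val;
	int ret = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "CapEff: %llx", &val) == 1) {
			*mask = val;
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

After read_effective_caps(&mask) succeeds, calling
mask_has_cap(mask, CAP_DAC_OVERRIDE_BIT) and
mask_has_cap(mask, CAP_SYS_ADMIN_BIT) tells the two cases apart for the
current process.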

>
> -Toke
>