Re: [PATCH v2 bpf-next 00/18] BPF token

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	Maryam Tahhan <mtahhan@redhat.com>,
	 Andrii Nakryiko <andrii@kernel.org>,
	bpf@vger.kernel.org,  linux-security-module@vger.kernel.org,
	Kees Cook <keescook@chromium.org>,
	 Christian Brauner <brauner@kernel.org>,
	lennart@poettering.net, cyphar@cyphar.com,  kernel-team@meta.com
Subject: Re: [PATCH v2 bpf-next 00/18] BPF token
Date: Mon, 26 Jun 2023 15:31:30 -0700
Message-ID: <CAEf4BzY5UWLCjDiQ_pfCeKMVJScdk7B4ZaKwi=yaf8ACnaOXLg@mail.gmail.com>
In-Reply-To: <fe47aeb6-dae8-43a6-bcb0-ada2ebf62e08@app.fastmail.com>

On Sat, Jun 24, 2023 at 7:00 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
> > On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> >> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> >>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> >>>
> >>>> Hopefully you can see where I'm going with this. And this is just one
> >>>> random tiny example. We can think up tons of other cases to prove BPF
> >>>> is not isolatable to any sort of "container".
> >>>
> >>> No.  You have not come up with an example of why BPF is not isolatable
> >>> to a container.  You have come up with an example of why binding to a
> >>> sched_switch raw tracepoint does not make sense in a container without
> >>> additional mechanisms to give it well defined functionality and
> >>> appropriate security.
> >
> > One big blocker for the case of BPF is not isolatable to a container are
> > CPU hardware bugs. There has been plenty of mitigation effort so that the
> > flexibility cannot be abused as a tool e.g. discussed in [0], but ultimately
> > it's a cat and mouse game and vendors are also not really transparent. So
> > actual reasonable discussion can be resumed once CPU vendors get their
> > stuff fixed.
> >
> >    [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks
> >
>
> By this standard, shouldn’t we just give up?  Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control.
>
> Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security.
>
> >> Thinking about this some more:
> >>
> >> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.
> >
> > Agree that proxy is a mess for various reasons stated earlier.
> >
> >> So here are a couple of possible solutions:
> >>
> >> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.
> >
> > I don't think it's very practical, meaning the vast majority of applications
> > out there today are tightly coupled BPF code + user space application, and in
> > a lot of cases programs are dynamically created. This would require somehow
> > splitting up parts of your application to run outside the container in hostns
> > and other parts inside the container.. for the sake of the mentioned example
> > it's something fairly static, but real-world applications look different and
> > are much more complex.
> >
>
> It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it.
>
> And yet someone still wants to run it in a userns.
>

Yes, to get all the other benefits of userns. Yes, BPF isolation
cannot be enforced and we rely on a human-driven process to decide
whether it's ok to run BPF inside each specific container. But why
can't we also get all the other benefits of userns outside of BPF
usage?

BPF parts are critical for such applications, but they also normally
have a huge user-space part, and use large common libraries, so there
is a lot of benefit to having as much userns-provided isolation as
possible.

> This is IMO a rather bizarre situation.
>
> If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc.  After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything.  This ability, as described, isn’t “the right to use BPF.”  It is the right to *read all secrets*, intentionally.  (And modify them, with bpf_probe_write_user, possibly subject to some constraints.)

What makes you think this is not how it's actually done in practice
already (except right now we don't have BPF token, so it's
all-or-nothing, userns or not, root or not, which is overall worse than
what we'll get with BPF token + userns)?

Audit, code review, proper development practices. Then discussions and
reviews between the team running the container manager and the team with
the BPF-based workload to decide whether it's safe to allow BPF access
(and to what degree) and how the teams will maintain privacy and safety
obligations.

>
>
> If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures.  I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first.

If we had dozens of teams developing and loading/unloading their
custom kernel modules all the time, it might not have sounded so
ridiculous?

>
> >> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.
> >
> > Same as above. Programs are in most cases very tightly coupled to the
> > application itself. I'm not sure if the ask is to redesign/implement all the
> > existing user space infra.
> >
> >> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> >>
> >> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.
> >
> > Worst case, sure, but it's not the point. These containers which would
> > receive the tokens are part of your trusted compute base.. so it's up to the
> > specific applications and their surrounding infrastructure with regards to what
> > problem they solve where and approved by operators/platform engs to deploy in
> > your cluster. I don't particularly see that there's a performance problem. Andrii
> > specifically mentioned /trusted unprivileged applications/.

Yep, performance is not why this is being done.

> >
> >> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
> >>
> >> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
> >>
> >> --Andy
> >>