Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf

From: Andy Lutomirski <luto@kernel.org>
To: Song Liu <songliubraving@fb.com>, linux-security@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>,
	Lorenz Bauer <lmb@cloudflare.com>, Jann Horn <jannh@google.com>,
	Greg KH <gregkh@linuxfoundation.org>,
	"linux-abi@vger.kernel.org" <linux-abi@vger.kernel.org>,
	"kees@chromium.org" <kees@chromium.org>
Subject: Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
Date: Sat, 29 Jun 2019 17:12:09 -0700	[thread overview]
Message-ID: <CALCETrWBnH4Q43POU8cQ7YMjb9LioK28FDEQf7aHZbdf1eBZWg@mail.gmail.com> (raw)
In-Reply-To: <3C595328-3ABE-4421-9772-8D41094A4F57@fb.com>

On Fri, Jun 28, 2019 at 12:05 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> > On Jun 27, 2019, at 4:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 6/27/19 1:19 PM, Song Liu wrote:
> >> This patch introduce unprivileged BPF access. The access control is
> >> achieved via device /dev/bpf. Users with write access to /dev/bpf are able
> >> to call sys_bpf().
> >> Two ioctl command are added to /dev/bpf:
> >> The two commands enable/disable permission to call sys_bpf() for current
> >> task. This permission is noted by bpf_permitted in task_struct. This
> >> permission is inherited during clone(CLONE_THREAD).
> >> Helper function bpf_capable() is added to check whether the task has got
> >> permission via /dev/bpf.
> >
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 0e079b2298f8..79dc4d641cf3 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -9134,7 +9134,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
> >>              env->insn_aux_data[i].orig_idx = i;
> >>      env->prog = *prog;
> >>      env->ops = bpf_verifier_ops[env->prog->type];
> >> -    is_priv = capable(CAP_SYS_ADMIN);
> >> +    is_priv = bpf_capable(CAP_SYS_ADMIN);
> >
> > Huh?  This isn't a hardening measure -- the "is_priv" verifier mode allows straight-up leaks of private kernel state to user mode.
> >
> > (For that matter, the pending lockdown stuff should possibly consider this a "confidentiality" issue.)
> >
> >
> > I have a bigger issue with this patch, though: it's a really awkward way to pretend to have capabilities.  For bpf, it seems like you could make this be a *real* capability without too much pain since there's only one syscall there.  Just find a way to pass an fd to /dev/bpf into the syscall.  If this means you need a new bpf_with_cap() syscall that takes an extra argument, so be it.  The old bpf() syscall can just translate to bpf_with_cap(..., -1).
> >
> > For a while, I've considered a scheme I call "implicit rights".  There would be a directory in /dev called /dev/implicit_rights.  This would either be part of devtmpfs or a whole new filesystem -- it would *not* be any other filesystem.  The contents would be files that can't be read or written and exist only in memory.  You create them with a privileged syscall.  Certain actions that are sensitive but not at the level of CAP_SYS_ADMIN (use of large-attack-surface bpf stuff, creation of user namespaces, profiling the kernel, etc) could require an "implicit right".  When you do them, if you don't have CAP_SYS_ADMIN, the kernel would do a path walk for, say, /dev/implicit_rights/bpf and, if the object exists, can be opened, and actually refers to the "bpf" rights object, then the action is allowed.  Otherwise it's denied.
> >
> > This is extensible, and it doesn't require the rather ugly per-task state of whether it's enabled.
> >
> > For things like creation of user namespaces, there's an existing API, and the default is that it works without privilege.  Switching it to an implicit right has the benefit of not requiring code changes to programs that already work as non-root.
> >
> > But, for BPF in particular, this type of compatibility issue doesn't exist now.  You already can't use most eBPF functionality without privilege.  New bpf-using programs meant to run without privilege are *new*, so they can use a new improved API.  So, rather than adding this obnoxious ioctl, just make the API explicit, please.
> >
> > Also, please cc: linux-abi next time.
>
> Thanks for your inputs.
>
> I think we need to clarify the use case here. In this case, we are NOT
> thinking about creating new tools for unprivileged users. Instead, we
> would like to use existing tools without root.

I read patch 4, and I interpret it very differently.  Patches 2-4 are
creating a new version of libbpf and a new version of bpftool.  Given
this, I see no real justification for adding a new in-kernel per-task
state instead of just pushing the complexity into libbpf.

> On the kernel side, we are not planning provides a subset of safe
> features for unprivileged users. The permission here is all-or-nothing.

This may be a showstopper.  I think this series needs an extremely
clear explanation of the security implications of providing access to
/dev/bpf.  Is it just exposing more attack surface for kernel bugs, or
is it, *by design*, exposing new privileges.  Given the is_priv change
that I pointed out upthread, it appears to be the latter, and I'm
wondering how this is a reasonable thing to do.

>
> Introducing bpf_with_cap() syscall means we need teach these tools to
> manage the fd, and use the new API when necessary. This is clearly not
> easy.

How hard can it be?  I looked the the libbpf sources, and there are
really very few call sites of sys_bpf().  You could update all of them
with a very small patch.

Also, on a quick survey of kernel/bpf's capable() calls, I feel like
this series may be misguided.  If you want to enable unprivileged or
less privileged use of bpf(), how about going through all of the
capable() calls and handling them one-by-one as appropriate.  Here's a
random survey:

map_freeze(): It looks like the only reason for a capable() call is
that there isn't a clear permission model right now defining who owns
a map and therefore may freeze it.  Could you not use a check along
the lines of capable_wrt_inode_uidgid() instead of capable()?

        if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
            (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
            !capable(CAP_SYS_ADMIN))
                return -EPERM;

I'm not sure why this is needed at all?  Is it a DoS mitigation?  If
so, couldn't you find a way to acocunt for inefficient unaligned
access similarly to how you account for complexity in general?

        if (attr->insn_cnt == 0 ||
            attr->insn_cnt > (capable(CAP_SYS_ADMIN) ?
BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
                return -E2BIG;

This is similar.  I could imagine a cgroup setting that limits bpf
program complexity.  This is very, very different type of privilege
than reading kernel memory.

        if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
            type != BPF_PROG_TYPE_CGROUP_SKB &&
            !capable(CAP_SYS_ADMIN))
                return -EPERM;

I suspect you could just delete this check or expand the allowable
unprivileged program types after auditing that the other types have
appropriately restrictive verifiers and require appropriate
permissions to *run* the programs.

In bpf_prog_attach():
        if (!capable(CAP_NET_ADMIN))
                return -EPERM;

This looks like it wants to be ns_capable() after you audit the code
to make sure it's safe enough.

bpf_prog_get_fd_by_id(): I really think you just need a real
permission model. Anyone can create a file on a filesystem, and there
are reasonable policies that allow appropriate users to open existing
files by name without CAP_SYS_ADMIN.  Similarly, anyone can create a
key in the kernel keyring subsystem, and there are well-defined rules
under which someone can access an existing key by name.  I think you
should come up with a way to handle this for bpf.  Adding a bpffs for
this seems like a decent approach.

bpf_prog_get_info_by_id():

        if (!capable(CAP_SYS_ADMIN)) {
                info.jited_prog_len = 0;
                info.xlated_prog_len = 0;
                info.nr_jited_ksyms = 0;
                info.nr_jited_func_lens = 0;
                info.nr_func_info = 0;
                info.nr_line_info = 0;
                info.nr_jited_line_info = 0;
                goto done;
        }

It looks like someone decided this information was sensitive if you
query someone else's program.  Again, there are sensible ways to
address this.

Finally, in the verifier:

is_priv = capable(CAP_SYS_ADMIN);

That should just be left alone.  Arguably you could make a new
capability CAP_BPF_LEAK_KERNEL_ADDRESSES or similar for this.

As it stands, it seems like bpf() was designed under the assumption
that the kind of security model needed to make it work well for
unprivileged or differently-privileged users would be developed later.
But, instead of developing such a security model, this patch is
introducing a whole new Linux "capability" that just disables all
security.  From my perspective, this seems like a bad idea.

So, if anyone cares about my opinion, NAK to the whole concept.
Please instead work on fixing this for real, one capable() check at a
time.