Linux-Security-Module Archive on lore.kernel.org
 help / color / Atom feed
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
       [not found]                             ` <369476A8-4CE1-43DA-9239-06437C0384C7@fb.com>
@ 2019-07-30 20:24                               ` Andy Lutomirski
  2019-07-31  8:10                                 ` Song Liu
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-07-30 20:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> > On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
> >>>>>
> >>>>
> >>>> Well, yes. sys_bpf() is pretty powerful.
> >>>>
> >>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>>> the meanwhile, such users should not take down the whole system easily
> >>>> by accident, e.g., with rm -rf /.
> >>>
> >>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >>
> >> This is a great idea! fscaps + /etc/bpfusers should do the trick.
> >
> > After some discussions and more thinking on this, I have some concerns
> > with the user space only approach.
> >
> > IIUC, your proposal for user space only approach is like:
> >
> > 1. bpftool (and other tools) check /etc/bpfusers and only do
> >   setuid for allowed users:
> >
> >       int main()
> >       {
> >               if (/* uid in /etc/bpfusers */)
> >                       setuid(0);
> >               sys_bpf(...);
> >       }
> >
> > 2. bpftool (and other tools) is installed with CAP_SETUID:
> >
> >       setcap cap_setuid=e+p /bin/bpftool
> >
> > 3. sys admin maintains proper /etc/bpfusers.
> >
> > This approach is not ideal, because we need to trust the tool to give
> > it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> > or use other root only sys calls after setuid(0).
> >
>
> I would like more comments on this.
>
> Currently, bpf permission is more or less "root or nothing", which we
> would like to change.
>
> The short term goal is to separate bpf from root, in other words, it is
> "all or nothing". Special user space utilities, such as systemd, would
> benefit from this. Once this is implemented, systemd can call sys_bpf()
> when it is not running as root.

As generally nasty as Linux capabilities are, this sounds like a good
use for CAP_BPF_ADMIN.

But what do you have in mind?  Isn't non-root systemd mostly just the
user systemd session?  That should *not* have bpf() privileges until
bpf() is improved such that you can't use it to compromise the system.

>
> In longer term, it may be useful to provide finer grain permission of
> sys_bpf(). For example, sys_bpf() should be aware of containers; and
> user may only have access to certain bpf maps. Let's call this
> "fine grain" capability.
>
>
> Since we are seeing new use cases every year, we will need many
> iterations to implement the fine grain permission. I think we need an
> API that is flexible enough to cover different types of permission
> control.
>
> For example, bpf_with_cap() can be flexible:
>
>         bpf_with_cap(cmd, attr, size, perm_fd);
>
> We can get different types of permission via different combinations of
> arguments:
>
>     A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>     this is "all or nothing" permission.
>
>     A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>     commands to this specific cgroup.
>

I don't see why you need to invent a whole new mechanism for this.
The entire cgroup ecosystem outside bpf() does just fine using the
write permission on files in cgroupfs to control access.  Why can't
bpf() do the same thing?

>
> Alexei raised another idea in offline discussions: instead of adding
> bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
> permission for the _next_ sys_bpf() from current task:
>
>     bpf(LOAD_PERM_FD, perm_fd);
>     /* the next sys_bpf() uses permission from perm_fd */
>     bpf(cmd, attr, size);
>
> This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but
> doesn't require the new sys call.

That sounds almost every bit as problematic as the approach where you
ask for permission once and it sticks.

>
> 1. User space only approach doesn't work, even for "all or nothing"
>    permission control. I expanded the discussion in the previous
>    email. Please let me know if I missed anything there.

As in my previous email, I disagree.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-07-30 20:24                               ` [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf Andy Lutomirski
@ 2019-07-31  8:10                                 ` Song Liu
  2019-07-31 19:09                                   ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Song Liu @ 2019-07-31  8:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List



> On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Andy,
>> 
>>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Hi Andy,
>>> 
>>> 

[...]

>>> 
>> 
>> I would like more comments on this.
>> 
>> Currently, bpf permission is more or less "root or nothing", which we
>> would like to change.
>> 
>> The short term goal is to separate bpf from root, in other words, it is
>> "all or nothing". Special user space utilities, such as systemd, would
>> benefit from this. Once this is implemented, systemd can call sys_bpf()
>> when it is not running as root.
> 
> As generally nasty as Linux capabilities are, this sounds like a good
> use for CAP_BPF_ADMIN.

I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make 
existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

> 
> But what do you have in mind?  Isn't non-root systemd mostly just the
> user systemd session?  That should *not* have bpf() privileges until
> bpf() is improved such that you can't use it to compromise the system.

cgroup bpf is the major use case here. A less important use case is to 
run bpf selftests without being root. 

> 
>> 
>> In longer term, it may be useful to provide finer grain permission of
>> sys_bpf(). For example, sys_bpf() should be aware of containers; and
>> user may only have access to certain bpf maps. Let's call this
>> "fine grain" capability.
>> 
>> 
>> Since we are seeing new use cases every year, we will need many
>> iterations to implement the fine grain permission. I think we need an
>> API that is flexible enough to cover different types of permission
>> control.
>> 
>> For example, bpf_with_cap() can be flexible:
>> 
>>        bpf_with_cap(cmd, attr, size, perm_fd);
>> 
>> We can get different types of permission via different combinations of
>> arguments:
>> 
>>    A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>>    this is "all or nothing" permission.
>> 
>>    A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>>    commands to this specific cgroup.
>> 
> 
> I don't see why you need to invent a whole new mechanism for this.
> The entire cgroup ecosystem outside bpf() does just fine using the
> write permission on files in cgroupfs to control access.  Why can't
> bpf() do the same thing?

It is easier to use write permission for BPF_PROG_ATTACH. But it is 
not easy to do the same for other bpf commands: BPF_PROG_LOAD and 
BPF_MAP_*. A lot of these commands don't have target concept. Maybe 
we should have target concept for all these commands. But that is a 
much bigger project. OTOH, "all or nothing" model allows all these 
commands at once.

Well, that being said, I will look more into using write permission 
in cgroupfs. 

Thanks again for all these comments and suggestions. Please let us 
know your future thoughts and insights. 

Song

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-07-31  8:10                                 ` Song Liu
@ 2019-07-31 19:09                                   ` Andy Lutomirski
  2019-08-02  7:21                                     ` Song Liu
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-07-31 19:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Hi Andy,
> >>
> >>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Andy,
> >>>
> >>>
>
> [...]
>
> >>>
> >>
> >> I would like more comments on this.
> >>
> >> Currently, bpf permission is more or less "root or nothing", which we
> >> would like to change.
> >>
> >> The short term goal is to separate bpf from root, in other words, it is
> >> "all or nothing". Special user space utilities, such as systemd, would
> >> benefit from this. Once this is implemented, systemd can call sys_bpf()
> >> when it is not running as root.
> >
> > As generally nasty as Linux capabilities are, this sounds like a good
> > use for CAP_BPF_ADMIN.
>
> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

It's been done before -- it's not that hard.  IMO the main tricky bit
would be try be somewhat careful about defining exactly what
CAP_BPF_ADMIN does.

> > I don't see why you need to invent a whole new mechanism for this.
> > The entire cgroup ecosystem outside bpf() does just fine using the
> > write permission on files in cgroupfs to control access.  Why can't
> > bpf() do the same thing?
>
> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> we should have target concept for all these commands. But that is a
> much bigger project. OTOH, "all or nothing" model allows all these
> commands at once.

For BPF_PROG_LOAD, I admit I've never understood why permission is
required at all.  I think that CAP_SYS_ADMIN or similar should be
needed to get is_priv in the verifier, but I think that should mainly
be useful for tracing, and that requires lots of privilege anyway.
BPF_MAP_* is probably the trickiest part.  One solution would be some
kind of bpffs, but I'm sure other solutions are possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-07-31 19:09                                   ` Andy Lutomirski
@ 2019-08-02  7:21                                     ` Song Liu
  2019-08-04 22:16                                       ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Song Liu @ 2019-08-02  7:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

Hi Andy,

> On Jul 31, 2019, at 12:09 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
>>>>> 
>>>>> Hi Andy,
>>>>> 
>>>>> 
>> 
>> [...]
>> 
>>>>> 
>>>> 
>>>> I would like more comments on this.
>>>> 
>>>> Currently, bpf permission is more or less "root or nothing", which we
>>>> would like to change.
>>>> 
>>>> The short term goal is to separate bpf from root, in other words, it is
>>>> "all or nothing". Special user space utilities, such as systemd, would
>>>> benefit from this. Once this is implemented, systemd can call sys_bpf()
>>>> when it is not running as root.
>>> 
>>> As generally nasty as Linux capabilities are, this sounds like a good
>>> use for CAP_BPF_ADMIN.
>> 
>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> 
> It's been done before -- it's not that hard.  IMO the main tricky bit
> would be try be somewhat careful about defining exactly what
> CAP_BPF_ADMIN does.

Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the 
Plumbers conference. 

OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons 
like systemd to do sys_bpf() without root. 

> 
>>> I don't see why you need to invent a whole new mechanism for this.
>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>> write permission on files in cgroupfs to control access.  Why can't
>>> bpf() do the same thing?
>> 
>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>> we should have target concept for all these commands. But that is a
>> much bigger project. OTOH, "all or nothing" model allows all these
>> commands at once.
> 
> For BPF_PROG_LOAD, I admit I've never understood why permission is
> required at all.  I think that CAP_SYS_ADMIN or similar should be
> needed to get is_priv in the verifier, but I think that should mainly
> be useful for tracing, and that requires lots of privilege anyway.
> BPF_MAP_* is probably the trickiest part.  One solution would be some
> kind of bpffs, but I'm sure other solutions are possible.

Improving permission management of cgroup_bpf is another good topic to
discuss. However, it is also an overkill for current use case. 

Let me get more details about the use case, so that we have a clear 
picture about short term and long term goals. 

Thanks again for your suggestions. 
Song



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-02  7:21                                     ` Song Liu
@ 2019-08-04 22:16                                       ` Andy Lutomirski
  2019-08-05  0:08                                         ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-04 22:16 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
 >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> >
> > It's been done before -- it's not that hard.  IMO the main tricky bit
> > would be try be somewhat careful about defining exactly what
> > CAP_BPF_ADMIN does.
>
> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> Plumbers conference.
>
> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> like systemd to do sys_bpf() without root.

I don't understand the use case here.  Are you talking about systemd
--user?  As far as I know, a user is expected to be able to fully
control their systemd --user process, so giving it unrestricted bpf
access is very close to giving it superuser access, and this doesn't
sound like a good idea.  I think that, if systemd --user needs bpf(),
it either needs real unprivileged bpf() or it needs a privileged
helper (SUID or a daemon) to intermediate this access.

>
> >
> >>> I don't see why you need to invent a whole new mechanism for this.
> >>> The entire cgroup ecosystem outside bpf() does just fine using the
> >>> write permission on files in cgroupfs to control access.  Why can't
> >>> bpf() do the same thing?
> >>
> >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> >> we should have target concept for all these commands. But that is a
> >> much bigger project. OTOH, "all or nothing" model allows all these
> >> commands at once.
> >
> > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > needed to get is_priv in the verifier, but I think that should mainly
> > be useful for tracing, and that requires lots of privilege anyway.
> > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > kind of bpffs, but I'm sure other solutions are possible.
>
> Improving permission management of cgroup_bpf is another good topic to
> discuss. However, it is also an overkill for current use case.
>

I looked at the code some more, and I don't think this is so hard
after all.  As I understand it, all of the map..by_id stuff is, to
some extent, deprecated in favor of persistent maps.  As I see it, the
map..by_id calls should require privilege forever, although I can
imagine ways to scope that privilege to a namespace if the maps
themselves were to be scoped to a namespace.

Instead, unprivileged tools would use the persistent map interface
roughly like this:

$ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
8 entries 64 name mapname

This would require that the caller have either CAP_DAC_OVERRIDE or
that the caller have permission to create files in /sys/fs/bpf/my_dir
(using the same rules as for any filesystem), and the resulting map
would end up owned by the creating user and have mode 0600 (or maybe
0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
the various capable() checks that are currently involved in accessing
a persistent map would instead check FMODE_READ or FMODE_WRITE on the
map file as appropriate.

Half of this stuff already works.  I just set my system up like this:

$ ls -l /sys/fs/bpf
total 0
drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto

$ mkdir /sys/fs/bpf/luto/test

$ ls -l /sys/fs/bpf/luto
total 0
drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test

I bet that making the bpf() syscalls work appropriately in this
context without privilege would only be a couple of hours of work.
The hard work, creating bpffs and making it function, is already done
:)

P.S. The docs for bpftool create are less than fantastic.  The
complete lack of any error message at all when the syscall returns
-EACCES is also not fantastic.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-04 22:16                                       ` Andy Lutomirski
@ 2019-08-05  0:08                                         ` Andy Lutomirski
  2019-08-05  5:47                                           ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-05  0:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
>  >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> > >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> > >
> > > It's been done before -- it's not that hard.  IMO the main tricky bit
> > > would be try be somewhat careful about defining exactly what
> > > CAP_BPF_ADMIN does.
> >
> > Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> > Plumbers conference.
> >
> > OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> > like systemd to do sys_bpf() without root.
>
> I don't understand the use case here.  Are you talking about systemd
> --user?  As far as I know, a user is expected to be able to fully
> control their systemd --user process, so giving it unrestricted bpf
> access is very close to giving it superuser access, and this doesn't
> sound like a good idea.  I think that, if systemd --user needs bpf(),
> it either needs real unprivileged bpf() or it needs a privileged
> helper (SUID or a daemon) to intermediate this access.
>
> >
> > >
> > >>> I don't see why you need to invent a whole new mechanism for this.
> > >>> The entire cgroup ecosystem outside bpf() does just fine using the
> > >>> write permission on files in cgroupfs to control access.  Why can't
> > >>> bpf() do the same thing?
> > >>
> > >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> > >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> > >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> > >> we should have target concept for all these commands. But that is a
> > >> much bigger project. OTOH, "all or nothing" model allows all these
> > >> commands at once.
> > >
> > > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > > needed to get is_priv in the verifier, but I think that should mainly
> > > be useful for tracing, and that requires lots of privilege anyway.
> > > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > > kind of bpffs, but I'm sure other solutions are possible.
> >
> > Improving permission management of cgroup_bpf is another good topic to
> > discuss. However, it is also an overkill for current use case.
> >
>
> I looked at the code some more, and I don't think this is so hard
> after all.  As I understand it, all of the map..by_id stuff is, to
> some extent, deprecated in favor of persistent maps.  As I see it, the
> map..by_id calls should require privilege forever, although I can
> imagine ways to scope that privilege to a namespace if the maps
> themselves were to be scoped to a namespace.
>
> Instead, unprivileged tools would use the persistent map interface
> roughly like this:
>
> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
> 8 entries 64 name mapname
>
> This would require that the caller have either CAP_DAC_OVERRIDE or
> that the caller have permission to create files in /sys/fs/bpf/my_dir
> (using the same rules as for any filesystem), and the resulting map
> would end up owned by the creating user and have mode 0600 (or maybe
> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
> the various capable() checks that are currently involved in accessing
> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
> map file as appropriate.
>
> Half of this stuff already works.  I just set my system up like this:
>
> $ ls -l /sys/fs/bpf
> total 0
> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>
> $ mkdir /sys/fs/bpf/luto/test
>
> $ ls -l /sys/fs/bpf/luto
> total 0
> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>
> I bet that making the bpf() syscalls work appropriately in this
> context without privilege would only be a couple of hours of work.
> The hard work, creating bpffs and making it function, is already done
> :)
>
> P.S. The docs for bpftool create are less than fantastic.  The
> complete lack of any error message at all when the syscall returns
> -EACCES is also not fantastic.

This isn't remotely finished, but I spent a bit of time fiddling with this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms

What do you think?  (It's obviously not done.  It doesn't compile, and
I haven't gotten to the permissions needed to do map operations.  I
also haven't touched the capable() checks.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05  0:08                                         ` Andy Lutomirski
@ 2019-08-05  5:47                                           ` Andy Lutomirski
  2019-08-05  7:36                                             ` Song Liu
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-05  5:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
> > >
> > > Hi Andy,
> > >
> >  >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> > > >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> > > >
> > > > It's been done before -- it's not that hard.  IMO the main tricky bit
> > > > would be try be somewhat careful about defining exactly what
> > > > CAP_BPF_ADMIN does.
> > >
> > > Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> > > Plumbers conference.
> > >
> > > OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> > > like systemd to do sys_bpf() without root.
> >
> > I don't understand the use case here.  Are you talking about systemd
> > --user?  As far as I know, a user is expected to be able to fully
> > control their systemd --user process, so giving it unrestricted bpf
> > access is very close to giving it superuser access, and this doesn't
> > sound like a good idea.  I think that, if systemd --user needs bpf(),
> > it either needs real unprivileged bpf() or it needs a privileged
> > helper (SUID or a daemon) to intermediate this access.
> >
> > >
> > > >
> > > >>> I don't see why you need to invent a whole new mechanism for this.
> > > >>> The entire cgroup ecosystem outside bpf() does just fine using the
> > > >>> write permission on files in cgroupfs to control access.  Why can't
> > > >>> bpf() do the same thing?
> > > >>
> > > >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> > > >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> > > >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> > > >> we should have target concept for all these commands. But that is a
> > > >> much bigger project. OTOH, "all or nothing" model allows all these
> > > >> commands at once.
> > > >
> > > > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > > > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > > > needed to get is_priv in the verifier, but I think that should mainly
> > > > be useful for tracing, and that requires lots of privilege anyway.
> > > > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > > > kind of bpffs, but I'm sure other solutions are possible.
> > >
> > > Improving permission management of cgroup_bpf is another good topic to
> > > discuss. However, it is also an overkill for current use case.
> > >
> >
> > I looked at the code some more, and I don't think this is so hard
> > after all.  As I understand it, all of the map..by_id stuff is, to
> > some extent, deprecated in favor of persistent maps.  As I see it, the
> > map..by_id calls should require privilege forever, although I can
> > imagine ways to scope that privilege to a namespace if the maps
> > themselves were to be scoped to a namespace.
> >
> > Instead, unprivileged tools would use the persistent map interface
> > roughly like this:
> >
> > $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
> > 8 entries 64 name mapname
> >
> > This would require that the caller have either CAP_DAC_OVERRIDE or
> > that the caller have permission to create files in /sys/fs/bpf/my_dir
> > (using the same rules as for any filesystem), and the resulting map
> > would end up owned by the creating user and have mode 0600 (or maybe
> > 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
> > the various capable() checks that are currently involved in accessing
> > a persistent map would instead check FMODE_READ or FMODE_WRITE on the
> > map file as appropriate.
> >
> > Half of this stuff already works.  I just set my system up like this:
> >
> > $ ls -l /sys/fs/bpf
> > total 0
> > drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
> >
> > $ mkdir /sys/fs/bpf/luto/test
> >
> > $ ls -l /sys/fs/bpf/luto
> > total 0
> > drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
> >
> > I bet that making the bpf() syscalls work appropriately in this
> > context without privilege would only be a couple of hours of work.
> > The hard work, creating bpffs and making it function, is already done
> > :)
> >
> > P.S. The docs for bpftool create are less than fantastic.  The
> > complete lack of any error message at all when the syscall returns
> > -EACCES is also not fantastic.
>
> This isn't remotely finished, but I spent a bit of time fiddling with this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>
> What do you think?  (It's obviously not done.  It doesn't compile, and
> I haven't gotten to the permissions needed to do map operations.  I
> also haven't touched the capable() checks.)

I updated the branch.  It compiles, and basic map functionality works!

# mount -t bpf bpf /sys/fs/bpf
# cd /sys/fs/bpf
# mkdir luto
# chown luto: luto
# setpriv --euid=1000 --ruid=1000 bash
$ pwd
/sys/fs/bpf
bash-5.0$ ls -l
total 0
drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
value 8 entries 64 name mapname
bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Found 0 elements

# chown root: /sys/fs/bpf/luto/filename

$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Error: bpf obj get (/sys/fs/bpf/luto): Permission denied

So I think it's possible to get a respectable subset of bpf()
functionality working without privilege in short order :)

(FWIW, a decent fraction of this probably works even without my
patches, but it's going to have nonsensical semantics and may fail for
silly reasons.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05  5:47                                           ` Andy Lutomirski
@ 2019-08-05  7:36                                             ` Song Liu
  2019-08-05 17:23                                               ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Song Liu @ 2019-08-05  7:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

Hi Andy, 

> On Aug 4, 2019, at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>>>>>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>>>>> 
>>>>> It's been done before -- it's not that hard.  IMO the main tricky bit
>>>>> would be try be somewhat careful about defining exactly what
>>>>> CAP_BPF_ADMIN does.
>>>> 
>>>> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
>>>> Plumbers conference.
>>>> 
>>>> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
>>>> like systemd to do sys_bpf() without root.
>>> 
>>> I don't understand the use case here.  Are you talking about systemd
>>> --user?  As far as I know, a user is expected to be able to fully
>>> control their systemd --user process, so giving it unrestricted bpf
>>> access is very close to giving it superuser access, and this doesn't
>>> sound like a good idea.  I think that, if systemd --user needs bpf(),
>>> it either needs real unprivileged bpf() or it needs a privileged
>>> helper (SUID or a daemon) to intermediate this access.
>>> 
>>>> 
>>>>> 
>>>>>>> I don't see why you need to invent a whole new mechanism for this.
>>>>>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>>>>>> write permission on files in cgroupfs to control access.  Why can't
>>>>>>> bpf() do the same thing?
>>>>>> 
>>>>>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>>>>>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>>>>>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>>>>>> we should have target concept for all these commands. But that is a
>>>>>> much bigger project. OTOH, "all or nothing" model allows all these
>>>>>> commands at once.
>>>>> 
>>>>> For BPF_PROG_LOAD, I admit I've never understood why permission is
>>>>> required at all.  I think that CAP_SYS_ADMIN or similar should be
>>>>> needed to get is_priv in the verifier, but I think that should mainly
>>>>> be useful for tracing, and that requires lots of privilege anyway.
>>>>> BPF_MAP_* is probably the trickiest part.  One solution would be some
>>>>> kind of bpffs, but I'm sure other solutions are possible.
>>>> 
>>>> Improving permission management of cgroup_bpf is another good topic to
>>>> discuss. However, it is also an overkill for current use case.
>>>> 
>>> 
>>> I looked at the code some more, and I don't think this is so hard
>>> after all.  As I understand it, all of the map..by_id stuff is, to
>>> some extent, deprecated in favor of persistent maps.  As I see it, the
>>> map..by_id calls should require privilege forever, although I can
>>> imagine ways to scope that privilege to a namespace if the maps
>>> themselves were to be scoped to a namespace.
>>> 
>>> Instead, unprivileged tools would use the persistent map interface
>>> roughly like this:
>>> 
>>> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
>>> 8 entries 64 name mapname
>>> 
>>> This would require that the caller have either CAP_DAC_OVERRIDE or
>>> that the caller have permission to create files in /sys/fs/bpf/my_dir
>>> (using the same rules as for any filesystem), and the resulting map
>>> would end up owned by the creating user and have mode 0600 (or maybe
>>> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
>>> the various capable() checks that are currently involved in accessing
>>> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
>>> map file as appropriate.
>>> 
>>> Half of this stuff already works.  I just set my system up like this:
>>> 
>>> $ ls -l /sys/fs/bpf
>>> total 0
>>> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>>> 
>>> $ mkdir /sys/fs/bpf/luto/test
>>> 
>>> $ ls -l /sys/fs/bpf/luto
>>> total 0
>>> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>>> 
>>> I bet that making the bpf() syscalls work appropriately in this
>>> context without privilege would only be a couple of hours of work.
>>> The hard work, creating bpffs and making it function, is already done
>>> :)
>>> 
>>> P.S. The docs for bpftool create are less than fantastic.  The
>>> complete lack of any error message at all when the syscall returns
>>> -EACCES is also not fantastic.
>> 
>> This isn't remotely finished, but I spent a bit of time fiddling with this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>> 
>> What do you think?  (It's obviously not done.  It doesn't compile, and
>> I haven't gotten to the permissions needed to do map operations.  I
>> also haven't touched the capable() checks.)
> 
> I updated the branch.  It compiles, and basic map functionality works!

Thanks a lot for trying this out. This is a very interesting direction
that we will explore. 

> 
> # mount -t bpf bpf /sys/fs/bpf
> # cd /sys/fs/bpf
> # mkdir luto
> # chown luto: luto
> # setpriv --euid=1000 --ruid=1000 bash
> $ pwd
> /sys/fs/bpf
> bash-5.0$ ls -l
> total 0
> drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
> bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> value 8 entries 64 name mapname
> bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Found 0 elements
> 
> # chown root: /sys/fs/bpf/luto/filename
> 
> $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
> 
> So I think it's possible to get a respectable subset of bpf()
> functionality working without privilege in short order :)

I think we have two key questions to answer: 
  1. What subset of bpf() functionality will the users need?
  2. Who are the users? 

Different answers to these two questions lead to different directions.


In our use case, the answers are 
  1) almost all bpf() functionality
  2) highly trusted users (sudoers)

So our initial approach of /dev/bpf allows all bpf() functionality
in one bit in task_struct. (Yes, we can just sudo. But, we would 
rather not use sudo when possible.)


"cgroup management" use case may have answers like:
  1) cgroup_bpf only
  2) users in their own containers

For this case, getting cgroup_bpf related features (cgroup_bpf progs; 
some map types, etc.) work with unprivileged users would be the right 
direction. 


"USDT tracing" use case may have answers like:
  1) uprobe, stockmap, histogram, etc.
  2) unprivileged user, w/ or w/o containers

For this case, the first step is likely hacking sys_perf_event_open(). 


I guess we will need more discussions to decide how to make bpf() 
work better for all these (and more) use cases. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05  7:36                                             ` Song Liu
@ 2019-08-05 17:23                                               ` Andy Lutomirski
  2019-08-05 19:21                                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-05 17:23 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Mon, Aug 5, 2019 at 12:37 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>

> >
> > # mount -t bpf bpf /sys/fs/bpf
> > # cd /sys/fs/bpf
> > # mkdir luto
> > # chown luto: luto
> > # setpriv --euid=1000 --ruid=1000 bash
> > $ pwd
> > /sys/fs/bpf
> > bash-5.0$ ls -l
> > total 0
> > drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
> > bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> > value 8 entries 64 name mapname
> > bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> > Found 0 elements
> >
> > # chown root: /sys/fs/bpf/luto/filename
> >
> > $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> > Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
> >
> > So I think it's possible to get a respectable subset of bpf()
> > functionality working without privilege in short order :)
>
> I think we have two key questions to answer:
>   1. What subset of bpf() functionality will the users need?
>   2. Who are the users?
>
> Different answers to these two questions lead to different directions.
>
>
> In our use case, the answers are
>   1) almost all bpf() functionality
>   2) highly trusted users (sudoers)
>
> So our initial approach of /dev/bpf allows all bpf() functionality
> in one bit in task_struct. (Yes, we can just sudo. But, we would
> rather not use sudo when possible.)

For this, I think some compelling evidence is needed that a new kernel
mechanism is actually better than sudo and better than making bpftool
privileged as previously discussed :)

>
>
> "cgroup management" use case may have answers like:
>   1) cgroup_bpf only
>   2) users in their own containers
>
> For this case, getting cgroup_bpf related features (cgroup_bpf progs;
> some map types, etc.) work with unprivileged users would be the right
> direction.

:)

>
>
> "USDT tracing" use case may have answers like:
>   1) uprobe, stockmap, histogram, etc.
>   2) unprivileged user, w/ or w/o containers
>
> For this case, the first step is likely hacking sys_perf_event_open().
>

This would be nice.

>
> I guess we will need more discussions to decide how to make bpf()
> work better for all these (and more) use cases.
>

I refreshed the branch again.  I had a giant hole in my previous idea
that we could deprivilege program loading: some BPF functions need
privilege.  Now I have a changelog comment to that effect and a patch
that sketches out a way to addressing this.

I don't think I'm going to have time soon to actually get any of this
stuff mergeable, and it would be fantastic if you or someone else who
likes working of bpf were to take this code and run with it.  Feel
free to add my Signed-off-by, and I'd be happy to help review.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05 17:23                                               ` Andy Lutomirski
@ 2019-08-05 19:21                                                 ` Alexei Starovoitov
  2019-08-05 21:25                                                   ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-05 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Mon, Aug 05, 2019 at 10:23:10AM -0700, Andy Lutomirski wrote:
> 
> I refreshed the branch again.  I had a giant hole in my previous idea
> that we could deprivilege program loading: some BPF functions need
> privilege.  Now I have a changelog comment to that effect and a patch
> that sketches out a way to addressing this.
> 
> I don't think I'm going to have time soon to actually get any of this
> stuff mergeable, and it would be fantastic if you or someone else who
> likes working of bpf were to take this code and run with it.  Feel
> free to add my Signed-off-by, and I'd be happy to help review.

Thanks a lot for working on patches and helping us with the design!

Can you resend the patches to the mailing list?
It's kinda hard to reply/review to patches that are somewhere in the web.
I'm still trying to understand the main idea.
If I'm reading things correctly:
patch 1 "add access permissions to bpf fds"
  just passes the flags ?
patch 2 "Don't require mknod() permission to pin an object" 
 makes sense in isolation.
patch 3 "Allow creating all program types without privilege"
  is not right.
patch 4 "Add a way to mark functions as requiring privilege"
 is an interesting idea, but I don't think it helps that much.

So the main thing we're trying to solve with augmented bpf syscall
and/or /dev/bpf is to be able to use root-only features of bpf when
trused process already dropped root permissions.
These features include bpf2bpf calls, bounded loops, special maps (like LPM), etc.

Attaching to a cgroup already has file based permission checks.
The user needs to open cgroup directory to attach.
acls on cgroup dir can already be used to prevent attaching to
certain parts of cgroup hierarchy.

It seems this discussion is centered around making /dev/bpf to
let unpriv (and not trusted) users (humans) to do bpf.
That's not quite the case.
It's a good use case, but not the one we're after at the moment.
In our enviroment bpftrace, bpftool, all bcc tools are pre-installed
and the users (humans) can simply 'sudo' to run them.
Adding suid bit to installed bpftool binary is doable, but there is no need.
'sudo' works just fine.
What we need is to drop privileges sooner in daemons like systemd.
Container management daemon runs in the nested containers.
These trusted daemons need to have access to full bpf, but they
don't want to be root all the time.
They cannot flip back and forth via seteuid to root every time they
need to do bpf.
Hence the idea is to have a file that this daemon can open,
then drop privileges and still keep doing bpf things because FD is held.
Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
This /dev/bpf would be accessible to root only.
There is no desire to open it up to non-root.

It seems there is concern that /dev/bpf is unnecessary special.
How about we combine bpffs and /dev/bpf ideas?
Like we can have a special file name in bpffs.
The root would do 'touch /sys/fs/bpf/privileges' and it would behave
just like /dev/bpf, but now it can be in any bpffs directory and acls
to bpffs mount would work as-is.

CAP_BPF is also good idea. I think for the enviroment where untrusted
and unprivileged users want to run 'bpftrace' that would be perfect mechanism.
getcap /bin/bpftrace would have cap_bpf, cap_kprobe and whatever else.
Sort of like /bin/ping.
But I don't see how cap_bpf helps to solve our trusted root daemon problem.
imo open ("/sys/fs/bpf/privileges") and pass that FD into bpf syscall
is the only viable mechanism.

Note the verifier does very different amount of work for unpriv vs root.
It does speculative execution analysis, pointer leak checks for unpriv.
So we gotta pass special flag to the verifier to make it act like it's
loading a program for root.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05 19:21                                                 ` Alexei Starovoitov
@ 2019-08-05 21:25                                                   ` Andy Lutomirski
  2019-08-05 22:21                                                     ` Andy Lutomirski
  2019-08-06  1:11                                                     ` Alexei Starovoitov
  0 siblings, 2 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-05 21:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 05, 2019 at 10:23:10AM -0700, Andy Lutomirski wrote:
> >
> > I refreshed the branch again.  I had a giant hole in my previous idea
> > that we could deprivilege program loading: some BPF functions need
> > privilege.  Now I have a changelog comment to that effect and a patch
> > that sketches out a way to addressing this.
> >
> > I don't think I'm going to have time soon to actually get any of this
> > stuff mergeable, and it would be fantastic if you or someone else who
> > likes working of bpf were to take this code and run with it.  Feel
> > free to add my Signed-off-by, and I'd be happy to help review.
>
> Thanks a lot for working on patches and helping us with the design!
>
> Can you resend the patches to the mailing list?
> It's kinda hard to reply/review to patches that are somewhere in the web.

Will do.

> I'm still trying to understand the main idea.
> If I'm reading things correctly:

The series doesn't, strictly speaking, have an overall problem that it
solves.  It's a series of steps in the direction of making bpf() make
more sense without privilege and toward reducing the required
privilege.

> patch 1 "add access permissions to bpf fds"
>   just passes the flags ?

It tries to make the kernel respect the access modes for fds.  Without
this patch, there seem to be some holes: nothing looked at program fds
and, unless I missed something, you could take a readonly fd for a
program, pin the program, and reopen it RW.

> patch 2 "Don't require mknod() permission to pin an object"
>  makes sense in isolation.

It makes even more sense now :)

> patch 3 "Allow creating all program types without privilege"
>   is not right.

I think it can be made right, which is the point.

> patch 4 "Add a way to mark functions as requiring privilege"
>  is an interesting idea, but I don't think it helps that much.

Other than the issue that this patch partially fixes, can you see any
reason that loading a program should require privilege?  Obviously the
verifier is weakened a bit when called by privileged users, but a lot
of that is about excessive resource usage and various less-well-tested
features.  It seems to me that most of the value of bpf() should be
available to programs that should not need privilege to load.  Are
there things I'm missing?

>
> So the main thing we're trying to solve with augmented bpf syscall
> and/or /dev/bpf is to be able to use root-only features of bpf when
> trused process already dropped root permissions.
> These features include bpf2bpf calls, bounded loops, special maps (like LPM), etc.

Can you elaborate on all these:

I see nothing inherently wrong with bpf2bpf for unprivileged users as
long as they have appropriate access to the called program.  Patch 1
improves that.

Bounded loops: if they are adequately well verified, then the only
damage is that they can make bpf progs that run slowly, right?  It
seems like some kind of capability or sysctl for "allow using lots of
bpf resources" would do the trick.  This could even be a cgroup
setting -- bpf resources aren't all that different from any other
resource.

LPM: I don't see why this requires privilege at all.  It indeed checks
capable(CAP_SYS_ADMIN), but I don't see why.

>
> Attaching to a cgroup already has file based permission checks.
> The user needs to open cgroup directory to attach.
> acls on cgroup dir can already be used to prevent attaching to
> certain parts of cgroup hierarchy.

The current checks seem inadequate.

$ echo 'yay' </sys/fs/cgroup/systemd/system.slice/

The ability to obtain an fd to a cgroup does *not* imply any right to
modify that cgroup.  The ability to write to a cgroup directory
already means something else -- it's the ability to create cgroups
under the group in question.  I'm suggesting that a new API be added
that allows attaching a bpf program to a cgroup without capabilities
and that instead requires write access to a new file in the cgroup
directory.  (It could be a single file for all bpf types or one file
per type.  I prefer the latter -- it gives the admin finer-grained
control.)

> What we need is to drop privileges sooner in daemons like systemd.

This is doable right now: systemd could fork off a subprocess and
delegate its cgroup operations to it.  It would be maybe a couple
hundred lines of code.  As an added benefit, that subprocess could
verify that the bpf operations in question are reasonable.
Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
capability and flip it on and off as needed.

> Container management daemon runs in the nested containers.
> These trusted daemons need to have access to full bpf, but they
> don't want to be root all the time.
> They cannot flip back and forth via seteuid to root every time they
> need to do bpf.
> Hence the idea is to have a file that this daemon can open,
> then drop privileges and still keep doing bpf things because FD is held.
> Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
> This /dev/bpf would be accessible to root only.
> There is no desire to open it up to non-root.

This seems extremely dangerous right now.  A program that can bypass
*all* of the capable() checks in bpf() can do a whole lot.  Among
other things, it can read all of kernel memory.  It can very likely
gain full system root by appropriate installation of malicious
programs in a cgroup that contains fully privileged programs.  In this
regard, bpf() is like most of the Linux capabilities -- it seems
somewhat limited, but it really implies a lot of privilege.  There was
a little paper awhile back pointing out that, on a normal system, most
of the Linux capabilities were functionally equivalent.

>
> It seems there is concern that /dev/bpf is unnecessary special.
> How about we combine bpffs and /dev/bpf ideas?
> Like we can have a special file name in bpffs.
> The root would do 'touch /sys/fs/bpf/privileges' and it would behave
> just like /dev/bpf, but now it can be in any bpffs directory and acls
> to bpffs mount would work as-is.

This seems to have most of the same problems.  My main point is that
it conflates a whole lot of different permissions, and I really don't
think it's that much work to mostly disentangle the permissions in
question.  My little series (if completed) plus a patch to allow
unprivileged cgroup attach operations if you have an FMODE_WRITE fd to
an appropriate file should get most of the way there.

Also, be careful about your bpffs idea: bpffs is (sort of) namespaced,
and it would make sense to allow new bpf instances to be created
inside unprivileged user namespaces.  Such instances should not be
able to create magical privilege-granting files.  In that respect,
/dev/bpf is better.

>
> CAP_BPF is also good idea. I think for the enviroment where untrusted
> and unprivileged users want to run 'bpftrace' that would be perfect mechanism.
> getcap /bin/bpftrace would have cap_bpf, cap_kprobe and whatever else.
> Sort of like /bin/ping.
> But I don't see how cap_bpf helps to solve our trusted root daemon problem.
> imo open ("/sys/fs/bpf/privileges") and pass that FD into bpf syscall
> is the only viable mechanism.
>

As above, I think that forking before dropping privileges and asking
the child to do the bpf() operations is safer and more flexible.

> Note the verifier does very different amount of work for unpriv vs root.
> It does speculative execution analysis, pointer leak checks for unpriv.
> So we gotta pass special flag to the verifier to make it act like it's
> loading a program for root.
>

Indeed.  And programs in untrusted containers should not be able to do this.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05 21:25                                                   ` Andy Lutomirski
@ 2019-08-05 22:21                                                     ` Andy Lutomirski
  2019-08-06  1:11                                                     ` Alexei Starovoitov
  1 sibling, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-05 22:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List



> On Aug 5, 2019, at 2:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:


> 
>> What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

I tried to look at the code and I couldn’t find it. Does systemd drop privileges at all?  Can you point me at the code you’re thinking of

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-05 21:25                                                   ` Andy Lutomirski
  2019-08-05 22:21                                                     ` Andy Lutomirski
@ 2019-08-06  1:11                                                     ` Alexei Starovoitov
  2019-08-07  5:24                                                       ` Andy Lutomirski
  1 sibling, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-06  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> It tries to make the kernel respect the access modes for fds.  Without
> this patch, there seem to be some holes: nothing looked at program fds
> and, unless I missed something, you could take a readonly fd for a
> program, pin the program, and reopen it RW.

I think it's by design. iirc Daniel had a use case for something like this.

The key to understand is that bpf is not about security. It's about safety.
All features are geared to safety.
The users are trusted most of the time.
The number of unprivileged bpf use cases is tiny compared to trusted.
Hence unprivileged bpf is actually something that can be deprecated.

> Other than the issue that this patch partially fixes, can you see any
> reason that loading a program should require privilege?  Obviously the
> verifier is weakened a bit when called by privileged users, but a lot
> of that is about excessive resource usage and various less-well-tested
> features.  It seems to me that most of the value of bpf() should be
> available to programs that should not need privilege to load.  Are
> there things I'm missing?

see below.

> LPM: I don't see why this requires privilege at all.  It indeed checks
> capable(CAP_SYS_ADMIN), but I don't see why.

see below.

> 
> >
> > Attaching to a cgroup already has file based permission checks.
> > The user needs to open cgroup directory to attach.
> > acls on cgroup dir can already be used to prevent attaching to
> > certain parts of cgroup hierarchy.
> 
> The current checks seem inadequate.
> 
> $ echo 'yay' </sys/fs/cgroup/systemd/system.slice/
> 
> The ability to obtain an fd to a cgroup does *not* imply any right to
> modify that cgroup.  The ability to write to a cgroup directory
> already means something else -- it's the ability to create cgroups
> under the group in question.  I'm suggesting that a new API be added
> that allows attaching a bpf program to a cgroup without capabilities
> and that instead requires write access to a new file in the cgroup
> directory.  (It could be a single file for all bpf types or one file
> per type.  I prefer the latter -- it gives the admin finer-grained
> control.)

This is something to discuss. I don't mind something like this,
but in general bpf is not for untrusted users.
Hence I don't want to overdesign.

> 
> > What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
bpf based IP sandboxing doesn't work in 'systemd --user'.
That is just one of the problems that people complained about.

Note that systemd bpf usage is basic. There is ongoing work to
adopt libbpf in systemd, so more features and more use cases will open up.

Inside containers and inside nested containers we need to start processes
that will use bpf. All of the processes are trusted.
We need to drop root not to be secure, but to be safe.
Consider a bug in a code that accidently did sys_kill(-1).
Dropping root is a mitigation for bugs like this.

> 
> > Container management daemon runs in the nested containers.
> > These trusted daemons need to have access to full bpf, but they
> > don't want to be root all the time.
> > They cannot flip back and forth via seteuid to root every time they
> > need to do bpf.
> > Hence the idea is to have a file that this daemon can open,
> > then drop privileges and still keep doing bpf things because FD is held.
> > Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
> > This /dev/bpf would be accessible to root only.
> > There is no desire to open it up to non-root.
> 
> This seems extremely dangerous right now.  A program that can bypass
> *all* of the capable() checks in bpf() can do a whole lot.  Among
> other things, it can read all of kernel memory. 

It's 'dangerous' only if you think about it from security point of view.
The tracing (and sometimes networking) bpf progs need to read all of kernel
memory without being root.
That is the whole point of the /dev/bpf.

> This seems to have most of the same problems.  My main point is that
> it conflates a whole lot of different permissions, and I really don't
> think it's that much work to mostly disentangle the permissions in
> question.  My little series (if completed) plus a patch to allow
> unprivileged cgroup attach operations if you have an FMODE_WRITE fd to
> an appropriate file should get most of the way there.

I think I understand your concern. One /dev/bpf magic to by-pass
all of capable() checks in bpf doesn't look nice indeed.

Now to answer your question about capable(sys_admin) in the verifier.

See this bugfix:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7c2e988f400e83501e0a3568250780609b7c8263

It's a bug in the JIT that was there pretty much forever,
but it got exposed due to two new bpf features.
Initially syzbot complained that bounded loops allow jmp to 1st insn.
syzbot reproducer contained 'never taken' branch, so buggy JIT
wouldn't have caused a kernel panic for this reproducer though
JITed x86 code is broken.
It took me few days to realize that with bpf2bpf calls _and_
bounded loops I can craft a test that will expose this JIT bug
and will crash the kernel.
So a combination of two recent verifier features exposed old JIT bug
that would have been a security issue if we didn't gate these verifier
features for root only.
Now it's simply a kernel bugfix.

Such subtle bugs happen all the time when new verifier features are
introduced. Hence we always start them with root only.
bpf2bpf calls, bounded loops, precision tracking, dead code elimination,
LPM maps, many programs type are root only.
We don't want cve-s to be filed for every bug like this.

Even if we start relaxing features (like dropping root from LPM map)
at any given time there will be a lot of useful verifier features,
maps and program types that are root only.
For root == for trusted users only.
Unfortunately this approach creates adoption problem.
The trusted users don't want to be root to use bpf.
Hence this /dev/bpf.

To solve your concern of bypassing all capable checks...
How about we do /dev/bpf/full_verifier first?
It will replace capable() checks in the verifier only.

How to delegate cgroup attach permission is a follow up discussion.
Could be special files in cgroup dir as you proposed or something else.
Let's table that and focus on the verifier first.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-06  1:11                                                     ` Alexei Starovoitov
@ 2019-08-07  5:24                                                       ` Andy Lutomirski
  2019-08-07  9:03                                                         ` Lorenz Bauer
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-07  5:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> > It tries to make the kernel respect the access modes for fds.  Without
> > this patch, there seem to be some holes: nothing looked at program fds
> > and, unless I missed something, you could take a readonly fd for a
> > program, pin the program, and reopen it RW.
>
> I think it's by design. iirc Daniel had a use case for something like this.

That seems odd.  Daniel, can you elaborate?

> Hence unprivileged bpf is actually something that can be deprecated.

I hope not.  There are a couple setsockopt uses right now, and and
seccomp will surely want it someday.  And the bpf-inside-container use
case really is unprivileged bpf -- containers are, in many (most?)
cases, explicitly not trusted by the host.  But regardless, see below.

> > The ability to obtain an fd to a cgroup does *not* imply any right to
> > modify that cgroup.  The ability to write to a cgroup directory
> > already means something else -- it's the ability to create cgroups
> > under the group in question.  I'm suggesting that a new API be added
> > that allows attaching a bpf program to a cgroup without capabilities
> > and that instead requires write access to a new file in the cgroup
> > directory.  (It could be a single file for all bpf types or one file
> > per type.  I prefer the latter -- it gives the admin finer-grained
> > control.)
>
> This is something to discuss. I don't mind something like this,
> but in general bpf is not for untrusted users.
> Hence I don't want to overdesign.

I think the code would end up being extremely simple.  It would be an
ABI change, but I think that it could easily be arranged so that new
and old libbpf would be fully functional on new and old kernels except
that only new libbpf with a new kernel would be able to attach cgroup
programs without privilege.  I see no reason to remove the current
cgroup attach API.

> See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
> bpf based IP sandboxing doesn't work in 'systemd --user'.
> That is just one of the problems that people complained about.

This isn't privilege *dropping* -- this is not having privilege in the
first place.  Giving systemd --user unrestricted use of bpf() is
tantamount to making the user root.  But again, see below.

>
> Inside containers and inside nested containers we need to start processes
> that will use bpf. All of the processes are trusted.

Trusted by whom?  In a non-nested container, the container manager
*might* be trusted by the outside world.  In a *nested* container,
unless the inner container management is controlled from outside the
outer container, it's not trusted.  I don't know much about how
Facebook's containers work, but the LXC/LXD/Podman world is moving
very strongly toward user namespaces and maximally-untrusted
containers, and I think bpf() should work in that context.

> To solve your concern of bypassing all capable checks...
> How about we do /dev/bpf/full_verifier first?
> It will replace capable() checks in the verifier only.

I'm not convinced that "in the verifier" is the right distinction.
Telling administrators that some setting lets certain users bypass
bpf() verifier checks doesn't have a clear enough meaning.  I propose,
instead, that the current capable() checks be divided into three
categories:

a) Those that, by design, control privileged operations.  This
includes most attach calls, but it also includes allow_ptr_leaks,
bpf_probe_read(), and quite a few other things.  It also includes all
of the by_id calls, I think, unless some clever modification to the
way they worked would isolate different users' objects.  I think that
persistent objects can do pretty much everything that by_id users
would need, so this isn't a big deal.

b) Those that protect code that is considered to be at higher risk of
exposing security bugs.  In other words, these are things that would
have no need for privilege if we could prove that the implementation
was perfect.  This includes bpf2bpf, LPM, etc.

c) Those that serve to prevent excessive resource usage.  This
includes the bounded loop code (I think -- this might also be category
(b)) and several explicit checks for program size.

I suppose that the speculative execution mitigation stuff could be
split out from (b) into its own category.

Based on this, how about a different division: /dev/bpf_dangerous for
(b) and /dev/bpf_resources for (c)?  And (a) is dealt with by
gradually reducing the privilege needed for the important operations.
/dev/bpf_dangerous isn't ideal in the long run though, since a lot of
the operations in that category will probably become less dangerous as
the code is better vetted, and people granting "dangerous" permission
might want to restrict it to only the specific parts that they need.
/dev/bpf_lpm, /dev/bpf2bpf, etc could work better.

This type of thing actually fits quite nicely into an idea I've been
thinking about for a while called "implicit rights". In very brief
summary, there would be objects called /dev/rights/xyz, where xyz is
the same of a "right".  If there is a readable object of the right
type at the literal path "/dev/rights/xyz", then you have right xyz.
There's a bit more flexibility on top of this.  BPF could use
/dev/rights/bpf/maptypes/lpm and
/dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
use cases include a biggie:
/dev/rights/namespace/create_unprivileged_userns.
/dev/rights/bind_port/80 would be nice, too.

I may have a bit of time over the next few days to actually prototype
this.  Would you be interested in using this for BPF?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-07  5:24                                                       ` Andy Lutomirski
@ 2019-08-07  9:03                                                         ` Lorenz Bauer
  2019-08-07 13:52                                                           ` Andy Lutomirski
  2019-08-13 21:58                                                         ` Alexei Starovoitov
  2019-08-22 14:17                                                         ` Daniel Borkmann
  2 siblings, 1 reply; 61+ messages in thread
From: Lorenz Bauer @ 2019-08-07  9:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Jann Horn,
	Greg KH, Linux API, LSM List

On Wed, 7 Aug 2019 at 06:24, Andy Lutomirski <luto@kernel.org> wrote:
> a) Those that, by design, control privileged operations.  This
> includes most attach calls, but it also includes allow_ptr_leaks,
> bpf_probe_read(), and quite a few other things.  It also includes all
> of the by_id calls, I think, unless some clever modification to the
> way they worked would isolate different users' objects.  I think that
> persistent objects can do pretty much everything that by_id users
> would need, so this isn't a big deal.

Slightly OT, since this is an implementation question: GET_MAP_FD_BY_ID
is useful to iterate a nested map. This isn't covered by rights to
persistent objects,
so it would need some thought.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-07  9:03                                                         ` Lorenz Bauer
@ 2019-08-07 13:52                                                           ` Andy Lutomirski
  0 siblings, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-07 13:52 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Jann Horn, Greg KH, Linux API, LSM List


> On Aug 7, 2019, at 2:03 AM, Lorenz Bauer <lmb@cloudflare.com> wrote:
> 
>> On Wed, 7 Aug 2019 at 06:24, Andy Lutomirski <luto@kernel.org> wrote:
>> a) Those that, by design, control privileged operations.  This
>> includes most attach calls, but it also includes allow_ptr_leaks,
>> bpf_probe_read(), and quite a few other things.  It also includes all
>> of the by_id calls, I think, unless some clever modification to the
>> way they worked would isolate different users' objects.  I think that
>> persistent objects can do pretty much everything that by_id users
>> would need, so this isn't a big deal.
> 
> Slightly OT, since this is an implementation question: GET_MAP_FD_BY_ID
> is useful to iterate a nested map. This isn't covered by rights to
> persistent objects,
> so it would need some thought.
> 
> 

A call to get an fd to a map referenced by a map to which you already have an fd seems reasonable to me. The new fd would inherit the old fd’s access mode.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-07  5:24                                                       ` Andy Lutomirski
  2019-08-07  9:03                                                         ` Lorenz Bauer
@ 2019-08-13 21:58                                                         ` Alexei Starovoitov
  2019-08-13 22:26                                                           ` Daniel Colascione
                                                                             ` (2 more replies)
  2019-08-22 14:17                                                         ` Daniel Borkmann
  2 siblings, 3 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-13 21:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> >
> > Inside containers and inside nested containers we need to start processes
> > that will use bpf. All of the processes are trusted.
> 
> Trusted by whom?  In a non-nested container, the container manager
> *might* be trusted by the outside world.  In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted.  I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.

agree that containers (namespaces) reduce amount of trust necessary
for apps to run, but the end goal is not security though.
Linux has become a single user system.
If user can ssh into the host they can become root.
If arbitrary code can run on the host it will be break out of any sandbox.
Containers are not providing the level of security that is enough
to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
Containers are used to make production systems safer.
Some people call it more 'secure', but it's clearly not secure for
arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
It's been a constant source of pain. The constant blinding, randomization,
verifier speculative analysis, all spectre v1, v2, v4 mitigations
are simply not worth it. It's a lot of complex kernel code without users.
There is not a single use case to allow arbitrary malicious bpf
program to be loaded and executed.
As soon as we have /dev/bpf to allow all of bpf to be used without root
we will set sysctl kernel.unprivileged_bpf_disabled=1
Hence I prefer this /dev/bpf mechanism to be as simple a possible.
The applications that will use it are going to be just as trusted as systemd.

> > To solve your concern of bypassing all capable checks...
> > How about we do /dev/bpf/full_verifier first?
> > It will replace capable() checks in the verifier only.
> 
> I'm not convinced that "in the verifier" is the right distinction.
> Telling administrators that some setting lets certain users bypass
> bpf() verifier checks doesn't have a clear enough meaning.  

linux is a single user system. there are no administrators any more.
No doubt, folks will disagree, but that game is over.
At least on bpf side it's done.

> I propose,
> instead, that the current capable() checks be divided into three
> categories:

I don't see a use case for these categories.
All bpf programs extend the kernel in some way.
The kernel vs user is one category.
Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
When application has CAP_NET_ADMIN it covers all of networking knobs.
There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
Similarly CAP_BPF as the only knob is enough.
The only disadvantage of CAP_BPF is that it's not possible to
pass it from one systemd-like daemon to another systemd-like daemon.
Hence /dev/bpf idea and passing file descriptor.

> This type of thing actually fits quite nicely into an idea I've been
> thinking about for a while called "implicit rights". In very brief
> summary, there would be objects called /dev/rights/xyz, where xyz is
> the same of a "right".  If there is a readable object of the right
> type at the literal path "/dev/rights/xyz", then you have right xyz.
> There's a bit more flexibility on top of this.  BPF could use
> /dev/rights/bpf/maptypes/lpm and
> /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> use cases include a biggie:
> /dev/rights/namespace/create_unprivileged_userns.
> /dev/rights/bind_port/80 would be nice, too.

The concept of "implicit rights" is very nice and I'm sure it will
be a good fit somewhere, but I don't see why use it in bpf space.
There is no use case for fine grain partition of bpf features.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-13 21:58                                                         ` Alexei Starovoitov
@ 2019-08-13 22:26                                                           ` Daniel Colascione
  2019-08-13 23:24                                                             ` Andy Lutomirski
  2019-08-13 23:06                                                           ` Andy Lutomirski
  2019-08-15 19:46                                                           ` Kees Cook
  2 siblings, 1 reply; 61+ messages in thread
From: Daniel Colascione @ 2019-08-13 22:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > >
> > > Inside containers and inside nested containers we need to start processes
> > > that will use bpf. All of the processes are trusted.
> >
> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.
> Linux has become a single user system.
> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will be break out of any sandbox.
> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> Containers are used to make production systems safer.
> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.
> There is not a single use case to allow arbitrary malicious bpf
> program to be loaded and executed.
> As soon as we have /dev/bpf to allow all of bpf to be used without root
> we will set sysctl kernel.unprivileged_bpf_disabled=1
> Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> The applications that will use it are going to be just as trusted as systemd.
>
> > > To solve your concern of bypassing all capable checks...
> > > How about we do /dev/bpf/full_verifier first?
> > > It will replace capable() checks in the verifier only.
> >
> > I'm not convinced that "in the verifier" is the right distinction.
> > Telling administrators that some setting lets certain users bypass
> > bpf() verifier checks doesn't have a clear enough meaning.
>
> linux is a single user system. there are no administrators any more.
> No doubt, folks will disagree, but that game is over.
> At least on bpf side it's done.
>
> > I propose,
> > instead, that the current capable() checks be divided into three
> > categories:
>
> I don't see a use case for these categories.
> All bpf programs extend the kernel in some way.
> The kernel vs user is one category.
> Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> When application has CAP_NET_ADMIN it covers all of networking knobs.
> There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> Similarly CAP_BPF as the only knob is enough.
> The only disadvantage of CAP_BPF is that it's not possible to
> pass it from one systemd-like daemon to another systemd-like daemon.
> Hence /dev/bpf idea and passing file descriptor.
>
> > This type of thing actually fits quite nicely into an idea I've been
> > thinking about for a while called "implicit rights". In very brief
> > summary, there would be objects called /dev/rights/xyz, where xyz is
> > the same of a "right".  If there is a readable object of the right
> > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > There's a bit more flexibility on top of this.  BPF could use
> > /dev/rights/bpf/maptypes/lpm and
> > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > use cases include a biggie:
> > /dev/rights/namespace/create_unprivileged_userns.
> > /dev/rights/bind_port/80 would be nice, too.
>
> The concept of "implicit rights" is very nice and I'm sure it will
> be a good fit somewhere, but I don't see why use it in bpf space.
> There is no use case for fine grain partition of bpf features.

Isn't this "implicit rights" model just another kind of ambient
authority --- one that constrains the otherwise-free filesystem
namespace to boot? IMHO, the kernel should be moving toward explicit
authorization tokens modeled by file descriptors and away from
contextual authorization decisions.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-13 21:58                                                         ` Alexei Starovoitov
  2019-08-13 22:26                                                           ` Daniel Colascione
@ 2019-08-13 23:06                                                           ` Andy Lutomirski
  2019-08-14  0:57                                                             ` Alexei Starovoitov
  2019-08-15 19:46                                                           ` Kees Cook
  2 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-13 23:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > >
> > > Inside containers and inside nested containers we need to start processes
> > > that will use bpf. All of the processes are trusted.
> >
> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.
> Linux has become a single user system.
> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will be break out of any sandbox.

I would argue that this is a reasonable assumption to make if you're
designing a system using Linux, but it's not a valid assumption to
make as kernel developers.  Otherwise we should just give everyone
CAP_SYS_ADMIN and call it a day.  There really is a difference between
root and non-root.

> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> Containers are used to make production systems safer.
> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.

Seccomp really will want eBPF some day, and it should work without
privilege.  Maybe it should be a restricted subset of eBPF, and
Spectre will always be an issue until dramatically better hardware
shows up, but I think people will want the ability for regular
programs to load eBPF seccomp programs.

> Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> The applications that will use it are going to be just as trusted as systemd.

I still don't understand your systemd example.  systemd --users is not
trusted systemwide in any respect.  The main PID 1 systemd is root.
No matter how you dice it, granting a user systemd instance extra bpf
access is tantamount to granting the user extra bpf access in general.

It sounds to me like you're thinking of eBPF as a feature a bit like
unprivileged user namespaces: *in principle*, it's supposed to be safe
to give any unprivileged process the ability to use it, and you
consider security flaws in it to be bugs worth fixing.  But you think
it's a large attack surface and that most unprivileged programs
shouldn't be allowed to use it.  Is that reasonable?


>
> > > To solve your concern of bypassing all capable checks...
> > > How about we do /dev/bpf/full_verifier first?
> > > It will replace capable() checks in the verifier only.
> >
> > I'm not convinced that "in the verifier" is the right distinction.
> > Telling administrators that some setting lets certain users bypass
> > bpf() verifier checks doesn't have a clear enough meaning.
>
> linux is a single user system. there are no administrators any more.
> No doubt, folks will disagree, but that game is over.
> At least on bpf side it's done.
>
> > I propose,
> > instead, that the current capable() checks be divided into three
> > categories:
>
> I don't see a use case for these categories.
> All bpf programs extend the kernel in some way.
> The kernel vs user is one category.
> Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> When application has CAP_NET_ADMIN it covers all of networking knobs.
> There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> Similarly CAP_BPF as the only knob is enough.
> The only disadvantage of CAP_BPF is that it's not possible to
> pass it from one systemd-like daemon to another systemd-like daemon.
> Hence /dev/bpf idea and passing file descriptor.
>
> > This type of thing actually fits quite nicely into an idea I've been
> > thinking about for a while called "implicit rights". In very brief
> > summary, there would be objects called /dev/rights/xyz, where xyz is
> > the same of a "right".  If there is a readable object of the right
> > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > There's a bit more flexibility on top of this.  BPF could use
> > /dev/rights/bpf/maptypes/lpm and
> > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > use cases include a biggie:
> > /dev/rights/namespace/create_unprivileged_userns.
> > /dev/rights/bind_port/80 would be nice, too.
>
> The concept of "implicit rights" is very nice and I'm sure it will
> be a good fit somewhere, but I don't see why use it in bpf space.
> There is no use case for fine grain partition of bpf features.
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-13 22:26                                                           ` Daniel Colascione
@ 2019-08-13 23:24                                                             ` Andy Lutomirski
  0 siblings, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-13 23:24 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Alexei Starovoitov, Andy Lutomirski, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Tue, Aug 13, 2019 at 3:27 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > > >
> > > > Inside containers and inside nested containers we need to start processes
> > > > that will use bpf. All of the processes are trusted.
> > >
> > > Trusted by whom?  In a non-nested container, the container manager
> > > *might* be trusted by the outside world.  In a *nested* container,
> > > unless the inner container management is controlled from outside the
> > > outer container, it's not trusted.  I don't know much about how
> > > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > > very strongly toward user namespaces and maximally-untrusted
> > > containers, and I think bpf() should work in that context.
> >
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> > Linux has become a single user system.
> > If user can ssh into the host they can become root.
> > If arbitrary code can run on the host it will be break out of any sandbox.
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > Containers are used to make production systems safer.
> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> > There is not a single use case to allow arbitrary malicious bpf
> > program to be loaded and executed.
> > As soon as we have /dev/bpf to allow all of bpf to be used without root
> > we will set sysctl kernel.unprivileged_bpf_disabled=1
> > Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> > The applications that will use it are going to be just as trusted as systemd.
> >
> > > > To solve your concern of bypassing all capable checks...
> > > > How about we do /dev/bpf/full_verifier first?
> > > > It will replace capable() checks in the verifier only.
> > >
> > > I'm not convinced that "in the verifier" is the right distinction.
> > > Telling administrators that some setting lets certain users bypass
> > > bpf() verifier checks doesn't have a clear enough meaning.
> >
> > linux is a single user system. there are no administrators any more.
> > No doubt, folks will disagree, but that game is over.
> > At least on bpf side it's done.
> >
> > > I propose,
> > > instead, that the current capable() checks be divided into three
> > > categories:
> >
> > I don't see a use case for these categories.
> > All bpf programs extend the kernel in some way.
> > The kernel vs user is one category.
> > Conceptually CAP_BPF is enough. It would be similar to CAP_NET_ADMIN.
> > When application has CAP_NET_ADMIN it covers all of networking knobs.
> > There is no use case that would warrant fine grain CAP_ROUTE_ADMIN,
> > CAP_ETHTOOL_ADMIN, CAP_ETH0_ADMIN, etc.
> > Similarly CAP_BPF as the only knob is enough.
> > The only disadvantage of CAP_BPF is that it's not possible to
> > pass it from one systemd-like daemon to another systemd-like daemon.
> > Hence /dev/bpf idea and passing file descriptor.
> >
> > > This type of thing actually fits quite nicely into an idea I've been
> > > thinking about for a while called "implicit rights". In very brief
> > > summary, there would be objects called /dev/rights/xyz, where xyz is
> > > the same of a "right".  If there is a readable object of the right
> > > type at the literal path "/dev/rights/xyz", then you have right xyz.
> > > There's a bit more flexibility on top of this.  BPF could use
> > > /dev/rights/bpf/maptypes/lpm and
> > > /dev/rights/bpf/verifier/bounded_loops, for example.  Other non-BPF
> > > use cases include a biggie:
> > > /dev/rights/namespace/create_unprivileged_userns.
> > > /dev/rights/bind_port/80 would be nice, too.
> >
> > The concept of "implicit rights" is very nice and I'm sure it will
> > be a good fit somewhere, but I don't see why use it in bpf space.
> > There is no use case for fine grain partition of bpf features.
>
> Isn't this "implicit rights" model just another kind of ambient
> authority --- one that constrains the otherwise-free filesystem
> namespace to boot?

Yes.

> IMHO, the kernel should be moving toward explicit
> authorization tokens modeled by file descriptors and away from
> contextual authorization decisions.

And yes, I agree there too. Here's how I think about it: there are
really two layers here:

Rights: these are objects like /dev/rights/bpf/some_bpf_privilege or
/dev/rights/namespace/unpriv_userns, and you would, ideally, use them
like genuine capabilities.  You'd pass an fd with appropriate access
(FMODE_READ, presumably, since exec is awkward to work with for fds)
into bpf() or similar, and the kernel would say "yep, caller has the
capability" and do something.  There's nothing really restricting them
to /dev/rights, but they more or less have to live on a memory-backed
file system (a real backing store has all kinds of issues), and
putting them in /dev gets a lot of nifty properties for free.  For
example, existing container systems that don't know about them will
automatically deny them to containers, since nothing with an ounce of
sense passes all of /dev through to a container.  But container
systems that are aware of them can bind-mount them into the container.
And /dev is already known to be magical due to things like
/dev/urandom.

The implicit part on top is less than ideal, but it solves two problems:

1. It keeps compatibility with existing code.  There are programs that
expect unshare(CLONE_NEWUSER) to work -- with *implicit* rights, it
will work exactly when it's supposed to.  Also, for cases like
CLONE_NEWUSER, it does have more or less the right semantics -- if
they were explicit, most programs would just try to open
/dev/rights/namespace/unpriv_userns and pass the fd to unshare2, so
we're not losing much.

2. For things like eBPF where the set of rights could be a moving
target, implicit rights lets the model evolve without breaking
userspace.  So if LPM maps eventually become bulletproof and a right
is no longer needed, it still works.  Or if some feature in the
verifier that is currently unrestricted were subsequently deemed to
need restrictions, they could be added without retrofitting all the
users.

There are cases where implicit rights would be totally inappropriate.
For example, a CAP_DAC_READ_SEARCH right could not be safely made
implicit.  In general, I think the implicit model works for system
calls where it's unambiguous what the caller wants to have happen and
there, depending on privilege level, it either works or it doesn't.
So, for accessing a filesystem, it's not at all obvious whether a
program is accessing it on its own behalf or on a client's behalf, and
privilege usage should be explicit.  For something like "don't
Spectre-mitigate this eBPF program", the semantics change and the
request should IMO be explicit.  For for "create an LPM map", I don't
see how a confused deputy is likely, and an implicit right seems
reasonable.  Similarly, for creating a namespace or binding a network
port, confused deputies seem unlikely.  (For connecting to a network
address, if such a thing were ever restricted, confused deputies are
definitely possible and happen all the time, e.g. under a DNS
rebinding attack.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-13 23:06                                                           ` Andy Lutomirski
@ 2019-08-14  0:57                                                             ` Alexei Starovoitov
  2019-08-14 17:51                                                               ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-14  0:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Tue, Aug 13, 2019 at 04:06:00PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 13, 2019 at 2:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 10:24:25PM -0700, Andy Lutomirski wrote:
> > > >
> > > > Inside containers and inside nested containers we need to start processes
> > > > that will use bpf. All of the processes are trusted.
> > >
> > > Trusted by whom?  In a non-nested container, the container manager
> > > *might* be trusted by the outside world.  In a *nested* container,
> > > unless the inner container management is controlled from outside the
> > > outer container, it's not trusted.  I don't know much about how
> > > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > > very strongly toward user namespaces and maximally-untrusted
> > > containers, and I think bpf() should work in that context.
> >
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> > Linux has become a single user system.
> > If user can ssh into the host they can become root.
> > If arbitrary code can run on the host it will be break out of any sandbox.
> 
> I would argue that this is a reasonable assumption to make if you're
> designing a system using Linux, but it's not a valid assumption to
> make as kernel developers.  Otherwise we should just give everyone
> CAP_SYS_ADMIN and call it a day.  There really is a difference between
> root and non-root.

hmm. No. Kernel developers should not make any assumptions.
They should guide their design by real use cases instead. That includes studing
what people do now and hacks they use to workaround lack of interfaces.
Effecitvely bpf is root only. There are no unpriv users.
This root applications go out of their way to reduce privileges
while they still want to use bpf. That is the need that /dev/bpf is solving.

> 
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > Containers are used to make production systems safer.
> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> 
> Seccomp really will want eBPF some day, and it should work without
> privilege.  Maybe it should be a restricted subset of eBPF, and
> Spectre will always be an issue until dramatically better hardware
> shows up, but I think people will want the ability for regular
> programs to load eBPF seccomp programs.

I'm absolutely against using eBPF in seccomp.
Precisely due to discussions like the current one.

> 
> > Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> > The applications that will use it are going to be just as trusted as systemd.
> 
> I still don't understand your systemd example.  systemd --users is not
> trusted systemwide in any respect.  The main PID 1 systemd is root.
> No matter how you dice it, granting a user systemd instance extra bpf
> access is tantamount to granting the user extra bpf access in general.

People use systemd --user while their kernel have 'undef CONFIG_USER_NS'.

> It sounds to me like you're thinking of eBPF as a feature a bit like
> unprivileged user namespaces: *in principle*, it's supposed to be safe
> to give any unprivileged process the ability to use it, and you
> consider security flaws in it to be bugs worth fixing. But you think
> it's a large attack surface and that most unprivileged programs
> shouldn't be allowed to use it.  Is that reasonable?

I think there should be no unprivileged bpf at all,
because over all these years we've seen zero use cases.
Hence all new features are root only.
LPM map is a prime example. There was not a single security bug in there.
There were few functional bugs, but not security issues.
These bugs didn't crash the kernel and didn't expose any data.
Yet we still keep LPM as root only.
Can we flip the switch and make it non-root? It's trivial single line patch ?
and security risk is very low?
Nope, since it will not address the underlying issue.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14  0:57                                                             ` Alexei Starovoitov
@ 2019-08-14 17:51                                                               ` Andy Lutomirski
  2019-08-14 22:05                                                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-14 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Colascione
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Tue, Aug 13, 2019 at 5:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:

> hmm. No. Kernel developers should not make any assumptions.
> They should guide their design by real use cases instead. That includes studing
> what people do now and hacks they use to workaround lack of interfaces.
> Effecitvely bpf is root only. There are no unpriv users.
> This root applications go out of their way to reduce privileges
> while they still want to use bpf. That is the need that /dev/bpf is solving.
>
> >
> > > Containers are not providing the level of security that is enough
> > > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> > > Containers are used to make production systems safer.
> > > Some people call it more 'secure', but it's clearly not secure for
> > > arbitrary code and that is what kernel.unprivileged_bpf_disabled allows.
> > > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > > It's been a constant source of pain. The constant blinding, randomization,
> > > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > > are simply not worth it. It's a lot of complex kernel code without users.
> >
> > Seccomp really will want eBPF some day, and it should work without
> > privilege.  Maybe it should be a restricted subset of eBPF, and
> > Spectre will always be an issue until dramatically better hardware
> > shows up, but I think people will want the ability for regular
> > programs to load eBPF seccomp programs.
>
> I'm absolutely against using eBPF in seccomp.
> Precisely due to discussions like the current one.

I still think I don't really agree with your overall premise.

If eBPF is genuinely not usable by programs that are not fully trusted
by the admin, then no kernel changes at all are needed.  Programs that
want to reduce their own privileges can easily fork() a privileged
subprocess or run a little helper to which they delegate BPF
operations.  This is far more flexible than anything that will ever be
in the kernel because it allows the helper to verify that the rest of
the program is doing exactly what it's supposed to and restrict eBPF
operations to exactly the subset that is needed.  So a container
manager or network manager that drops some provilege could have a
little bpf-helper that manages its BPF XDP, firewalling, etc
configuration.  The two processes would talk over a socketpair.

The interesting cases you're talking about really *do* involved
unprivileged or less privileged eBPF, though.  Let's see:

systemd --user: systemd --user *is not privileged at all*.  There's no
issue of reducing privilege, since systemd --user doesn't have any
privilege to begin with.  But systemd supports some eBPF features, and
presumably it would like to support them in the systemd --user case.
This is unprivileged eBPF.

Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
it doesn't involve the bpf() syscall.  There are some seccomp
proposals in the works that will want some stuff from eBPF.  In
particular, the ability to call seccomp-specific bpf functions from a
seccomp program could be very nice. Similarly, the ability to use the
enhanced instruction set and maybe even *read* maps would be nice.  I
do think that seccomp will continue to want its programs to be
stateless.

So it's a bit of a chicken-and-egg situation.  There aren't major
unprivileged eBPF users because the kernel support isn't there.

>
> >
> > > Hence I prefer this /dev/bpf mechanism to be as simple a possible.
> > > The applications that will use it are going to be just as trusted as systemd.
> >
> > I still don't understand your systemd example.  systemd --users is not
> > trusted systemwide in any respect.  The main PID 1 systemd is root.
> > No matter how you dice it, granting a user systemd instance extra bpf
> > access is tantamount to granting the user extra bpf access in general.
>
> People use systemd --user while their kernel have 'undef CONFIG_USER_NS'.

I don't know what you're getting at.  I'm typing this email in a
browser running under a systemd --user instance, and there are no user
namespaces involved.

$ ps -u luto |grep systemd
 1944 ?        00:00:02 systemd
$ stat /proc/1944
...
Access: (0555/dr-xr-xr-x)  Uid: ( 1000/    luto)   Gid: ( 1000/    luto)
Context: unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ gdb -p 1944
[snipped tons of output, but gdb works fine like this]

systemd --user is not privileged.  Giving it /dev/bpf as imagined by
the current set of patches would be a gaping security hole.

>
> I think there should be no unprivileged bpf at all,
> because over all these years we've seen zero use cases.
> Hence all new features are root only.

You're the maintainer.  If you feel this way, then I think you should
just drop the /dev/bpf idea entirely and have userspace manage all of
this by itself.  It will remain extremely awkward for containers and
especially nested containers to use eBPF.

--Andy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 17:51                                                               ` Andy Lutomirski
@ 2019-08-14 22:05                                                                 ` Alexei Starovoitov
  2019-08-14 22:30                                                                   ` Andy Lutomirski
  2019-08-15 11:24                                                                   ` Jordan Glover
  0 siblings, 2 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-14 22:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> 
> If eBPF is genuinely not usable by programs that are not fully trusted
> by the admin, then no kernel changes at all are needed.  Programs that
> want to reduce their own privileges can easily fork() a privileged
> subprocess or run a little helper to which they delegate BPF
> operations.  This is far more flexible than anything that will ever be
> in the kernel because it allows the helper to verify that the rest of
> the program is doing exactly what it's supposed to and restrict eBPF
> operations to exactly the subset that is needed.  So a container
> manager or network manager that drops some provilege could have a
> little bpf-helper that manages its BPF XDP, firewalling, etc
> configuration.  The two processes would talk over a socketpair.

there were three projects that tried to delegate bpf operations.
All of them failed.
bpf operational workflow is much more complex than you're imagining.
fork() also doesn't work for all cases.
I gave this example before: consider multiple systemd-like deamons
that need to do bpf operations that want to pass this 'bpf capability'
to other deamons written by other teams. Some of them will start
non-root, but still need to do bpf. They will be rpm installed
and live upgraded while running.
We considered to make systemd such centralized bpf delegation
authority too. It didn't work. bpf in kernel grows quickly.
libbpf part grows independently. llvm keeps evolving.
All of them are being changed while system overall has to stay
operational. Centralized approach breaks apart.

> The interesting cases you're talking about really *do* involved
> unprivileged or less privileged eBPF, though.  Let's see:
> 
> systemd --user: systemd --user *is not privileged at all*.  There's no
> issue of reducing privilege, since systemd --user doesn't have any
> privilege to begin with.  But systemd supports some eBPF features, and
> presumably it would like to support them in the systemd --user case.
> This is unprivileged eBPF.

Let's disambiguate the terminology.
This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
I think that was a mistake.
Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
This is not unprivileged.
'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.

There is a huge difference between the two.
I'm against extending 'unprivileged bpf' even a bit more than what it is
today for many reasons mentioned earlier.
The /dev/bpf is about 'less privileged'.
Less privileged than root. We need to split part of full root capability
into bpf capability. So that most of the root can be dropped.
This is very similar to what cap_net_admin does.
cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
cap_net_admin is very much privileged. Just 'less privileged' than root.
Same thing for cap_bpf.

May be we should do both cap_bpf and /dev/bpf to make it clear that
this is the same thing. Two interfaces to achieve the same result.

> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
> it doesn't involve the bpf() syscall.  There are some seccomp
> proposals in the works that will want some stuff from eBPF.  In

I'm afraid these proposals won't go anywhere.

> So it's a bit of a chicken-and-egg situation.  There aren't major
> unprivileged eBPF users because the kernel support isn't there.

As I said before there are zero known use cases of 'unprivileged bpf'.

If I understand you correctly you're refusing to accept that
'less privileged bpf' is a valid use case while pushing for extending
scope of 'unprivileged'.
Extending the scope is an orthogonal discussion. Currently I'm
opposed to that. Whereas 'less privileged' is what people require.

> It will remain extremely awkward for containers and
> especially nested containers to use eBPF.

I'm afraid we have to agree to disagree here and move on.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 22:05                                                                 ` Alexei Starovoitov
@ 2019-08-14 22:30                                                                   ` Andy Lutomirski
  2019-08-14 23:33                                                                     ` Alexei Starovoitov
  2019-08-15 11:24                                                                   ` Jordan Glover
  1 sibling, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-14 22:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List



> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>> 
>> If eBPF is genuinely not usable by programs that are not fully trusted
>> by the admin, then no kernel changes at all are needed.  Programs that
>> want to reduce their own privileges can easily fork() a privileged
>> subprocess or run a little helper to which they delegate BPF
>> operations.  This is far more flexible than anything that will ever be
>> in the kernel because it allows the helper to verify that the rest of
>> the program is doing exactly what it's supposed to and restrict eBPF
>> operations to exactly the subset that is needed.  So a container
>> manager or network manager that drops some provilege could have a
>> little bpf-helper that manages its BPF XDP, firewalling, etc
>> configuration.  The two processes would talk over a socketpair.
> 
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
> 
>> The interesting cases you're talking about really *do* involved
>> unprivileged or less privileged eBPF, though.  Let's see:
>> 
>> systemd --user: systemd --user *is not privileged at all*.  There's no
>> issue of reducing privilege, since systemd --user doesn't have any
>> privilege to begin with.  But systemd supports some eBPF features, and
>> presumably it would like to support them in the systemd --user case.
>> This is unprivileged eBPF.
> 
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> 
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.

The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

> 
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.

What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

> 
>> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
>> it doesn't involve the bpf() syscall.  There are some seccomp
>> proposals in the works that will want some stuff from eBPF.  In
> 
> I'm afraid these proposals won't go anywhere.

Can you explain why?

> 
>> So it's a bit of a chicken-and-egg situation.  There aren't major
>> unprivileged eBPF users because the kernel support isn't there.
> 
> As I said before there are zero known use cases of 'unprivileged bpf'.
> 
> If I understand you correctly you're refusing to accept that
> 'less privileged bpf' is a valid use case while pushing for extending
> scope of 'unprivileged'.

No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 22:30                                                                   ` Andy Lutomirski
@ 2019-08-14 23:33                                                                     ` Alexei Starovoitov
  2019-08-14 23:59                                                                       ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-14 23:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >> 
> >> If eBPF is genuinely not usable by programs that are not fully trusted
> >> by the admin, then no kernel changes at all are needed.  Programs that
> >> want to reduce their own privileges can easily fork() a privileged
> >> subprocess or run a little helper to which they delegate BPF
> >> operations.  This is far more flexible than anything that will ever be
> >> in the kernel because it allows the helper to verify that the rest of
> >> the program is doing exactly what it's supposed to and restrict eBPF
> >> operations to exactly the subset that is needed.  So a container
> >> manager or network manager that drops some provilege could have a
> >> little bpf-helper that manages its BPF XDP, firewalling, etc
> >> configuration.  The two processes would talk over a socketpair.
> > 
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> > 
> >> The interesting cases you're talking about really *do* involved
> >> unprivileged or less privileged eBPF, though.  Let's see:
> >> 
> >> systemd --user: systemd --user *is not privileged at all*.  There's no
> >> issue of reducing privilege, since systemd --user doesn't have any
> >> privilege to begin with.  But systemd supports some eBPF features, and
> >> presumably it would like to support them in the systemd --user case.
> >> This is unprivileged eBPF.
> > 
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > 
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> 
> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

Initially I agreed that it's probably too broad, but then realized
that they're perfect as-is. There is no need to partition further.

> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> 
> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

Indeed, ambient capabilities should work for all cases.

> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

There are not that many bits left. I prefer to consume single CAP_BPF bit.
All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
This is no-brainer.

The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
should be extended to CAP_BPF or not.
imo devmap and xskmap can stay CAP_NET_ADMIN,
but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
Initially cgroup-bpf hooks were limited to networking.
It's no longer the case. Requiring NET_ADMIN there make little sense now.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 23:33                                                                     ` Alexei Starovoitov
@ 2019-08-14 23:59                                                                       ` Andy Lutomirski
  2019-08-15  0:36                                                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-14 23:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List



> On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
>> 
>> 
>>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 
>>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>>>> 
>>>> If eBPF is genuinely not usable by programs that are not fully trusted
>>>> by the admin, then no kernel changes at all are needed.  Programs that
>>>> want to reduce their own privileges can easily fork() a privileged
>>>> subprocess or run a little helper to which they delegate BPF
>>>> operations.  This is far more flexible than anything that will ever be
>>>> in the kernel because it allows the helper to verify that the rest of
>>>> the program is doing exactly what it's supposed to and restrict eBPF
>>>> operations to exactly the subset that is needed.  So a container
>>>> manager or network manager that drops some provilege could have a
>>>> little bpf-helper that manages its BPF XDP, firewalling, etc
>>>> configuration.  The two processes would talk over a socketpair.
>>> 
>>> there were three projects that tried to delegate bpf operations.
>>> All of them failed.
>>> bpf operational workflow is much more complex than you're imagining.
>>> fork() also doesn't work for all cases.
>>> I gave this example before: consider multiple systemd-like deamons
>>> that need to do bpf operations that want to pass this 'bpf capability'
>>> to other deamons written by other teams. Some of them will start
>>> non-root, but still need to do bpf. They will be rpm installed
>>> and live upgraded while running.
>>> We considered to make systemd such centralized bpf delegation
>>> authority too. It didn't work. bpf in kernel grows quickly.
>>> libbpf part grows independently. llvm keeps evolving.
>>> All of them are being changed while system overall has to stay
>>> operational. Centralized approach breaks apart.
>>> 
>>>> The interesting cases you're talking about really *do* involved
>>>> unprivileged or less privileged eBPF, though.  Let's see:
>>>> 
>>>> systemd --user: systemd --user *is not privileged at all*.  There's no
>>>> issue of reducing privilege, since systemd --user doesn't have any
>>>> privilege to begin with.  But systemd supports some eBPF features, and
>>>> presumably it would like to support them in the systemd --user case.
>>>> This is unprivileged eBPF.
>>> 
>>> Let's disambiguate the terminology.
>>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
>>> I think that was a mistake.
>>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
>>> This is not unprivileged.
>>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>>> 
>>> There is a huge difference between the two.
>>> I'm against extending 'unprivileged bpf' even a bit more than what it is
>>> today for many reasons mentioned earlier.
>>> The /dev/bpf is about 'less privileged'.
>>> Less privileged than root. We need to split part of full root capability
>>> into bpf capability. So that most of the root can be dropped.
>>> This is very similar to what cap_net_admin does.
>>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
>>> cap_net_admin is very much privileged. Just 'less privileged' than root.
>>> Same thing for cap_bpf.
>> 
>> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> 
> Initially I agreed that it's probably too broad, but then realized
> that they're perfect as-is. There is no need to partition further.
> 
>>> May be we should do both cap_bpf and /dev/bpf to make it clear that
>>> this is the same thing. Two interfaces to achieve the same result.
>> 
>> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> 
> Indeed, ambient capabilities should work for all cases.
> 
>> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> 
> There are not that many bits left. I prefer to consume single CAP_BPF bit.
> All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> This is no-brainer.
> 
> The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> should be extended to CAP_BPF or not.
> imo devmap and xskmap can stay CAP_NET_ADMIN,
> but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> Initially cgroup-bpf hooks were limited to networking.
> It's no longer the case. Requiring NET_ADMIN there make little sense now.
> 

Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar.  The rest could plausibly be hardened in the future, although the by_id stuff may be tricky too.

Do new programs really need the by_id calls?  It could make sense to leave those unchanged and to have new programs use persistent maps instead.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 23:59                                                                       ` Andy Lutomirski
@ 2019-08-15  0:36                                                                         ` Alexei Starovoitov
  0 siblings, 0 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-15  0:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Wed, Aug 14, 2019 at 04:59:18PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> >> 
> >> 
> >>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> 
> >>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >>>> 
> >>>> If eBPF is genuinely not usable by programs that are not fully trusted
> >>>> by the admin, then no kernel changes at all are needed.  Programs that
> >>>> want to reduce their own privileges can easily fork() a privileged
> >>>> subprocess or run a little helper to which they delegate BPF
> >>>> operations.  This is far more flexible than anything that will ever be
> >>>> in the kernel because it allows the helper to verify that the rest of
> >>>> the program is doing exactly what it's supposed to and restrict eBPF
> >>>> operations to exactly the subset that is needed.  So a container
> >>>> manager or network manager that drops some provilege could have a
> >>>> little bpf-helper that manages its BPF XDP, firewalling, etc
> >>>> configuration.  The two processes would talk over a socketpair.
> >>> 
> >>> there were three projects that tried to delegate bpf operations.
> >>> All of them failed.
> >>> bpf operational workflow is much more complex than you're imagining.
> >>> fork() also doesn't work for all cases.
> >>> I gave this example before: consider multiple systemd-like deamons
> >>> that need to do bpf operations that want to pass this 'bpf capability'
> >>> to other deamons written by other teams. Some of them will start
> >>> non-root, but still need to do bpf. They will be rpm installed
> >>> and live upgraded while running.
> >>> We considered to make systemd such centralized bpf delegation
> >>> authority too. It didn't work. bpf in kernel grows quickly.
> >>> libbpf part grows independently. llvm keeps evolving.
> >>> All of them are being changed while system overall has to stay
> >>> operational. Centralized approach breaks apart.
> >>> 
> >>>> The interesting cases you're talking about really *do* involved
> >>>> unprivileged or less privileged eBPF, though.  Let's see:
> >>>> 
> >>>> systemd --user: systemd --user *is not privileged at all*.  There's no
> >>>> issue of reducing privilege, since systemd --user doesn't have any
> >>>> privilege to begin with.  But systemd supports some eBPF features, and
> >>>> presumably it would like to support them in the systemd --user case.
> >>>> This is unprivileged eBPF.
> >>> 
> >>> Let's disambiguate the terminology.
> >>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> >>> I think that was a mistake.
> >>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> >>> This is not unprivileged.
> >>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >>> 
> >>> There is a huge difference between the two.
> >>> I'm against extending 'unprivileged bpf' even a bit more than what it is
> >>> today for many reasons mentioned earlier.
> >>> The /dev/bpf is about 'less privileged'.
> >>> Less privileged than root. We need to split part of full root capability
> >>> into bpf capability. So that most of the root can be dropped.
> >>> This is very similar to what cap_net_admin does.
> >>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> >>> cap_net_admin is very much privileged. Just 'less privileged' than root.
> >>> Same thing for cap_bpf.
> >> 
> >> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> > 
> > Initially I agreed that it's probably too broad, but then realized
> > that they're perfect as-is. There is no need to partition further.
> > 
> >>> May be we should do both cap_bpf and /dev/bpf to make it clear that
> >>> this is the same thing. Two interfaces to achieve the same result.
> >> 
> >> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> > 
> > Indeed, ambient capabilities should work for all cases.
> > 
> >> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> > 
> > There are not that many bits left. I prefer to consume single CAP_BPF bit.
> > All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> > This is no-brainer.
> > 
> > The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> > should be extended to CAP_BPF or not.
> > imo devmap and xskmap can stay CAP_NET_ADMIN,
> > but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> > Initially cgroup-bpf hooks were limited to networking.
> > It's no longer the case. Requiring NET_ADMIN there make little sense now.
> > 
> 
> Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar. 

'never to less trusted process' ? why do you think so?
I don't see a problem adding /dev/bpf/foo in the future and make things
more granular. There is no such use case today. Hence I don't want to
spend time and design something without clear use case in mind.

> Do new programs really need the by_id calls? 

yes. Lorenz gave an example earlier. map-in-map returns map_id.
To operate on that map by_id is needed.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-14 22:05                                                                 ` Alexei Starovoitov
  2019-08-14 22:30                                                                   ` Andy Lutomirski
@ 2019-08-15 11:24                                                                   ` Jordan Glover
  2019-08-15 17:28                                                                     ` Alexei Starovoitov
  1 sibling, 1 reply; 61+ messages in thread
From: Jordan Glover @ 2019-08-15 11:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>
> > If eBPF is genuinely not usable by programs that are not fully trusted
> > by the admin, then no kernel changes at all are needed. Programs that
> > want to reduce their own privileges can easily fork() a privileged
> > subprocess or run a little helper to which they delegate BPF
> > operations. This is far more flexible than anything that will ever be
> > in the kernel because it allows the helper to verify that the rest of
> > the program is doing exactly what it's supposed to and restrict eBPF
> > operations to exactly the subset that is needed. So a container
> > manager or network manager that drops some provilege could have a
> > little bpf-helper that manages its BPF XDP, firewalling, etc
> > configuration. The two processes would talk over a socketpair.
>
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
>
> > The interesting cases you're talking about really do involved
> > unprivileged or less privileged eBPF, though. Let's see:
> > systemd --user: systemd --user is not privileged at all. There's no
> > issue of reducing privilege, since systemd --user doesn't have any
> > privilege to begin with. But systemd supports some eBPF features, and
> > presumably it would like to support them in the systemd --user case.
> > This is unprivileged eBPF.
>
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.
>
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.
>

systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
Granting them cap_bpf is the same as granting it to every other unprivileged user
process. Also unprivileged user process can start systemd --user process with any
command they like.

Jordan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 11:24                                                                   ` Jordan Glover
@ 2019-08-15 17:28                                                                     ` Alexei Starovoitov
  2019-08-15 18:36                                                                       ` Andy Lutomirski
  2019-08-15 18:43                                                                       ` Jordan Glover
  0 siblings, 2 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-15 17:28 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >
> > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > by the admin, then no kernel changes at all are needed. Programs that
> > > want to reduce their own privileges can easily fork() a privileged
> > > subprocess or run a little helper to which they delegate BPF
> > > operations. This is far more flexible than anything that will ever be
> > > in the kernel because it allows the helper to verify that the rest of
> > > the program is doing exactly what it's supposed to and restrict eBPF
> > > operations to exactly the subset that is needed. So a container
> > > manager or network manager that drops some provilege could have a
> > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > configuration. The two processes would talk over a socketpair.
> >
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> >
> > > The interesting cases you're talking about really do involved
> > > unprivileged or less privileged eBPF, though. Let's see:
> > > systemd --user: systemd --user is not privileged at all. There's no
> > > issue of reducing privilege, since systemd --user doesn't have any
> > > privilege to begin with. But systemd supports some eBPF features, and
> > > presumably it would like to support them in the systemd --user case.
> > > This is unprivileged eBPF.
> >
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> >
> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> >
> 
> systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> Granting them cap_bpf is the same as granting it to every other unprivileged user
> process. Also unprivileged user process can start systemd --user process with any
> command they like.

systemd itself is trusted. It's the same binary whether it runs as pid=1
or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
CAP_BPF is a natural step in the same direction.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 17:28                                                                     ` Alexei Starovoitov
@ 2019-08-15 18:36                                                                       ` Andy Lutomirski
  2019-08-15 23:08                                                                         ` Alexei Starovoitov
  2019-08-15 18:43                                                                       ` Jordan Glover
  1 sibling, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-15 18:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.
>

I have the feeling that we're somehow speaking different languages.
What, precisely, do you mean when you say "systemd itself is trusted"?
 Do you mean "the administrator trusts that the /lib/systemd/systemd
binary is not malicious"?  Do you mean "the administrator trusts that
the running systemd process is not malicious"?

On a regular Linux desktop or server box, passing CAP_NET_ADMIN, your
envisioned CAP_BPF, or /dev/bpf as in this patchset through to a
systemd --user binary would be a gaping security hole.  You are
welcome to do it on your own systemd, but if a distro did it, it would
be a major error.

If you want IPAddressDeny= to work in a user systemd unit (i.e.
/etc/systemd/user/*), then I think you have two choices.  You could
have an API by which systemd --user can ask a privileged helper to
assist (which has all the challenges you mentioned but is definitely
*possible*) or the kernel bpf() interfaces need to be designed so
that, in the absence of kernel bugs, they are safe to use from an
unprivileged process.  By "safe", I mean "would not expose the system
to attack if the kernel's implementation of the bpf() ABI were
perfect".

My suggestions upthread for incrementally making bpf() depend less on
privilege would accomplish this goal.  It would be entirely reasonable
to say that, even with those changes, bpf() is still a large attack
surface and access to it should be restricted, and having a capability
or other mechanism to explicitly grant access to the
hopefully-secure-but-plausibly-buggy parts of bpf() would make sense.
But you rejected that idea and said you "realized that [changing all
the capable() checks is] perfect as-is" without much explanation,
which makes it hard to understand where you're coming from.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 17:28                                                                     ` Alexei Starovoitov
  2019-08-15 18:36                                                                       ` Andy Lutomirski
@ 2019-08-15 18:43                                                                       ` Jordan Glover
  1 sibling, 0 replies; 61+ messages in thread
From: Jordan Glover @ 2019-08-15 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Thursday, August 15, 2019 5:28 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
>
> > On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> >
> > > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> > >
> > > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > > by the admin, then no kernel changes at all are needed. Programs that
> > > > want to reduce their own privileges can easily fork() a privileged
> > > > subprocess or run a little helper to which they delegate BPF
> > > > operations. This is far more flexible than anything that will ever be
> > > > in the kernel because it allows the helper to verify that the rest of
> > > > the program is doing exactly what it's supposed to and restrict eBPF
> > > > operations to exactly the subset that is needed. So a container
> > > > manager or network manager that drops some provilege could have a
> > > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > > configuration. The two processes would talk over a socketpair.
> > >
> > > there were three projects that tried to delegate bpf operations.
> > > All of them failed.
> > > bpf operational workflow is much more complex than you're imagining.
> > > fork() also doesn't work for all cases.
> > > I gave this example before: consider multiple systemd-like deamons
> > > that need to do bpf operations that want to pass this 'bpf capability'
> > > to other deamons written by other teams. Some of them will start
> > > non-root, but still need to do bpf. They will be rpm installed
> > > and live upgraded while running.
> > > We considered to make systemd such centralized bpf delegation
> > > authority too. It didn't work. bpf in kernel grows quickly.
> > > libbpf part grows independently. llvm keeps evolving.
> > > All of them are being changed while system overall has to stay
> > > operational. Centralized approach breaks apart.
> > >
> > > > The interesting cases you're talking about really do involved
> > > > unprivileged or less privileged eBPF, though. Let's see:
> > > > systemd --user: systemd --user is not privileged at all. There's no
> > > > issue of reducing privilege, since systemd --user doesn't have any
> > > > privilege to begin with. But systemd supports some eBPF features, and
> > > > presumably it would like to support them in the systemd --user case.
> > > > This is unprivileged eBPF.
> > >
> > > Let's disambiguate the terminology.
> > > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > > I think that was a mistake.
> > > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > > This is not unprivileged.
> > > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > > There is a huge difference between the two.
> > > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > > today for many reasons mentioned earlier.
> > > The /dev/bpf is about 'less privileged'.
> > > Less privileged than root. We need to split part of full root capability
> > > into bpf capability. So that most of the root can be dropped.
> > > This is very similar to what cap_net_admin does.
> > > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > > Same thing for cap_bpf.
> > > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > > this is the same thing. Two interfaces to achieve the same result.
> >
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.

The point was that systemd will run any arbitrary command you'll throw at it and you want
to automatically attach CAP_BPF to it. AmbientCapabilities is not valid option for
systemd --user instance (otherwise it would be nuts).

I think we may have misunderstanding here. Did you mean systemd "system" service with
"User=xxx" option instead of "systemd --user" service? It would make sense then.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-13 21:58                                                         ` Alexei Starovoitov
  2019-08-13 22:26                                                           ` Daniel Colascione
  2019-08-13 23:06                                                           ` Andy Lutomirski
@ 2019-08-15 19:46                                                           ` Kees Cook
  2019-08-15 23:46                                                             ` Alexei Starovoitov
  2 siblings, 1 reply; 61+ messages in thread
From: Kees Cook @ 2019-08-15 19:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.

Unsurprisingly, I totally disagree: this is the very definition of
improved "security": reduced attack surface, confined trust, etc.

> Linux has become a single user system.

I hope this is just hyperbole, because it's not true in reality. I agree
that the vast majority of Linux devices are single-user-at-a-time
systems now (rather than the "shell servers" of yore), but the system
still has to be expected to confine users from each other, root, and the
hardware. Switching users on Chrome OS or a distro laptop, etc is still
very much expected to _mean_ something.

> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will be break out of any sandbox.
> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.

I'm not sure why you draw the line for VMs -- they're just as buggy
as anything else. Regardless, I reject this line of thinking: yes,
all software is buggy, but that isn't a reason to give up. In fact,
we should be trying very hard to create safe code (*insert arguments
for sane languages and toolchains here*).

If you look at software safety as a binary, you will always be
disappointed. If you look at it as it manifests in the real world,
then there is some perspective to be had. Reachability of flaws becomes
a major factor; exploit chain length becomes a factor. There are very
real impacts to be had from security hardening, sandboxing, etc. Of
course nothing is perfect, but the current state of the world isn't
as you describe. (And I say this with the knowledge of how long
the lifetime of bugs are in the kernel.)

> Containers are used to make production systems safer.

Yes.

> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code

Perhaps it's just a language issue. "More secure" and "safer" mean
mostly the same thing to me. I tend to think "safer" is actually
a superset that includes things that wreck the user experience but
aren't actually in the privilege manipulation realm. In the traditional
"security" triad of confidentiality, integrity, and availability, I tend
to weigh availability less highly, but a bug that stops someone from
doing their work but doesn't wreck data, let them switch users, etc,
is still considered a "security" issue by many folks. The fewer bugs
someone is exposed to improves their security, safety, whatever. The
easiest way to do that is confinement and its associated attack surface
reduction. tl;dr: security and safety are very use-case-specific
continuum, not a binary state.

> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.
> There is not a single use case to allow arbitrary malicious bpf
> program to be loaded and executed.

The world isn't binary (safe code/malicious code), and we need to build
systems that can be used safely even when things go wrong. Yes, probably
no one has a system that _intentionally_ feeds eBPF into the kernel from
a web form. But there is probably someone who does it unintentionally,
or has a user login exposed on a system where unpriv BPF is enabled. The
point is to create primitives as safely as possible so when things DO
go wrong, they fail safe instead of making things worse.

I'm all for a "less privileged than root" API for eBPF, but I get worried
when I see "security" being treated as a binary state. Especially when
it is considered an always-failed state. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 18:36                                                                       ` Andy Lutomirski
@ 2019-08-15 23:08                                                                         ` Alexei Starovoitov
  2019-08-16  9:34                                                                           ` Jordan Glover
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-15 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jordan Glover, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > process. Also unprivileged user process can start systemd --user process with any
> > > command they like.
> >
> > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > CAP_BPF is a natural step in the same direction.
> >
> 
> I have the feeling that we're somehow speaking different languages.
> What, precisely, do you mean when you say "systemd itself is trusted"?
>  Do you mean "the administrator trusts that the /lib/systemd/systemd
> binary is not malicious"?  Do you mean "the administrator trusts that
> the running systemd process is not malicious"?

please see
https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
I'm not advocating for or against this approach.
Call it 'security hole' or 'better security'.
There are two categories of people for any feature like this.
My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
level of privileges is acceptable.
Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
But CAP_BPF is clearly better way.

> My suggestions upthread for incrementally making bpf() depend less on
> privilege would accomplish this goal.

As I pointed out countless times it would make the system overall _less_ secure.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1 to
make it _more_ secure.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 19:46                                                           ` Kees Cook
@ 2019-08-15 23:46                                                             ` Alexei Starovoitov
  2019-08-16  0:54                                                               ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-15 23:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List

On Thu, Aug 15, 2019 at 12:46:41PM -0700, Kees Cook wrote:
> On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> 
> Unsurprisingly, I totally disagree: this is the very definition of
> improved "security": reduced attack surface, confined trust, etc.

there are different ways to define the meaning of the word "security".
Of course containers reduce attack surface, etc.
The 'attack surface' as a mitigation from malicious users is not always the goal
of running containers. Ask yourself why containers are used in the datacenters
where only root can ssh into a server, only signed packages can
ever be installed, no browsers running, and no remote code is ever downloaded?

> > Linux has become a single user system.
> 
> I hope this is just hyperbole, because it's not true in reality. I agree
> that the vast majority of Linux devices are single-user-at-a-time
> systems now (rather than the "shell servers" of yore), but the system
> still has to be expected to confine users from each other, root, and the
> hardware. Switching users on Chrome OS or a distro laptop, etc is still
> very much expected to _mean_ something.

of course.

> 
> > If user can ssh into the host they can become root.
> > If arbitrary code can run on the host it will be break out of any sandbox.
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> 
> I'm not sure why you draw the line for VMs -- they're just as buggy
> as anything else. Regardless, I reject this line of thinking: yes,
> all software is buggy, but that isn't a reason to give up.

hmm. are you saying you want kernel community to work towards
making containers (namespaces) being able to run arbitrary code
downloaded from the internet?
In other words the problems solved by user space sandboxing, gvisor, etc
should be solved by the kernel?
I really don't think it's a good idea.

> If you look at software safety as a binary, you will always be
> disappointed. If you look at it as it manifests in the real world,
> then there is some perspective to be had. Reachability of flaws becomes
> a major factor; exploit chain length becomes a factor. There are very
> real impacts to be had from security hardening, sandboxing, etc. Of
> course nothing is perfect, but the current state of the world isn't
> as you describe. (And I say this with the knowledge of how long
> the lifetime of bugs are in the kernel.)

No arguing here. Security today is mainly the number of layers.
Hardening at all levels, sanboxing do help.
namespaces is one of the layers provided by the kernel.
The point that the kernel did its job already.
All other security layers are in user space.
Looking for bugs at every layer is still must have.
In the kernel, systemd, qemu, OS, browsers, etc.
Containers provide one level of security. VMs have another.

> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code
> 
> Perhaps it's just a language issue. "More secure" and "safer" mean
> mostly the same thing to me. I tend to think "safer" is actually
> a superset that includes things that wreck the user experience but
> aren't actually in the privilege manipulation realm. In the traditional
> "security" triad of confidentiality, integrity, and availability, I tend
> to weigh availability less highly, but a bug that stops someone from
> doing their work but doesn't wreck data, let them switch users, etc,
> is still considered a "security" issue by many folks. The fewer bugs
> someone is exposed to improves their security, safety, whatever. The
> easiest way to do that is confinement and its associated attack surface
> reduction. tl;dr: security and safety are very use-case-specific
> continuum, not a binary state.

yep

> 
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> > There is not a single use case to allow arbitrary malicious bpf
> > program to be loaded and executed.
> 
> The world isn't binary (safe code/malicious code), and we need to build
> systems that can be used safely even when things go wrong. Yes, probably
> no one has a system that _intentionally_ feeds eBPF into the kernel from
> a web form. But there is probably someone who does it unintentionally,
> or has a user login exposed on a system where unpriv BPF is enabled. The
> point is to create primitives as safely as possible so when things DO
> go wrong, they fail safe instead of making things worse.
> 
> I'm all for a "less privileged than root" API for eBPF, but I get worried
> when I see "security" being treated as a binary state. Especially when
> it is considered an always-failed state. :)

'security as always failed state' ? hmm.
not sure where this impression came from.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1
which will make the system overall _more_ secure.
I hope we can agree on that.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 23:46                                                             ` Alexei Starovoitov
@ 2019-08-16  0:54                                                               ` Andy Lutomirski
  2019-08-16  5:56                                                                 ` Song Liu
  2019-08-16 21:45                                                                 ` Alexei Starovoitov
  0 siblings, 2 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-16  0:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Song Liu, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List



> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:


>> 
>> I'm not sure why you draw the line for VMs -- they're just as buggy
>> as anything else. Regardless, I reject this line of thinking: yes,
>> all software is buggy, but that isn't a reason to give up.
> 
> hmm. are you saying you want kernel community to work towards
> making containers (namespaces) being able to run arbitrary code
> downloaded from the internet?

Yes.

As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)

To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.

Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:

CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.

Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.

(Sandstorm still exists but is no longer as actively developed, sadly.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16  0:54                                                               ` Andy Lutomirski
@ 2019-08-16  5:56                                                                 ` Song Liu
  2019-08-16 21:45                                                                 ` Alexei Starovoitov
  1 sibling, 0 replies; 61+ messages in thread
From: Song Liu @ 2019-08-16  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List



> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
>>> 
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>> 
>> hmm. are you saying you want kernel community to work towards
>> making containers (namespaces) being able to run arbitrary code
>> downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.
> 
> Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:
> 
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.
> 
> Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.
> 
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand different perspectives here. 

Disclaimer: Alexei and I both work for Facebook. But he may disagree 
with everything I am about to say below, because we haven't sync'ed 
about this for a while. :)

I think there are two types of use cases here: 

    1. CAP_BPF_ADMIN: one big key to all sys_bpf(). 
    2. CAP_BPF: subset of sys_bpf() that is safe for containers.

IIUC, currently, CAP_BPF_ADMIN is (almost) same as CAP_SYS_ADMIN. 
And there aren't many real world use cases for CAP_BPF. 

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN.
On the other hand, Andy would like to introduce CAP_BPF and build
amazing use cases around it (chicken-egg problem). 

Did I misunderstand anything?

If not, I think these two use cases do not really conflict with each
other, and we probably need both of them. Then, the next question is 
do we really need both/either of them. Maybe having two separate 
discussions would make it easier?


The following are some questions I am trying to understand for 
the two cases. 

For CAP_BPF_ADMIN (or /dev/bpf):
Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and
reuse existing CAP_ should be easier than introducing a new one? 

For CAP_BPF: 
Do we really need it for the containers? Is it possible to implement 
all container use cases with SUID? At this moment, I think SUID is 
the right way to go for this use case, because this is likely to 
start with a small set of functionalities. We can introduce CAP_BPF
when the container use case is too complicated for SUID. 


I hope some of these questions/thoughts would make some sense?

Thanks,
Song

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-15 23:08                                                                         ` Alexei Starovoitov
@ 2019-08-16  9:34                                                                           ` Jordan Glover
  2019-08-16  9:59                                                                             ` Thomas Gleixner
  0 siblings, 1 reply; 61+ messages in thread
From: Jordan Glover @ 2019-08-16  9:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Thursday, August 15, 2019 11:08 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
>
> > On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> > alexei.starovoitov@gmail.com wrote:
> >
> > > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > >
> > > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > > process. Also unprivileged user process can start systemd --user process with any
> > > > command they like.
> > >
> > > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > > CAP_BPF is a natural step in the same direction.
> >
> > I have the feeling that we're somehow speaking different languages.
> > What, precisely, do you mean when you say "systemd itself is trusted"?
> > Do you mean "the administrator trusts that the /lib/systemd/systemd
> > binary is not malicious"? Do you mean "the administrator trusts that
> > the running systemd process is not malicious"?
>
> please see
> https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
> I'm not advocating for or against this approach.
> Call it 'security hole' or 'better security'.
> There are two categories of people for any feature like this.
> My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
> level of privileges is acceptable.
> Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
> But CAP_BPF is clearly better way.
>

Do you realize it's not possible to grant CAP_NET_ADMIN or any other CAP in
"systemd --user" service? Trying to do so will fail with:
"Failed to apply ambient capabilities (before UID change): Operation not permitted"

I think it's crucial to clear that point to avoid confusion in this discussion
where people are talking about different things.

On the other hand running "systemd --system" service with:

User=nobody
AmbientCapabilities=CAP_NET_ADMIN

is perfectly legit and clears some security concerns as only privileged user
can start such service.

Jordan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16  9:34                                                                           ` Jordan Glover
@ 2019-08-16  9:59                                                                             ` Thomas Gleixner
  2019-08-16 11:33                                                                               ` Jordan Glover
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Gleixner @ 2019-08-16  9:59 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Fri, 16 Aug 2019, Jordan Glover wrote:
> "systemd --user" service? Trying to do so will fail with:
> "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> 
> I think it's crucial to clear that point to avoid confusion in this discussion
> where people are talking about different things.
> 
> On the other hand running "systemd --system" service with:
> 
> User=nobody
> AmbientCapabilities=CAP_NET_ADMIN
> 
> is perfectly legit and clears some security concerns as only privileged user
> can start such service.

While we are at it, can we please stop looking at this from a systemd only
perspective. There is a world outside of systemd.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16  9:59                                                                             ` Thomas Gleixner
@ 2019-08-16 11:33                                                                               ` Jordan Glover
  2019-08-16 19:52                                                                                 ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Jordan Glover @ 2019-08-16 11:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Friday, August 16, 2019 9:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 16 Aug 2019, Jordan Glover wrote:
>
> > "systemd --user" service? Trying to do so will fail with:
> > "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> > I think it's crucial to clear that point to avoid confusion in this discussion
> > where people are talking about different things.
> > On the other hand running "systemd --system" service with:
> > User=nobody
> > AmbientCapabilities=CAP_NET_ADMIN
> > is perfectly legit and clears some security concerns as only privileged user
> > can start such service.
>
> While we are at it, can we please stop looking at this from a systemd only
> perspective. There is a world outside of systemd.
>
> Thanks,
>
> tglx

If you define:

"systemd --user" == unprivileged process started by unprivileged user
"systemd --system" == process started by privileged user but run as another
user which keeps some of parent user privileges and drops others

you can get rid of "systemd" from the equation.

"systemd --user" was the example provided by Alexei when asked about the usecase
but his description didn't match what it does so it's not obvious what the real
usecase is. I'm sure there can be many more examples and systemd isn't important
here in particular beside to understand this specific example.

Jordan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16 11:33                                                                               ` Jordan Glover
@ 2019-08-16 19:52                                                                                 ` Alexei Starovoitov
  2019-08-16 20:28                                                                                   ` Thomas Gleixner
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-16 19:52 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Thomas Gleixner, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Fri, Aug 16, 2019 at 11:33:57AM +0000, Jordan Glover wrote:
> On Friday, August 16, 2019 9:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Fri, 16 Aug 2019, Jordan Glover wrote:
> >
> > > "systemd --user" service? Trying to do so will fail with:
> > > "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> > > I think it's crucial to clear that point to avoid confusion in this discussion
> > > where people are talking about different things.
> > > On the other hand running "systemd --system" service with:
> > > User=nobody
> > > AmbientCapabilities=CAP_NET_ADMIN
> > > is perfectly legit and clears some security concerns as only privileged user
> > > can start such service.
> >
> > While we are at it, can we please stop looking at this from a systemd only
> > perspective. There is a world outside of systemd.
> >
> > Thanks,
> >
> > tglx
> 
> If you define:
> 
> "systemd --user" == unprivileged process started by unprivileged user
> "systemd --system" == process started by privileged user but run as another
> user which keeps some of parent user privileges and drops others
> 
> you can get rid of "systemd" from the equation.
> 
> "systemd --user" was the example provided by Alexei when asked about the usecase
> but his description didn't match what it does so it's not obvious what the real
> usecase is. I'm sure there can be many more examples and systemd isn't important
> here in particular beside to understand this specific example.

It's both of the above when 'systemd' is not taken literally.
To earlier Thomas's point: the use case is not only about systemd.
There are other containers management systems.
I've used 'systemd-like' terminology as an attempt to explain that such
daemons are trusted signed binaries that can be run as pid=1.
Sometimes it's the later:
"process started by privileged user but run as another user which keeps
some of parent user privileges and drops others".
Sometimes capability delegation to another container management daemon
is too cumbersome, so it's easier to use suid bit on that other daemon.
So it will become like the former:
"sort-of unprivileged process started by unprivileged user."
where daemon has suid and drops most of the capabilities as it starts.
Let's not focus on the model being good or bad security wise.
The point that those are the use cases that folks are thinking about.
That secondary daemon can be full root just fine.
All outer and inner daemons can be root.
These daemons need to drop privileges to make the system safer ==
less prone to corruption due to bugs in themselves. Not necessary security bugs.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16 19:52                                                                                 ` Alexei Starovoitov
@ 2019-08-16 20:28                                                                                   ` Thomas Gleixner
  2019-08-17 15:02                                                                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Thomas Gleixner @ 2019-08-16 20:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

Alexei,

On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> It's both of the above when 'systemd' is not taken literally.
> To earlier Thomas's point: the use case is not only about systemd.
> There are other containers management systems.

<SNIP>

> These daemons need to drop privileges to make the system safer == less
> prone to corruption due to bugs in themselves. Not necessary security
> bugs.

Let's take a step back.

While real usecases are helpful to understand a design decision, the design
needs to be usecase independent.

The kernel provides mechanisms, not policies. My impression of this whole
discussion is that it is policy driven. That's the wrong approach.

So let's look at the mechanisms which we have at hand:

 1) Capabilities
 
 2) SUID and dropping priviledges

 3) Seccomp and LSM

Now the real interesting questions are:

 A) What kind of restrictions does BPF allow? Is it a binary on/off or is
    there a more finegrained control of BPF functionality?

    TBH, I can't tell.

 B) Depending on the answer to #A what is the control possibility for
    #1/#2/#3 ?

Answering those questions gives us a real scope of what can be achieved
independent of use cases and wishful thought out policies.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16  0:54                                                               ` Andy Lutomirski
  2019-08-16  5:56                                                                 ` Song Liu
@ 2019-08-16 21:45                                                                 ` Alexei Starovoitov
  2019-08-16 22:22                                                                   ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-16 21:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Andy Lutomirski, Song Liu, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List

On Thu, Aug 15, 2019 at 05:54:59PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
> >> 
> >> I'm not sure why you draw the line for VMs -- they're just as buggy
> >> as anything else. Regardless, I reject this line of thinking: yes,
> >> all software is buggy, but that isn't a reason to give up.
> > 
> > hmm. are you saying you want kernel community to work towards
> > making containers (namespaces) being able to run arbitrary code
> > downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)

exactly: "meltdown", "spectre", "sigh".
Side channels effectively stalled the work on secure containers.
And killed projects like sandstorm.
Why work on improving kaslr when there are several ways to
get kernel addresses through hw bugs? Patch mouse holes when roof is leaking ?
In case of unprivileged bpf I'm confident that all known holes are patched.
Until disclosures stop happening with the frequency they do now the time
of bpf developers is better spent on something other than unprivileged bpf.

> I’m suggesting that you engage the security community ...
> .. so that normal users can use bpf filtering

yes, but not soon. unfortunately.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16 21:45                                                                 ` Alexei Starovoitov
@ 2019-08-16 22:22                                                                   ` Christian Brauner
  2019-08-17 15:08                                                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2019-08-16 22:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Fri, Aug 16, 2019 at 02:45:44PM -0700, Alexei Starovoitov wrote:
> On Thu, Aug 15, 2019 at 05:54:59PM -0700, Andy Lutomirski wrote:
> > 
> > 
> > > On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > 
> > >> 
> > >> I'm not sure why you draw the line for VMs -- they're just as buggy
> > >> as anything else. Regardless, I reject this line of thinking: yes,
> > >> all software is buggy, but that isn't a reason to give up.
> > > 
> > > hmm. are you saying you want kernel community to work towards
> > > making containers (namespaces) being able to run arbitrary code
> > > downloaded from the internet?
> > 
> > Yes.

If I may weigh in here too: Yes. In fact, we already do that large
scale!

> > 
> > As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> exactly: "meltdown", "spectre", "sigh".
> Side channels effectively stalled the work on secure containers.
> And killed projects like sandstorm.

If I may, Sandstorm's death has very likely nothing to do with
Meltdown/Spectre etc. since that should've also killed Qemu, Crosvm and
all the others in one fell swoop. It's also not a very good example (no
offense, Andy :)) and probably a bit dated.
We have LXD which is a full-fledged container manager that runs
*unprivileged system* containers on a large scale and is very much
alive. That is it runs systemd, openrc, what have you, i.e. simply
unmodifed distro images at this point.

It's used to run Linux workloads on all Chromebooks and in a bunch of
other places. Since its inception we did not have a single
*unprivileged* container to init userns/host breakout.
At this point in time the really bad CVEs are almost exclusively against
*privileged* containers (see this year's leading nomination for
container CVE grand mal of the year: CVE-2019-5736) which were never and
will never be considered root safe despite everyone pretending
otherwise.


> Why work on improving kaslr when there are several ways to
> get kernel addresses through hw bugs? Patch mouse holes when roof is leaking ?
> In case of unprivileged bpf I'm confident that all known holes are patched.
> Until disclosures stop happening with the frequency they do now the time
> of bpf developers is better spent on something other than unprivileged bpf.
> 
> > I’m suggesting that you engage the security community ...
> > .. so that normal users can use bpf filtering
> 
> yes, but not soon. unfortunately.

Tbh, I do not have a strong opinion on unprivileged bpf at this moment.
If the community thinks that the bits and pieces are not in place or the
timing is not right that's fine with me.
What we should make sure though is that we don't end up with something
halfbaked. And this /dev/bpf thing very much feels like a hack. If
unprivileged bpf is not a thing why bother with /dev/bpf or CAP_BPF at
all.

(The one usecase I'd care about is to extend seccomp to do pointer-based
syscall filtering. Whether or not that'd require (unprivileged) ebpf is
up for discussion at KSummit.)

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16 20:28                                                                                   ` Thomas Gleixner
@ 2019-08-17 15:02                                                                                     ` Alexei Starovoitov
  2019-08-17 15:44                                                                                       ` Andy Lutomirski
  2019-08-19  9:15                                                                                       ` Thomas Gleixner
  0 siblings, 2 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-17 15:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> Alexei,
> 
> On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > It's both of the above when 'systemd' is not taken literally.
> > To earlier Thomas's point: the use case is not only about systemd.
> > There are other containers management systems.
> 
> <SNIP>
> 
> > These daemons need to drop privileges to make the system safer == less
> > prone to corruption due to bugs in themselves. Not necessary security
> > bugs.
> 
> Let's take a step back.
> 
> While real usecases are helpful to understand a design decision, the design
> needs to be usecase independent.
> 
> The kernel provides mechanisms, not policies. My impression of this whole
> discussion is that it is policy driven. That's the wrong approach.

not sure what you mean by 'policy driven'.
Proposed CAP_BPF is a policy?

My desire to do kernel.unprivileged_bpf_disabled=1 is driven by
text in Documentation/x86/mds.rst which says:
"There is one exception, which is untrusted BPF. The functionality of
untrusted BPF is limited, but it needs to be thoroughly investigated
whether it can be used to create such a construct."

commit 6a9e52927251 ("x86/speculation/mds: Add mds_clear_cpu_buffers()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>

The way I read this text:
- there is a concern that mds is exploitable via bpf
- there is a desire to investigate to address this concern

I'm committed to help with the investigation.

In the mean time I propose a path to do
kernel.unprivileged_bpf_disabled=1 which is CAP_BPF.

Can kernel.unprivileged_bpf_disabled=1 be used now?
Yes, but it will weaken overall system security because things that
use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
to move to stronger CAP_SYS_ADMIN.

With CAP_BPF both load and attach would happen under CAP_BPF
instead of CAP_SYS_ADMIN.

> So let's look at the mechanisms which we have at hand:
> 
>  1) Capabilities
>  
>  2) SUID and dropping priviledges
> 
>  3) Seccomp and LSM
> 
> Now the real interesting questions are:
> 
>  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
>     there a more finegrained control of BPF functionality?
> 
>     TBH, I can't tell.
> 
>  B) Depending on the answer to #A what is the control possibility for
>     #1/#2/#3 ?

Can any of the mechanisms 1/2/3 address the concern in mds.rst?

I believe Andy wants to expand the attack surface when
kernel.unprivileged_bpf_disabled=0
Before that happens I'd like the community to work on addressing the text above.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-16 22:22                                                                   ` Christian Brauner
@ 2019-08-17 15:08                                                                     ` Alexei Starovoitov
  2019-08-17 15:16                                                                       ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-17 15:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
> 
> (The one usecase I'd care about is to extend seccomp to do pointer-based
> syscall filtering. Whether or not that'd require (unprivileged) ebpf is
> up for discussion at KSummit.)

Kees have been always against using ebpf in seccomp. I believe he still
holds this opinion. Until he changes his mind let's stop bringing seccomp
as a use case for unpriv bpf.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-17 15:08                                                                     ` Alexei Starovoitov
@ 2019-08-17 15:16                                                                       ` Christian Brauner
  2019-08-17 15:36                                                                         ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2019-08-17 15:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
>> 
>> (The one usecase I'd care about is to extend seccomp to do
>pointer-based
>> syscall filtering. Whether or not that'd require (unprivileged) ebpf
>is
>> up for discussion at KSummit.)
>
>Kees have been always against using ebpf in seccomp. I believe he still
>holds this opinion. Until he changes his mind let's stop bringing
>seccomp
>as a use case for unpriv bpf.

That's why I said "whether or not".
For the record, I do prefer a non-unpriv-ebpf way.
It's still something that will most surely come up in the discussion though.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-17 15:16                                                                       ` Christian Brauner
@ 2019-08-17 15:36                                                                         ` Alexei Starovoitov
  2019-08-17 15:42                                                                           ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-17 15:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Sat, Aug 17, 2019 at 05:16:53PM +0200, Christian Brauner wrote:
> On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
> >> 
> >> (The one usecase I'd care about is to extend seccomp to do
> >pointer-based
> >> syscall filtering. Whether or not that'd require (unprivileged) ebpf
> >is
> >> up for discussion at KSummit.)
> >
> >Kees have been always against using ebpf in seccomp. I believe he still
> >holds this opinion. Until he changes his mind let's stop bringing
> >seccomp
> >as a use case for unpriv bpf.
> 
> That's why I said "whether or not".
> For the record, I do prefer a non-unpriv-ebpf way.
> It's still something that will most surely come up in the discussion though.

It's very un-kernely way to defer to in-person meetings.
If there is anything to discuss please discuss it on the public mailing list.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-17 15:36                                                                         ` Alexei Starovoitov
@ 2019-08-17 15:42                                                                           ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2019-08-17 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Kees Cook, Andy Lutomirski, Song Liu,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On August 17, 2019 5:36:54 PM GMT+02:00, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>On Sat, Aug 17, 2019 at 05:16:53PM +0200, Christian Brauner wrote:
>> On August 17, 2019 5:08:45 PM GMT+02:00, Alexei Starovoitov
><alexei.starovoitov@gmail.com> wrote:
>> >On Sat, Aug 17, 2019 at 12:22:53AM +0200, Christian Brauner wrote:
>> >> 
>> >> (The one usecase I'd care about is to extend seccomp to do
>> >pointer-based
>> >> syscall filtering. Whether or not that'd require (unprivileged)
>ebpf
>> >is
>> >> up for discussion at KSummit.)
>> >
>> >Kees have been always against using ebpf in seccomp. I believe he
>still
>> >holds this opinion. Until he changes his mind let's stop bringing
>> >seccomp
>> >as a use case for unpriv bpf.
>> 
>> That's why I said "whether or not".
>> For the record, I do prefer a non-unpriv-ebpf way.
>> It's still something that will most surely come up in the discussion
>though.
>
>It's very un-kernely way to defer to in-person meetings.
>If there is anything to discuss please discuss it on the public mailing
>list.

https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2019-July/006699.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-17 15:02                                                                                     ` Alexei Starovoitov
@ 2019-08-17 15:44                                                                                       ` Andy Lutomirski
  2019-08-19  9:15                                                                                       ` Thomas Gleixner
  1 sibling, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-17 15:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jordan Glover, Andy Lutomirski,
	Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List



> On Aug 17, 2019, at 8:02 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> 
> Can any of the mechanisms 1/2/3 address the concern in mds.rst?
> 

seccomp() can. It’s straightforward to use seccomp to disable bpf() outright for a process tree.  In this regard, bpf() isn’t particularly unique — it’s a system call that exposes some attack surface and that isn’t required by most programs for basic functionality.

At LPC this year, there will be a discussion about seccomp improvements that will, among other things, offer fiber-grained control. It’s quite likely, for example, that seccomp will soon be able to enable and disable specific map types or attach types.  The exact mechanism isn’t decided yet,  but I think everyone expects that this is mostly a design problem, not an implementation problem.

This is off topic for the current thread, but it could be useful to allow bpf programs to be loaded from files directly (i.e. pass an fd to a file into bpf() to load the program), which would enable LSMs to check that the file is appropriately labeled. This would dramatically raise the bar for exploitation of verifier bugs or speculation attacks, since anyone trying to exploit it would need to get the bpf payload through LSM policy first.

> I believe Andy wants to expand the attack surface when
> kernel.unprivileged_bpf_disabled=0
> Before that happens I'd like the community to work on addressing the text above.
> 

Not by much. BPF maps are already largely exposed to unprivileged code (when unprivileged_bpf_disabled=0).  The attack surface is there, and they’re arguably even more exposed than they should be.  My patch 1 earlier was about locking these interfaces down.

Similarly, my suggestions about reworking cgroup attach and program load don’t actually allow fully unprivileged users to run arbitrary bpf() programs [0] — under my proposal, to attach a bpf cgroup program, you need a delegated cgroup. The mechanism could be extended by a requirement that a privileged cgroup manager explicitly enable certain attach types for a delegated subtree.

A cgroup knob to turn unprivileged bpf on and off for tasks in the cgroup might actually be quite useful.

[0] on some thought, the test run mechanism should probably remain root-only.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-17 15:02                                                                                     ` Alexei Starovoitov
  2019-08-17 15:44                                                                                       ` Andy Lutomirski
@ 2019-08-19  9:15                                                                                       ` Thomas Gleixner
  2019-08-19 17:27                                                                                         ` Alexei Starovoitov
  1 sibling, 1 reply; 61+ messages in thread
From: Thomas Gleixner @ 2019-08-19  9:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

Alexei,

On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
> On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> > On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > While real usecases are helpful to understand a design decision, the design
> > needs to be usecase independent.
> > 
> > The kernel provides mechanisms, not policies. My impression of this whole
> > discussion is that it is policy driven. That's the wrong approach.
> 
> not sure what you mean by 'policy driven'.
> Proposed CAP_BPF is a policy?

I was referring to the discussion as a whole.
 
> Can kernel.unprivileged_bpf_disabled=1 be used now?
> Yes, but it will weaken overall system security because things that
> use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
> to move to stronger CAP_SYS_ADMIN.
> 
> With CAP_BPF both load and attach would happen under CAP_BPF
> instead of CAP_SYS_ADMIN.

I'm not arguing against that.

> > So let's look at the mechanisms which we have at hand:
> > 
> >  1) Capabilities
> >  
> >  2) SUID and dropping priviledges
> > 
> >  3) Seccomp and LSM
> > 
> > Now the real interesting questions are:
> > 
> >  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
> >     there a more finegrained control of BPF functionality?
> > 
> >     TBH, I can't tell.
> > 
> >  B) Depending on the answer to #A what is the control possibility for
> >     #1/#2/#3 ?
> 
> Can any of the mechanisms 1/2/3 address the concern in mds.rst?

Well, that depends. As with any other security policy which is implemented
via these mechanisms, the policy can be strict enough to prevent it by not
allowing certain operations. The more fine-grained the control is, it
allows the administrator who implements the policy to remove the
'dangerous' parts from an untrusted user.

So really question #A is important for this. Is BPF just providing a binary
ON/OFF knob or does it allow to disable/enable certain aspects of BPF
functionality in a more fine grained way? If the latter, then it might be
possible to control functionality which might be abused for exploits of
some sorts (including MDS) in a way which allows other parts of BBF to be
exposed to less priviledged contexts.

> I believe Andy wants to expand the attack surface when
> kernel.unprivileged_bpf_disabled=0
> Before that happens I'd like the community to work on addressing the text above.

Well, that text above can be removed when the BPF wizards are entirely sure
that BPF cannot be abused to exploit stuff. 

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-19  9:15                                                                                       ` Thomas Gleixner
@ 2019-08-19 17:27                                                                                         ` Alexei Starovoitov
  2019-08-19 17:38                                                                                           ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-19 17:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List

On Mon, Aug 19, 2019 at 11:15:11AM +0200, Thomas Gleixner wrote:
> Alexei,
> 
> On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
> > On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
> > > On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
> > > While real usecases are helpful to understand a design decision, the design
> > > needs to be usecase independent.
> > > 
> > > The kernel provides mechanisms, not policies. My impression of this whole
> > > discussion is that it is policy driven. That's the wrong approach.
> > 
> > not sure what you mean by 'policy driven'.
> > Proposed CAP_BPF is a policy?
> 
> I was referring to the discussion as a whole.
>  
> > Can kernel.unprivileged_bpf_disabled=1 be used now?
> > Yes, but it will weaken overall system security because things that
> > use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
> > to move to stronger CAP_SYS_ADMIN.
> > 
> > With CAP_BPF both load and attach would happen under CAP_BPF
> > instead of CAP_SYS_ADMIN.
> 
> I'm not arguing against that.
> 
> > > So let's look at the mechanisms which we have at hand:
> > > 
> > >  1) Capabilities
> > >  
> > >  2) SUID and dropping priviledges
> > > 
> > >  3) Seccomp and LSM
> > > 
> > > Now the real interesting questions are:
> > > 
> > >  A) What kind of restrictions does BPF allow? Is it a binary on/off or is
> > >     there a more finegrained control of BPF functionality?
> > > 
> > >     TBH, I can't tell.
> > > 
> > >  B) Depending on the answer to #A what is the control possibility for
> > >     #1/#2/#3 ?
> > 
> > Can any of the mechanisms 1/2/3 address the concern in mds.rst?
> 
> Well, that depends. As with any other security policy which is implemented
> via these mechanisms, the policy can be strict enough to prevent it by not
> allowing certain operations. The more fine-grained the control is, it
> allows the administrator who implements the policy to remove the
> 'dangerous' parts from an untrusted user.
> 
> So really question #A is important for this. Is BPF just providing a binary
> ON/OFF knob or does it allow to disable/enable certain aspects of BPF
> functionality in a more fine grained way? If the latter, then it might be
> possible to control functionality which might be abused for exploits of
> some sorts (including MDS) in a way which allows other parts of BBF to be
> exposed to less priviledged contexts.

I see. So the kernel.unprivileged_bpf_disabled knob is binary and I think it's
the right mechanism to expose to users.
Having N knobs for every map/prog type won't decrease attack surface.
In the other email Andy's quoting seccomp man page...
Today seccomp cannot really look into bpf_attr syscall args, but even
if it could it won't secure the system.
Examples:
1.
spectre v2 is using bpf in-kernel interpreter in speculative way.
The mere presence of interpreter as part of kernel .text makes the exploit
easier to do. That was the reason to do CONFIG_BPF_JIT_ALWAYS_ON.
For this case even kernel.unprivileged_bpf_disabled=1 was hopeless.

2.
var4 doing store hazard. It doesn't matter which program type is used.
load/store instructions are the same across program types.

3.
prog_array was used as part of var1. I guess it was simply more
convenient for Jann to do it this way :) All other map types
have the same out-of-bounds speculation issue.

In general side channels are cpu bugs that are exploited via sequences
of cpu instructions. In that sense bpf infra provides these instructions.
So all program types and all maps have the same level of 'side channel risk'.

> > I believe Andy wants to expand the attack surface when
> > kernel.unprivileged_bpf_disabled=0
> > Before that happens I'd like the community to work on addressing the text above.
> 
> Well, that text above can be removed when the BPF wizards are entirely sure
> that BPF cannot be abused to exploit stuff. 

Myself and Daniel looked at it in detail. I think we understood
MDS mechanism well enough. Right now we're fairly confident that
combination of existing mechanisms we did for var4 and
verifier speculative analysis protect us from MDS.
The thing is that every new cpu bug is looked at through the bpf lenses.
Can it be exploited through bpf? Complexity of side channels
is growing. Can the most recent swapgs be exploited ?
What if we kprobe+bpf somewhere ?
I don't think there is an issue, but we will never be 'entirely sure'.
Even if myself and Daniel are sure the concern will stay.
Unprivileged bpf as a whole is the concern due to side channels.
The number of them are not yet disclosed. Who is going to analyze them?
imo the only answer to that is kernel.unprivileged_bpf_disabled=1
which together with CONFIG_BPF_JIT_ALWAYS_ON is secure enough.
The other option is to sprinkle every bpf load/store with lfence
which will make execution so slow that it will be unusable.
Which is effectively the same as unprivileged_bpf_disabled=1.

There are other things we can do. Like kasan-style shadow memory
for bpf execution. Auto re-JITing the code after it's running.
We can do lfences everywhere for some time then re-JIT
when kasan-ed shadow memory shows only clean memory accesses.
The beauty of BPF that it is analyze-able and JIT-able instruction set.
The verifier speculative analysis is an example that the kernel
can analyze the speculative execution path that cpu will
take before the code starts executing.
Unprivileged bpf can made absolutely secure. It can be
made more secure than the rest of the kernel.
But today we should just go with unprivileged_bpf_disabled=1
and CAP_BPF.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-19 17:27                                                                                         ` Alexei Starovoitov
@ 2019-08-19 17:38                                                                                           ` Andy Lutomirski
  0 siblings, 0 replies; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-19 17:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jordan Glover, Andy Lutomirski,
	Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List



> On Aug 19, 2019, at 10:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Mon, Aug 19, 2019 at 11:15:11AM +0200, Thomas Gleixner wrote:
>> Alexei,
>> 
>>> On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
>>>> On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
>>>> On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
>>>> While real usecases are helpful to understand a design decision, the design
>>>> needs to be usecase independent.
>>>> 
>>>> The kernel provides mechanisms, not policies. My impression of this whole
>>>> discussion is that it is policy driven. That's the wrong approach.
>>> 
>>> not sure what you mean by 'policy driven'.
>>> Proposed CAP_BPF is a policy?
>> 
>> I was referring to the discussion as a whole.
>> 
>>> Can kernel.unprivileged_bpf_disabled=1 be used now?
>>> Yes, but it will weaken overall system security because things that
>>> use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
>>> to move to stronger CAP_SYS_ADMIN.
>>> 
>>> With CAP_BPF both load and attach would happen under CAP_BPF
>>> instead of CAP_SYS_ADMIN.
>> 
>> I'm not arguing against that.
>> 
>>>> So let's look at the mechanisms which we have at hand:
>>>> 
>>>> 1) Capabilities
>>>> 
>>>> 2) SUID and dropping priviledges
>>>> 
>>>> 3) Seccomp and LSM
>>>> 
>>>> Now the real interesting questions are:
>>>> 
>>>> A) What kind of restrictions does BPF allow? Is it a binary on/off or is
>>>>    there a more finegrained control of BPF functionality?
>>>> 
>>>>    TBH, I can't tell.
>>>> 
>>>> B) Depending on the answer to #A what is the control possibility for
>>>>    #1/#2/#3 ?
>>> 
>>> Can any of the mechanisms 1/2/3 address the concern in mds.rst?
>> 
>> Well, that depends. As with any other security policy which is implemented
>> via these mechanisms, the policy can be strict enough to prevent it by not
>> allowing certain operations. The more fine-grained the control is, it
>> allows the administrator who implements the policy to remove the
>> 'dangerous' parts from an untrusted user.
>> 
>> So really question #A is important for this. Is BPF just providing a binary
>> ON/OFF knob or does it allow to disable/enable certain aspects of BPF
>> functionality in a more fine grained way? If the latter, then it might be
>> possible to control functionality which might be abused for exploits of
>> some sorts (including MDS) in a way which allows other parts of BBF to be
>> exposed to less priviledged contexts.
> 
> I see. So the kernel.unprivileged_bpf_disabled knob is binary and I think it's
> the right mechanism to expose to users.
> Having N knobs for every map/prog type won't decrease attack surface.
> In the other email Andy's quoting seccomp man page...
> Today seccomp cannot really look into bpf_attr syscall args, but even
> if it could it won't secure the system.
> Examples:
> 1.
> spectre v2 is using bpf in-kernel interpreter in speculative way.
> The mere presence of interpreter as part of kernel .text makes the exploit
> easier to do. That was the reason to do CONFIG_BPF_JIT_ALWAYS_ON.
> For this case even kernel.unprivileged_bpf_disabled=1 was hopeless.
> 
> 2.
> var4 doing store hazard. It doesn't matter which program type is used.
> load/store instructions are the same across program types.
> 
> 3.
> prog_array was used as part of var1. I guess it was simply more
> convenient for Jann to do it this way :) All other map types
> have the same out-of-bounds speculation issue.
> 
> In general side channels are cpu bugs that are exploited via sequences
> of cpu instructions. In that sense bpf infra provides these instructions.
> So all program types and all maps have the same level of 'side channel risk'.
> 
>>> I believe Andy wants to expand the attack surface when
>>> kernel.unprivileged_bpf_disabled=0
>>> Before that happens I'd like the community to work on addressing the text above.
>> 
>> Well, that text above can be removed when the BPF wizards are entirely sure
>> that BPF cannot be abused to exploit stuff. 
> 
> Myself and Daniel looked at it in detail. I think we understood
> MDS mechanism well enough. Right now we're fairly confident that
> combination of existing mechanisms we did for var4 and
> verifier speculative analysis protect us from MDS.
> The thing is that every new cpu bug is looked at through the bpf lenses.
> Can it be exploited through bpf? Complexity of side channels
> is growing. Can the most recent swapgs be exploited ?
> What if we kprobe+bpf somewhere ?
> I don't think there is an issue, but we will never be 'entirely sure'.
> Even if myself and Daniel are sure the concern will stay.
> Unprivileged bpf as a whole is the concern due to side channels.
> The number of them are not yet disclosed. Who is going to analyze them?
> imo the only answer to that is kernel.unprivileged_bpf_disabled=1
> which together with CONFIG_BPF_JIT_ALWAYS_ON is secure enough.
> The other option is to sprinkle every bpf load/store with lfence
> which will make execution so slow that it will be unusable.
> Which is effectively the same as unprivileged_bpf_disabled=1.
> 
> There are other things we can do. Like kasan-style shadow memory
> for bpf execution. Auto re-JITing the code after it's running.
> We can do lfences everywhere for some time then re-JIT
> when kasan-ed shadow memory shows only clean memory accesses.
> The beauty of BPF that it is analyze-able and JIT-able instruction set.
> The verifier speculative analysis is an example that the kernel
> can analyze the speculative execution path that cpu will
> take before the code starts executing.
> Unprivileged bpf can made absolutely secure. It can be
> made more secure than the rest of the kernel.
> But today we should just go with unprivileged_bpf_disabled=1

I’m still okay with this.

> and CAP_BPF.
> 

I think this needs more design work.  I’m halfway through writing up an actual proposal. I’ll send it soon.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-07  5:24                                                       ` Andy Lutomirski
  2019-08-07  9:03                                                         ` Lorenz Bauer
  2019-08-13 21:58                                                         ` Alexei Starovoitov
@ 2019-08-22 14:17                                                         ` Daniel Borkmann
  2019-08-22 15:16                                                           ` Andy Lutomirski
  2019-08-22 22:48                                                           ` [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf Alexei Starovoitov
  2 siblings, 2 replies; 61+ messages in thread
From: Daniel Borkmann @ 2019-08-22 14:17 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List, Chenbo Feng

On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
>>> It tries to make the kernel respect the access modes for fds.  Without
>>> this patch, there seem to be some holes: nothing looked at program fds
>>> and, unless I missed something, you could take a readonly fd for a
>>> program, pin the program, and reopen it RW.
>>
>> I think it's by design. iirc Daniel had a use case for something like this.
> 
> That seems odd.  Daniel, can you elaborate?

[ ... catching up late. ]

Not from my side, the change was added by Chenbo back then for Android
use-case to replace xt_qtaguid and xt_owner with BPF programs and to
allow unprivileged applications to read maps. More on their architecture:

   https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor

 From the cover-letter:

   [...]
   The network-control daemon (netd) creates and loads an eBPF object for
   network packet filtering and analysis. It passes the object FD to an
   unprivileged network monitor app (netmonitor), which is not allowed to
   create, modify or load eBPF objects, but is allowed to read the traffic
   stats from the map.
   [...]

Iuuc, netd would be in charge with the ability to r/w into maps and
pin them, but with the ability to to hand off some map fds as r/o to
unprivileged applications in order for them to query data.

>> Hence unprivileged bpf is actually something that can be deprecated.

There is actually a publicly known use-case on unprivileged bpf wrt
socket filters, see the SO_ATTACH_BPF on sockets section as an example:

   https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/

If I'd have to take a good guess, I'd think it's major use-case is in
SO_ATTACH_REUSEPORT_EBPF in the wild, I don't think the sysctl can be
outright flipped or deprecated w/o breaking existing applications unless
it's cleanly modeled into some sort of customizable CAP_BPF* type policy
(more below) where this would be the lowest common denominator.

> I hope not.  There are a couple setsockopt uses right now, and and
> seccomp will surely want it someday.  And the bpf-inside-container use
> case really is unprivileged bpf -- containers are, in many (most?)
> cases, explicitly not trusted by the host.
[...]
>> Inside containers and inside nested containers we need to start processes
>> that will use bpf. All of the processes are trusted.
> 
> Trusted by whom?  In a non-nested container, the container manager
> *might* be trusted by the outside world.  In a *nested* container,
> unless the inner container management is controlled from outside the
> outer container, it's not trusted.  I don't know much about how
> Facebook's containers work, but the LXC/LXD/Podman world is moving
> very strongly toward user namespaces and maximally-untrusted
> containers, and I think bpf() should work in that context.

[...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
ideally be possible for that container to install BPF programs for
mangling, dropping, forwarding etc as long as it's only affecting it's
/own/ netns like the rest of networking subsystem controls that work
in that case. I would actually like to get to this at some point and
make it more approachable as long as there is a way for an admin to
/opt into it/ via policy (aka not by default). Thinking out loud, I'd
love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
customizable seccomp policy. Meaning, there could be several CAP_BPF
type sub-policies e.g. from allowing everything (equivalent to the
/dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
programmable user defined policy that can be tailored to specific
needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
their own netns, and if that is untrusted, then same restrictions/
mitigations could be done by the verifier as with (current) unprivileged
BPF, enabled via programmable policy as well. We wouldn't make any
static/fixed assumptions, but allow users to define them based on their
own use-cases. Haven't looked how feasible this would be, but something
to take into consideration when we rework the current [admittedly
suboptimal all-or-nothing] model we have. Is this something you had in
mind as well for your wip proposal, Andy?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-22 14:17                                                         ` Daniel Borkmann
@ 2019-08-22 15:16                                                           ` Andy Lutomirski
  2019-08-22 15:17                                                             ` RFC: very rough draft of a bpf permission model Andy Lutomirski
  2019-08-22 22:48                                                           ` [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf Alexei Starovoitov
  1 sibling, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-22 15:16 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng

On Thu, Aug 22, 2019 at 7:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> > On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> >>> It tries to make the kernel respect the access modes for fds.  Without
> >>> this patch, there seem to be some holes: nothing looked at program fds
> >>> and, unless I missed something, you could take a readonly fd for a
> >>> program, pin the program, and reopen it RW.
> >>
> >> I think it's by design. iirc Daniel had a use case for something like this.
> >
> > That seems odd.  Daniel, can you elaborate?
>
> [ ... catching up late. ]
>
> Not from my side, the change was added by Chenbo back then for Android
> use-case to replace xt_qtaguid and xt_owner with BPF programs and to
> allow unprivileged applications to read maps. More on their architecture:
>
>    https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor
>
>  From the cover-letter:
>
>    [...]
>    The network-control daemon (netd) creates and loads an eBPF object for
>    network packet filtering and analysis. It passes the object FD to an
>    unprivileged network monitor app (netmonitor), which is not allowed to
>    create, modify or load eBPF objects, but is allowed to read the traffic
>    stats from the map.
>    [...]

I suspect that this use case is, in fact, mostly broken in current
kernels.  An unprivileged process with a read-only fd to a bpf map can
BPF_OBJ_PIN the map and then re-open it read-write.  As far as I can
tell, the only thing mitigating this is that it won't work unless the
attacker has write access to some directory in bpffs.

> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> [...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
> ideally be possible for that container to install BPF programs for
> mangling, dropping, forwarding etc as long as it's only affecting it's
> /own/ netns like the rest of networking subsystem controls that work
> in that case. I would actually like to get to this at some point and
> make it more approachable as long as there is a way for an admin to
> /opt into it/ via policy (aka not by default).

For better or for worse, I think this would need a massive
re-architecting of the way bpf filtering works.  bpf filters attach to
cgroups, which aren't scoped to network namespaces at all.  So we need
a different permission model.

> Thinking out loud, I'd
> love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
> customizable seccomp policy. Meaning, there could be several CAP_BPF
> type sub-policies e.g. from allowing everything (equivalent to the
> /dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
> programmable user defined policy that can be tailored to specific
> needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
> or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
> HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
> their own netns, and if that is untrusted, then same restrictions/
> mitigations could be done by the verifier as with (current) unprivileged
> BPF, enabled via programmable policy as well. We wouldn't make any
> static/fixed assumptions, but allow users to define them based on their
> own use-cases. Haven't looked how feasible this would be, but something
> to take into consideration when we rework the current [admittedly
> suboptimal all-or-nothing] model we have. Is this something you had in
> mind as well for your wip proposal, Andy?
>

Hmm.  Fine-grained seccomp stuff like this is very much in scope for
the seccomp discussion that's happening at LPC this year.
Unfortunately, I'm not there, but I'm participating via the mailing
list.

I also finally finished typing a very rough draft of my bpf ideas.
I'll email them out momentarily in a separate email.  I think it
should come fairly close to doing what you want.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RFC: very rough draft of a bpf permission model
  2019-08-22 15:16                                                           ` Andy Lutomirski
@ 2019-08-22 15:17                                                             ` Andy Lutomirski
  2019-08-22 23:26                                                               ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-22 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng

BPF security strawman, v0.1

This is very rough.  Most of this, especially the API details, needs
work before it's ready to implement.  The whole concept also needs
review.

= Goals =

The overall goal is to make it possible to use eBPF without having
what is effectively administrator access.  For example, an eBPF user
should not be able to directly tamper with other processes (unless
this permission is explicitly granted) and should not be able to
read or write other users' eBPF maps.

It should be possible to use eBPF inside a user namespace without breaking
the userns security model.

Due to the risk of speculation attacks and such being carried out via
eBPF, it should not become possible to use too much of eBPF without the
administrator's permission.  (NB: it is already possible to use
*classic* BPF without any permission, and classic BPF is translated
internally to eBPF, so this goal can only be met to a limited extent.)

= Definitions =

Global capability: A capability bit in the caller's effective mask, so
long as the caller is in the root user namespace.  Tasks in non-root
user namespaces never have global capabilibies.  This is what capable()
checks.

Namespace capability: A capability over a specific user namespace.
Tasks in a user namespace have all the capabilities in their effective
mask over their user namespace.  A namespace capability generally
indicates that the capability applies to the user namespace itself and
to all non-user namespaces that live in the user namespace.  For
example, CAP_NET_ADMIN means that you can configure all networks
namespaces in the current user namespace.  This is what ns_capable()
checks.

Anything that requires a global capability will not work in a non-root
user namespace.

= unprivileged_bpf_disabled =

Nothing in here supercedes unprivileged_bpf_disabled.  If
unprivileged_bpf_disabled = 1, then these proposals should not allow anything
that is disallowed today.  The idea is to make unprivileged_bpf_disabled=0
both safer and more useful.

= Test runs =

Global CAP_SYS_ADMIN is needed to test-run a program.  Test-running a program
exposes its own attack surface.  It's also the only way to run a program at
all if you merely have permission to load the program but not to attach it
anywhere.  Some of the proposed changes below will make it possible to load
most program types without

= Access to programs and maps =

There are two basic security concerns when accessing programs and maps:
the attack surface against the kernel and the ability to access other
people's maps.

Unprivileged processes may read a map if they have an FMODE_READ descriptor
for the map.  Unprivileged processes may write a map if they have an
FMODE_WRITE descriptor to the map.  Unprivileged processes may open a
persistent map with a mode consistent with the permissions in bpffs.

Unprivileged processes may create a bpffs inode for an existing map
if the have an RW file descriptor for the map.  (This is a change to
current behavior.  Daniel, Alexei thought the current behavior was
intentional.  Do you recall whether this is the case?)

The _BY_ID map APIs inherently have no concept of ownership of maps.  These
APIs will continue to require global CAP_SYS_ADMIN.

The small number of things that currently require the _BY_ID APIs, e.g.,
reading maps of maps, can be addressed if needed with new APIs that
return fds instead of ids.  Otherwise using them will continue to require
global capabilities.

Unprivileged processes may create exactly the set of maps that they can
create today.  Future proposals may extend this by a variety of means;
this current proposal makes no changes.

= Program loading =

Loading a program carries the following risks:

 - It exposes the attack surface in the program verifier itself.  That is
   possible, although unlikely, that merely verifying a malicious program
   could crash or otherwise cause a kernel malfunction.

 - It exposes the attack surface of insufficient checks in the verifier.
   That is, a verifier bug could allow a malicious program that is dangerous
   when run.

 - It exposes all of the functions that the program type can call.
   Some functions, e.g. bpf_probe_read(), should require privilege to call.

 - It exposes resource attacks.  Currently, privileged users can load programs
   that use more resources than unprivileged users can load.

 - It exposes pointer-to-integer conversions.  This requires global
   capabilities.

 - The program could contain speculation attack gadgets.

 - Loading a program is a prerequisite to attaching the program.

I propose the following:

Flag functions that require privilege as such.  Loading a program that calls
such a function will require a global capability.  The privileged functions are
mainly used for tracing, I think, and kernel tracing should require global
capabilities.

Loading a program that uses privileged verifier features (function calls or
pointer-to-integer-conversions) will continue to require privileges.

Loading a function that uses excessive resources can continue to require
global capabilities or it could use a new set of cgroup settings that
adjust the bpf complexity limits.

Loading a function that bypasses the various speculation attack hardening
features (e.g. constant blinding) requires global capabilities.

Other than this, bpf program types can have a new flag set to allow
them to be loaded without any privileges.  Some bpf program types
may need additional care, e.g. perf bpf events.  They can be attached
without privilege even in current kernels, and this might need to change.

(optional) Add an API to load a program where the program source comes from a
file specified by id instead of in memory.  This would allow LSMs to require
that bpf() programs be appropriately labeled.  If LSMs require use of this
API, it will make it much harder to exploit the verifier or speculation bugs.

As a possible future extension, a way to selectively grant the ability to
use specific program types without privilege could be useful.  This
could be done with a cgroup option, for example.

= Cgroup attach =

Cgroups have their own hierarchy that does not necessarily follow the
namespace hierarchy.  Unless cgroups integrate with namespaces in ways
that they currently don't, namespace capabilites cannot be used to grant
permission to operate on cgroups.

I propose that attaching and detaching bpf programs to cgroups use a
permission model similar to the model for changing non-bpf cgroup
settings.  In particular, each bpf_attach_type will get a new file in a
cgroup directory.  So there will be
/sys/fs/cgroup/cgroup_name/bpf.inet_ingress, bpf.inet_egress, etc.

A new API will be added to bpf() to attach and detach programs.  The new
API will take an fd to the bpf.attach_type file instead of to the cgroup
directory.  It will require FMODE_WRITE.  This API will *not* require
any capability.

To prevent anyone with a delegated cgroup from automatically being
able to use all bpf program types, the new bpf.attach_type files
will be opt-in as part of the hierarchy.  This could be done by writing
"+bpf.*" or "+bpf.inet_ingress" to cgroup.subtree_control to make
all the bpf.attach_type files or just bpf.inet_ingress available
in descendents of the cgroup in question.  This could alternatively
be a new bpf.subtree_control file if that seems better.

The result of these changes will be that root can use the old
attach API or the new attach API.  Unprivileged programs cannot
use the old attach API.  Unprivileged programs can use the new
attach API if they are explicitly granted permission by all their
ancestor cgroup managers.


= Additional mitigations =

Optional: there may be cases where a user can load a bpf program
but can't attach or otherwise execute it.  Nonetheless, it's plausible
that such a program could be speculatively executed.  The kernel could
mitigate this by only marking a JITted bpf program executable when it
is first attached or test-run.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
  2019-08-22 14:17                                                         ` Daniel Borkmann
  2019-08-22 15:16                                                           ` Andy Lutomirski
@ 2019-08-22 22:48                                                           ` Alexei Starovoitov
  1 sibling, 0 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-22 22:48 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng

On Thu, Aug 22, 2019 at 04:17:43PM +0200, Daniel Borkmann wrote:
> 
> > > Hence unprivileged bpf is actually something that can be deprecated.
> 
> There is actually a publicly known use-case on unprivileged bpf wrt
> socket filters, see the SO_ATTACH_BPF on sockets section as an example:
> 
>   https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/
> 
> If I'd have to take a good guess, I'd think it's major use-case is in
> SO_ATTACH_REUSEPORT_EBPF in the wild, I don't think the sysctl can be
> outright flipped or deprecated w/o breaking existing applications unless
> it's cleanly modeled into some sort of customizable CAP_BPF* type policy
> (more below) where this would be the lowest common denominator.

The cloudflare use case is the perfect example that a lot of program types
are used together.
Do people use SO_ATTACH_BPF ? Of course.
All program types are used by somebody. Before accepting them we had long
conversations with authors to understand that the use cases are real.
Some progs are probably used less than others by now.
Like cls_bpf without exts_integrated is probably not used at all.
We still have to support it, of course.
That cloudflare example demonstrates that kernel.unprivileged_bpf_disabled=1
is a reality. Companies that care about security already switched it on.
Different bits in cloudflare setup need different level of capabilities.
Some (like SO_ATTACH_BPF) need unpriv. Another need CAP_NET_ADMIN.
But common demoninator for them all is still CAP_SYS_ADMIN.
And that's why the system as a whole is not as safe as it could have
been with CAP_BPF. The system needs root in many places.
Folks going out of the way to reduce that SYS_ADMIN to something less.
The example with systemd and NET_ADMIN is just one of them.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RFC: very rough draft of a bpf permission model
  2019-08-22 15:17                                                             ` RFC: very rough draft of a bpf permission model Andy Lutomirski
@ 2019-08-22 23:26                                                               ` Alexei Starovoitov
  2019-08-23 23:09                                                                 ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-22 23:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng

On Thu, Aug 22, 2019 at 08:17:54AM -0700, Andy Lutomirski wrote:
> BPF security strawman, v0.1
> 
> This is very rough.  Most of this, especially the API details, needs
> work before it's ready to implement.  The whole concept also needs
> review.
> 
> = Goals =
> 
> The overall goal is to make it possible to use eBPF without having
> what is effectively administrator access.  For example, an eBPF user
> should not be able to directly tamper with other processes (unless
> this permission is explicitly granted) and should not be able to
> read or write other users' eBPF maps.
> 
> It should be possible to use eBPF inside a user namespace without breaking
> the userns security model.
> 
> Due to the risk of speculation attacks and such being carried out via
> eBPF, it should not become possible to use too much of eBPF without the
> administrator's permission.  (NB: it is already possible to use
> *classic* BPF without any permission, and classic BPF is translated
> internally to eBPF, so this goal can only be met to a limited extent.)

agree with the goals.

> = Definitions =
> 
> Global capability: A capability bit in the caller's effective mask, so
> long as the caller is in the root user namespace.  Tasks in non-root
> user namespaces never have global capabilibies.  This is what capable()
> checks.
> 
> Namespace capability: A capability over a specific user namespace.
> Tasks in a user namespace have all the capabilities in their effective
> mask over their user namespace.  A namespace capability generally
> indicates that the capability applies to the user namespace itself and
> to all non-user namespaces that live in the user namespace.  For
> example, CAP_NET_ADMIN means that you can configure all networks
> namespaces in the current user namespace.  This is what ns_capable()
> checks.

definitions make sense too.

> Anything that requires a global capability will not work in a non-root
> user namespace.
> 
> = unprivileged_bpf_disabled =
> 
> Nothing in here supercedes unprivileged_bpf_disabled.  If
> unprivileged_bpf_disabled = 1, then these proposals should not allow anything
> that is disallowed today.  The idea is to make unprivileged_bpf_disabled=0
> both safer and more useful.

... a bunch of new features skipped for brevity...

You're proposing all of the above in addition to CAP_BPF, right?
Otherwise I don't see how it addresses the use cases I kept
explaining for the last few weeks.

I don't mind additional features if people who propose them
actively help to maintain that new code and address inevitable
side channel issues in the new code.
But first things first.

Here is another example of use case that CAP_BPF is solving:
The daemon X is started by pid=1 and currently runs as root.
It loads a bunch of tracing progs and attaches them to kprobes
and tracepoints. It also loads cgroup-bpf progs and attaches them
to cgroups. All progs are collecting data about the system and
logging it for further analysis.
There can be different bugs (not security bugs) in the daemon.
Simple coding bugs, but due to processing running as root they
may make the system inoperable. There is a strong desire to
drop privileges for this daemon. Let it do all BPF things the
way it does today and drop root, since other operations do not
require root.
Essentially a bunch of daemons run as root only because
they need bpf. This tracing bpf is looking into kernel memory
and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
The system is not going to crash because of BPF,
but it can easily crash because of simple coding bugs in the user
space bits of that daemon.

Flagging functions is not going to help this case.
bpf_probe_read is necessary.
pointer-to-integer-conversions is also necessary.
bypass hardening features is also necessary for speed,
since this data collection is 24/7.
cgroup.subtree_control idea can help some of it, but not all.

I still think that CAP_BPF is the best way to split this root privilege
universe into smaller 'bpf piece'. Just like CAP_NET_ADMIN splits
all of root into networking specific privileges.

Potentially we can go sysctl_perf_event_paranoid approach, but
it's less flexible, since it's single sysctl for the whole system.

Loading progs via FD instead of memory is something that android folks
proposed some time ago. The need is real. Whether it's going to be
loading via FD or some other form of signing the program is TBD.
imo this is orthogonal.

I hope I answered all points of your proposal.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RFC: very rough draft of a bpf permission model
  2019-08-22 23:26                                                               ` Alexei Starovoitov
@ 2019-08-23 23:09                                                                 ` Andy Lutomirski
  2019-08-26 22:36                                                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-23 23:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng

On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> You're proposing all of the above in addition to CAP_BPF, right?
> Otherwise I don't see how it addresses the use cases I kept
> explaining for the last few weeks.

None of my proposal is intended to exclude changes like CAP_BPF to
make privileged bpf() operations need less privilege.  But I think
it's very hard to evaluate CAP_BPF without both a full description of
exactly what CAP_BPF would do and what at least one full example of a
user would look like.

I also think that users who want CAP_BPF should look at manipulating
their effective capability set instead.  A daemon that wants to use
bpf() but otherwise minimize the chance of accidentally causing a
problem can use capset() to clear its effective and inheritable masks.
Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
effective set again.  This works in current kernels and is generally
good practice.

Aside from this, and depending on exactly what CAP_BPF would be, I
have some further concerns.  Looking at your example in this email:

> Here is another example of use case that CAP_BPF is solving:
> The daemon X is started by pid=1 and currently runs as root.
> It loads a bunch of tracing progs and attaches them to kprobes
> and tracepoints. It also loads cgroup-bpf progs and attaches them
> to cgroups. All progs are collecting data about the system and
> logging it for further analysis.

This needs more than just bpf().  Creating a perf kprobe event
requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
attach a bpf program.  And the privilege to attach bpf programs to
cgroups without any DAC or MAC checks (which is what the current API
does) is an extremely broad privilege that is not that much weaker
than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:

> This tracing bpf is looking into kernel memory
> and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
> The system is not going to crash because of BPF,
> but it can easily crash because of simple coding bugs in the user
> space bits of that daemon.

The BPF verifier and interpreter, taken in isolation, may be extremely
safe, but attaching BPF programs to various hooks can easily take down
the system, deliberately or by accident.  A handler, especially if it
can access user memory or otherwise fault, will explode if attached to
an inappropriate kprobe, hw_breakpoint, or function entry trace event.
(I and the other maintainers consider this to be a bug if it happens,
and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
that blocks all network traffic will effectively kill a machine,
especially if it's a server.  A bpf program that runs excessively
slowly attached to a high-frequency hook will kill the system, too.
(I bet a buggy bpf program that calls bpf_probe_read() on an unmapped
address repeatedly could be make extremely slow.  Page faults take
thousands to tens of thousands of cycles.)  A bpf firewall rule that's
wrong can cut a machine off from the network -- I've killed machines
using iptables more than once, and bpf isn't magically safer.

Something finer-grained can mitigate some of this.  CAP_BPF as I think
you're imagining it will not.

I'm wondering if something like CAP_TRACING would make sense.
CAP_TRACING would allow operations that can reveal kernel memory and
other secret kernel state but that do not, by design, allow modifying
system behavior.  So, for example, CAP_TRACING would allow privileged
perf_event_open() operations and privileged bpf verifier usage.  But
it would not allow cgroup-bpf unless further restrictions were added,
and it would not allow the *_BY_ID operations, as those can modify
other users' bpf programs' behavior.

(To get CAP_TRACING to work with cgroup-bpf, there could be a flag to
attach a "tracing" bpf program to a cgroup.  This program would run in
addition to normal or MULTI programs, but it would not be allowed to
return a rejection result.)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RFC: very rough draft of a bpf permission model
  2019-08-23 23:09                                                                 ` Andy Lutomirski
@ 2019-08-26 22:36                                                                   ` Alexei Starovoitov
  2019-08-27  0:05                                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-26 22:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng

On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > You're proposing all of the above in addition to CAP_BPF, right?
> > Otherwise I don't see how it addresses the use cases I kept
> > explaining for the last few weeks.
> 
> None of my proposal is intended to exclude changes like CAP_BPF to
> make privileged bpf() operations need less privilege.  But I think
> it's very hard to evaluate CAP_BPF without both a full description of
> exactly what CAP_BPF would do and what at least one full example of a
> user would look like.

the example is previous email and systemd example was not "full" ?

> I also think that users who want CAP_BPF should look at manipulating
> their effective capability set instead.  A daemon that wants to use
> bpf() but otherwise minimize the chance of accidentally causing a
> problem can use capset() to clear its effective and inheritable masks.
> Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
> or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
> effective set again.  This works in current kernels and is generally
> good practice.

Such logic means that CAP_NET_ADMIN is not necessary either.
The process could re-add CAP_SYS_ADMIN when it needs to reconfigure
network and then drop it.

> Aside from this, and depending on exactly what CAP_BPF would be, I
> have some further concerns.  Looking at your example in this email:
> 
> > Here is another example of use case that CAP_BPF is solving:
> > The daemon X is started by pid=1 and currently runs as root.
> > It loads a bunch of tracing progs and attaches them to kprobes
> > and tracepoints. It also loads cgroup-bpf progs and attaches them
> > to cgroups. All progs are collecting data about the system and
> > logging it for further analysis.
> 
> This needs more than just bpf().  Creating a perf kprobe event
> requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
> attach a bpf program.  

that is already solved sysctl_perf_event_paranoid.
CAP_BPF is about BPF part only.

> And the privilege to attach bpf programs to
> cgroups without any DAC or MAC checks (which is what the current API
> does) is an extremely broad privilege that is not that much weaker
> than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:

I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN
vs CAP_BPF.
CAP_BPF and CAP_NET_ADMIN carve different areas of CAP_SYS_ADMIN.
Just like all other caps.

> > This tracing bpf is looking into kernel memory
> > and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
> > The system is not going to crash because of BPF,
> > but it can easily crash because of simple coding bugs in the user
> > space bits of that daemon.
> 
> The BPF verifier and interpreter, taken in isolation, may be extremely
> safe, but attaching BPF programs to various hooks can easily take down
> the system, deliberately or by accident.  A handler, especially if it
> can access user memory or otherwise fault, will explode if attached to
> an inappropriate kprobe, hw_breakpoint, or function entry trace event.

absolutely not true.

> (I and the other maintainers consider this to be a bug if it happens,
> and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
> that blocks all network traffic will effectively kill a machine,
> especially if it's a server. 

this permission is granted by CAP_NET_ADMIN. Nothing changes here.

> A bpf program that runs excessively
> slowly attached to a high-frequency hook will kill the system, too.

not true either.

> (I bet a buggy bpf program that calls bpf_probe_read() on an unmapped
> address repeatedly could be make extremely slow.  Page faults take
> thousands to tens of thousands of cycles.) 

kprobe probing and faulting on non-existent address will do
the same 'damage'. So it's not bpf related.
Also it won't make the system "extremely slow".
Nothing to do with CAP_BPF.

> A bpf firewall rule that's
> wrong can cut a machine off from the network -- I've killed machines
> using iptables more than once, and bpf isn't magically safer.

this is CAP_NET_ADMIN permission. It's a different capability.

> 
> I'm wondering if something like CAP_TRACING would make sense.
> CAP_TRACING would allow operations that can reveal kernel memory and
> other secret kernel state but that do not, by design, allow modifying
> system behavior.  So, for example, CAP_TRACING would allow privileged
> perf_event_open() operations and privileged bpf verifier usage.  But
> it would not allow cgroup-bpf unless further restrictions were added,
> and it would not allow the *_BY_ID operations, as those can modify
> other users' bpf programs' behavior.

Makes little sense to me.
I can imagine CAP_TRACING controlling kprobe/uprobe creation
and probe_read() both from bpf side and from vanilla kprobe.
That would be much nicer interface to use than existing
sysctl_perf_event_paranoid, but that is orthogonal to CAP_BPF
which is strictly about BPF.

> Something finer-grained can mitigate some of this.  CAP_BPF as I think
> you're imagining it will not.

I'm afraid this discussion goes nowhere.
We'll post CAP_BPF patches soon so we can discuss code.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RFC: very rough draft of a bpf permission model
  2019-08-26 22:36                                                                   ` Alexei Starovoitov
@ 2019-08-27  0:05                                                                     ` Andy Lutomirski
  2019-08-27  0:34                                                                       ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Andy Lutomirski @ 2019-08-27  0:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng

> On Aug 26, 2019, at 3:36 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> You're proposing all of the above in addition to CAP_BPF, right?
>>> Otherwise I don't see how it addresses the use cases I kept
>>> explaining for the last few weeks.
>>
>> None of my proposal is intended to exclude changes like CAP_BPF to
>> make privileged bpf() operations need less privilege.  But I think
>> it's very hard to evaluate CAP_BPF without both a full description of
>> exactly what CAP_BPF would do and what at least one full example of a
>> user would look like.
>
> the example is previous email and systemd example was not "full" ?

Can you give an example of how a real user would want to configure
their system such that a non-root systemd instance has capabilities,
sets up a BPF firewall, and does something useful with it?  You
mentioned systemd, multiple people pointed out that, on a normal
system, systemd —user has no capabilities. That was the end of the
discussion.

A full example is one where peoples’ confusion as to what the example
is gets answered.

>
>> I also think that users who want CAP_BPF should look at manipulating
>> their effective capability set instead.  A daemon that wants to use
>> bpf() but otherwise minimize the chance of accidentally causing a
>> problem can use capset() to clear its effective and inheritable masks.
>> Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
>> or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
>> effective set again.  This works in current kernels and is generally
>> good practice.
>
> Such logic means that CAP_NET_ADMIN is not necessary either.
> The process could re-add CAP_SYS_ADMIN when it needs to reconfigure
> network and then drop it.

This isn't really true. By giving a process CAP_NET_ADMIN and not
CAP_SYS_ADMIN, that process can configure the network but can’t load
kernel modules or reconfigure the machine deliberately or by accident.

But that's besides the point.  Can you give an example where this
approach doesn't help and CAP_BPF does?


>
>> Aside from this, and depending on exactly what CAP_BPF would be, I
>> have some further concerns.  Looking at your example in this email:
>>
>>> Here is another example of use case that CAP_BPF is solving:
>>> The daemon X is started by pid=1 and currently runs as root.
>>> It loads a bunch of tracing progs and attaches them to kprobes
>>> and tracepoints. It also loads cgroup-bpf progs and attaches them
>>> to cgroups. All progs are collecting data about the system and
>>> logging it for further analysis.
>>
>> This needs more than just bpf().  Creating a perf kprobe event
>> requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
>> attach a bpf program.
>
> that is already solved sysctl_perf_event_paranoid.
> CAP_BPF is about BPF part only.

Hence my point: I'd like to see a real example where CAP_BPF helps.
perf_event_paranoid does not appear to grant the ability to add
kprobes.  With perf_event_paranoid set to -1:

$ perf probe --add vfs_mknod
Failed to open kprobe_events: Permission denied
  Error: Failed to add events.
$ sudo perf probe --add vfs_mknod
Added new event:
  probe:vfs_mknod      (on vfs_mknod)

I suppose I could modify permissions on debugfs and set
perf_event_paranoid=-1, but at that point the overall security of the
system is so weak that talking about refining the bpf part seems
pointless.

>
>> And the privilege to attach bpf programs to
>> cgroups without any DAC or MAC checks (which is what the current API
>> does) is an extremely broad privilege that is not that much weaker
>> than CAP_SYS_ADMIN or CAP_NET_ADMIN.  Also:
>
> I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN
> vs CAP_BPF.
> CAP_BPF and CAP_NET_ADMIN carve different areas of CAP_SYS_ADMIN.
> Just like all other caps.

The whole set of capabilities on Linux us a bit of a mess.  Their
features are mostly disjoint but, on a normal Linux machine, many of
the capabilities can be used to become root with full capabilities.

>
>>> This tracing bpf is looking into kernel memory
>>> and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
>>> The system is not going to crash because of BPF,
>>> but it can easily crash because of simple coding bugs in the user
>>> space bits of that daemon.
>>
>> The BPF verifier and interpreter, taken in isolation, may be extremely
>> safe, but attaching BPF programs to various hooks can easily take down
>> the system, deliberately or by accident.  A handler, especially if it
>> can access user memory or otherwise fault, will explode if attached to
>> an inappropriate kprobe, hw_breakpoint, or function entry trace event.
>
> absolutely not true.

This is not a constructive way to have a conversation.  When you get
an email that contains a statement you disagree with, perhaps you
could try to give some argument as to why you disagree rather than
just saying "absolutely not true".  Especially when you are talking to
one of the maintainers of the affected system who has a
not-yet-finished branch that addresses some of the bugs that you claim
absolutely don't exist.  If it's really truly necessary, I can go and
write an example that will crash an x86 kernel, but I feel like it
would be a waste of everyone's time.

Right now, on all kernels, an hw_breakpoint on memory used in a
non-recursion-safe part of any of the x86 IST entry handlers will
corrupt the kernel no later than when the hw_breakpoint handler
returns.  It does not matter in the slightest what the BPF payload is.
The payload doesn't even have to be BPF for this to blow up.

Similarly, until very, very recently, a handler that pagefaulted (due
to generating a stack trace or failing a bpf_probe_read() in the
trace_hardirqs_on path would crash the system due to corrupting cr2 in
the x86 entry code.  PeterZ just fixed this bug recently.  I believe
that there are similar bugs relating to DR6, but they probably don't
kill the system as easily.  I wouldn't rule out a full system crash,
though.  Again, a not-really-done fix for this is part-way done in my
tree.

How confident are you that a BPF program that calls bpf_probe_read()
the maximum allowable number of times on the address
0xffffffffffffffff attached to, say, an network interrupt probe will
actually leave the system in a usable state?  Maybe it will, but I'd
be a bit surprised.

How confident are you that the BPF program that calls bpf_probe_read()
on an MMIO address has well-defined semantics?  How confident are you
that the system will still work if such a program runs?

>
>> (I and the other maintainers consider this to be a bug if it happens,
>> and we'll fix it, but these bugs definitely exist.)  A cgroup-bpf hook
>> that blocks all network traffic will effectively kill a machine,
>> especially if it's a server.
>
> this permission is granted by CAP_NET_ADMIN. Nothing changes here.
>
>> A bpf program that runs excessively
>> slowly attached to a high-frequency hook will kill the system, too.
>
> not true either.

What prevents this from happening?  Is there a specific mitigation in place?

My point here is that the bpf is 'safe' in isolation, but that bpf
tracing is only somewhat 'safe'.

>> A bpf firewall rule that's
>> wrong can cut a machine off from the network -- I've killed machines
>> using iptables more than once, and bpf isn't magically safer.
>
> this is CAP_NET_ADMIN permission. It's a different capability.

Since you haven't fully defined what CAP_BPF would do, I can only
assume that you intend for CAP_BPF to enable installation of a BPF
inet_ingress hook on the root cgroup.  A BPF program that rejects
everything will block all traffic.

>
>>
>> I'm wondering if something like CAP_TRACING would make sense.
>> CAP_TRACING would allow operations that can reveal kernel memory and
>> other secret kernel state but that do not, by design, allow modifying
>> system behavior.  So, for example, CAP_TRACING would allow privileged
>> perf_event_open() operations and privileged bpf verifier usage.  But
>> it would not allow cgroup-bpf unless further restrictions were added,
>> and it would not allow the *_BY_ID operations, as those can modify
>> other users' bpf programs' behavior.
>
> Makes little sense to me.
> I can imagine CAP_TRACING controlling kprobe/uprobe creation
> and probe_read() both from bpf side and from vanilla kprobe.
> That would be much nicer interface to use than existing
> sysctl_perf_event_paranoid, but that is orthogonal to CAP_BPF
> which is strictly about BPF.

I'm suggesting that CAP_TRACING would also enable most of all of the
things in the verifier that are currently CAP_SYS_ADMIN and would
enable loading and attaching BPF programs to perf events.  So it's not
orthogonal.

You're welcome to post CAP_BPF patches, but perhaps you could also
comment on CAP_TRACING and capset?

--Andy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RFC: very rough draft of a bpf permission model
  2019-08-27  0:05                                                                     ` Andy Lutomirski
@ 2019-08-27  0:34                                                                       ` Alexei Starovoitov
  0 siblings, 0 replies; 61+ messages in thread
From: Alexei Starovoitov @ 2019-08-27  0:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn,
	Greg KH, Linux API, LSM List, Chenbo Feng

On Mon, Aug 26, 2019 at 05:05:58PM -0700, Andy Lutomirski wrote:
> >>
> >> The BPF verifier and interpreter, taken in isolation, may be extremely
> >> safe, but attaching BPF programs to various hooks can easily take down
> >> the system, deliberately or by accident.  A handler, especially if it
> >> can access user memory or otherwise fault, will explode if attached to
> >> an inappropriate kprobe, hw_breakpoint, or function entry trace event.
> >
> > absolutely not true.
> 
> This is not a constructive way to have a conversation.  When you get
> an email that contains a statement you disagree with, perhaps you
> could try to give some argument as to why you disagree rather than
> just saying "absolutely not true".  Especially when you are talking to
> one of the maintainers of the affected system who has a
> not-yet-finished branch that addresses some of the bugs that you claim
> absolutely don't exist.  If it's really truly necessary, I can go and
> write an example that will crash an x86 kernel, but I feel like it
> would be a waste of everyone's time.

Please do write an example and prove your earlier sensational statement
that "can _easily_ take down the system" by attaching bpf to kprobe.

Most of the functions where kprobes are not allowed are already
marked by 'nokprobe'. All of them marked? Probably not.
There could be places where kprobe will blow the system, but
1. it's not easy to do. unlike your claim
2. that issue has nothing to do with bpf safety guarantees.

> How confident are you that the BPF program that calls bpf_probe_read()
> on an MMIO address has well-defined semantics?  How confident are you
> that the system will still work if such a program runs?

bpf_probe_read is a wrapper of probe_read. Nothing more.
I'm confident that probe_read maintainers are doing good job.

All of the bpf tracing is relying on existing kernel mechanisms
like kprobe, uprobe, perf, probe_read, etc.
bpf verifier cannot make them safer.
If reading mmio via bpf_probe_read will trigger undesired
hw behavior there is nothing bpf verifier can do about it.


^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, back to index

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190627201923.2589391-1-songliubraving@fb.com>
     [not found] ` <20190627201923.2589391-2-songliubraving@fb.com>
     [not found]   ` <21894f45-70d8-dfca-8c02-044f776c5e05@kernel.org>
     [not found]     ` <3C595328-3ABE-4421-9772-8D41094A4F57@fb.com>
     [not found]       ` <CALCETrWBnH4Q43POU8cQ7YMjb9LioK28FDEQf7aHZbdf1eBZWg@mail.gmail.com>
     [not found]         ` <0DE7F23E-9CD2-4F03-82B5-835506B59056@fb.com>
     [not found]           ` <CALCETrWBWbNFJvsTCeUchu3BZJ3SH3dvtXLUB2EhnPrzFfsLNA@mail.gmail.com>
     [not found]             ` <201907021115.DCD56BBABB@keescook>
     [not found]               ` <CALCETrXTta26CTtEDnzvtd03-WOGdXcnsAogP8JjLkcj4-mHvg@mail.gmail.com>
     [not found]                 ` <4A7A225A-6C23-4C0F-9A95-7C6C56B281ED@fb.com>
     [not found]                   ` <CALCETrX2bMnwC6_t4b_G-hzJSfMPrkK4YKs5ebcecv2LJ0rt3w@mail.gmail.com>
     [not found]                     ` <514D5453-0AEE-420F-AEB6-3F4F58C62E7E@fb.com>
     [not found]                       ` <1DE886F3-3982-45DE-B545-67AD6A4871AB@amacapital.net>
     [not found]                         ` <7F51F8B8-CF4C-4D82-AAE1-F0F28951DB7F@fb.com>
     [not found]                           ` <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>
     [not found]                             ` <369476A8-4CE1-43DA-9239-06437C0384C7@fb.com>
2019-07-30 20:24                               ` [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf Andy Lutomirski
2019-07-31  8:10                                 ` Song Liu
2019-07-31 19:09                                   ` Andy Lutomirski
2019-08-02  7:21                                     ` Song Liu
2019-08-04 22:16                                       ` Andy Lutomirski
2019-08-05  0:08                                         ` Andy Lutomirski
2019-08-05  5:47                                           ` Andy Lutomirski
2019-08-05  7:36                                             ` Song Liu
2019-08-05 17:23                                               ` Andy Lutomirski
2019-08-05 19:21                                                 ` Alexei Starovoitov
2019-08-05 21:25                                                   ` Andy Lutomirski
2019-08-05 22:21                                                     ` Andy Lutomirski
2019-08-06  1:11                                                     ` Alexei Starovoitov
2019-08-07  5:24                                                       ` Andy Lutomirski
2019-08-07  9:03                                                         ` Lorenz Bauer
2019-08-07 13:52                                                           ` Andy Lutomirski
2019-08-13 21:58                                                         ` Alexei Starovoitov
2019-08-13 22:26                                                           ` Daniel Colascione
2019-08-13 23:24                                                             ` Andy Lutomirski
2019-08-13 23:06                                                           ` Andy Lutomirski
2019-08-14  0:57                                                             ` Alexei Starovoitov
2019-08-14 17:51                                                               ` Andy Lutomirski
2019-08-14 22:05                                                                 ` Alexei Starovoitov
2019-08-14 22:30                                                                   ` Andy Lutomirski
2019-08-14 23:33                                                                     ` Alexei Starovoitov
2019-08-14 23:59                                                                       ` Andy Lutomirski
2019-08-15  0:36                                                                         ` Alexei Starovoitov
2019-08-15 11:24                                                                   ` Jordan Glover
2019-08-15 17:28                                                                     ` Alexei Starovoitov
2019-08-15 18:36                                                                       ` Andy Lutomirski
2019-08-15 23:08                                                                         ` Alexei Starovoitov
2019-08-16  9:34                                                                           ` Jordan Glover
2019-08-16  9:59                                                                             ` Thomas Gleixner
2019-08-16 11:33                                                                               ` Jordan Glover
2019-08-16 19:52                                                                                 ` Alexei Starovoitov
2019-08-16 20:28                                                                                   ` Thomas Gleixner
2019-08-17 15:02                                                                                     ` Alexei Starovoitov
2019-08-17 15:44                                                                                       ` Andy Lutomirski
2019-08-19  9:15                                                                                       ` Thomas Gleixner
2019-08-19 17:27                                                                                         ` Alexei Starovoitov
2019-08-19 17:38                                                                                           ` Andy Lutomirski
2019-08-15 18:43                                                                       ` Jordan Glover
2019-08-15 19:46                                                           ` Kees Cook
2019-08-15 23:46                                                             ` Alexei Starovoitov
2019-08-16  0:54                                                               ` Andy Lutomirski
2019-08-16  5:56                                                                 ` Song Liu
2019-08-16 21:45                                                                 ` Alexei Starovoitov
2019-08-16 22:22                                                                   ` Christian Brauner
2019-08-17 15:08                                                                     ` Alexei Starovoitov
2019-08-17 15:16                                                                       ` Christian Brauner
2019-08-17 15:36                                                                         ` Alexei Starovoitov
2019-08-17 15:42                                                                           ` Christian Brauner
2019-08-22 14:17                                                         ` Daniel Borkmann
2019-08-22 15:16                                                           ` Andy Lutomirski
2019-08-22 15:17                                                             ` RFC: very rough draft of a bpf permission model Andy Lutomirski
2019-08-22 23:26                                                               ` Alexei Starovoitov
2019-08-23 23:09                                                                 ` Andy Lutomirski
2019-08-26 22:36                                                                   ` Alexei Starovoitov
2019-08-27  0:05                                                                     ` Andy Lutomirski
2019-08-27  0:34                                                                       ` Alexei Starovoitov
2019-08-22 22:48                                                           ` [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf Alexei Starovoitov

Linux-Security-Module Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-security-module/0 linux-security-module/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-security-module linux-security-module/ https://lore.kernel.org/linux-security-module \
		linux-security-module@vger.kernel.org linux-security-module@archiver.kernel.org
	public-inbox-index linux-security-module

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-security-module


AGPL code for this site: git clone https://public-inbox.org/ public-inbox