All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/3] eBPF Seccomp filters
@ 2018-02-13 15:42 Sargun Dhillon
  2018-02-13 15:47 ` Kees Cook
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 15:42 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, containers, keescook, luto, wad

This patchset enables seccomp filters to be written in eBPF. Although,
this patchset doesn't introduce much of the functionality enabled by
eBPF, it lays the ground work for it.

It also introduces the capability to dump eBPF filters via the PTRACE
API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
In the attached samples, there's an example of this. One can then use
BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
and use that at reload time.

The primary reason for not adding maps support in this patchset is
to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
If we have a map that the BPF program can read, it can potentially
"change" privileges after running. It seems like doing writes only
is safe, because it can be pure, and side effect free, and therefore
not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
to an agreement, this can be in a follow-up patchset.


Sargun Dhillon (3):
  bpf, seccomp: Add eBPF filter capabilities
  seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
    filters
  bpf: Add eBPF seccomp sample programs

 arch/Kconfig                 |   7 ++
 include/linux/bpf_types.h    |   3 +
 include/linux/seccomp.h      |  12 +++
 include/uapi/linux/bpf.h     |   2 +
 include/uapi/linux/ptrace.h  |   5 +-
 include/uapi/linux/seccomp.h |  15 ++--
 kernel/bpf/syscall.c         |   1 +
 kernel/ptrace.c              |   3 +
 kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
 samples/bpf/Makefile         |   9 +++
 samples/bpf/bpf_load.c       |   9 ++-
 samples/bpf/seccomp1_kern.c  |  17 ++++
 samples/bpf/seccomp1_user.c  |  34 ++++++++
 samples/bpf/seccomp2_kern.c  |  24 ++++++
 samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
 15 files changed, 362 insertions(+), 30 deletions(-)
 create mode 100644 samples/bpf/seccomp1_kern.c
 create mode 100644 samples/bpf/seccomp1_user.c
 create mode 100644 samples/bpf/seccomp2_kern.c
 create mode 100644 samples/bpf/seccomp2_user.c

-- 
2.14.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found] ` <20180213154244.GA3292-du9IEJ8oIxHXYT48pCVpJ3c7ZZ+wIVaZYkHkVr5ML8kVGlcevz2xqA@public.gmane.org>
@ 2018-02-13 15:47   ` Kees Cook
  2018-02-14  0:47   ` Mickaël Salaün
  1 sibling, 0 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 15:47 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski

On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
>
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
>
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.

What's the reason for adding eBPF support? seccomp shouldn't need it,
and it only makes the code more complex. I'd rather stick with cBPF
until we have an overwhelmingly good reason to use eBPF as a "native"
seccomp filter language.

-Kees

>
>
> Sargun Dhillon (3):
>   bpf, seccomp: Add eBPF filter capabilities
>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>     filters
>   bpf: Add eBPF seccomp sample programs
>
>  arch/Kconfig                 |   7 ++
>  include/linux/bpf_types.h    |   3 +
>  include/linux/seccomp.h      |  12 +++
>  include/uapi/linux/bpf.h     |   2 +
>  include/uapi/linux/ptrace.h  |   5 +-
>  include/uapi/linux/seccomp.h |  15 ++--
>  kernel/bpf/syscall.c         |   1 +
>  kernel/ptrace.c              |   3 +
>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>  samples/bpf/Makefile         |   9 +++
>  samples/bpf/bpf_load.c       |   9 ++-
>  samples/bpf/seccomp1_kern.c  |  17 ++++
>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>  15 files changed, 362 insertions(+), 30 deletions(-)
>  create mode 100644 samples/bpf/seccomp1_kern.c
>  create mode 100644 samples/bpf/seccomp1_user.c
>  create mode 100644 samples/bpf/seccomp2_kern.c
>  create mode 100644 samples/bpf/seccomp2_user.c
>
> --
> 2.14.1
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 15:42 [PATCH net-next 0/3] eBPF Seccomp filters Sargun Dhillon
@ 2018-02-13 15:47 ` Kees Cook
  2018-02-13 16:29   ` Sargun Dhillon
                     ` (2 more replies)
  2018-02-14  0:47 ` Mickaël Salaün
       [not found] ` <20180213154244.GA3292-du9IEJ8oIxHXYT48pCVpJ3c7ZZ+wIVaZYkHkVr5ML8kVGlcevz2xqA@public.gmane.org>
  2 siblings, 3 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 15:47 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Network Development, Alexei Starovoitov, Daniel Borkmann,
	Linux Containers, Andy Lutomirski, Will Drewry

On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
>
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
>
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.

What's the reason for adding eBPF support? seccomp shouldn't need it,
and it only makes the code more complex. I'd rather stick with cBPF
until we have an overwhelmingly good reason to use eBPF as a "native"
seccomp filter language.

-Kees

>
>
> Sargun Dhillon (3):
>   bpf, seccomp: Add eBPF filter capabilities
>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>     filters
>   bpf: Add eBPF seccomp sample programs
>
>  arch/Kconfig                 |   7 ++
>  include/linux/bpf_types.h    |   3 +
>  include/linux/seccomp.h      |  12 +++
>  include/uapi/linux/bpf.h     |   2 +
>  include/uapi/linux/ptrace.h  |   5 +-
>  include/uapi/linux/seccomp.h |  15 ++--
>  kernel/bpf/syscall.c         |   1 +
>  kernel/ptrace.c              |   3 +
>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>  samples/bpf/Makefile         |   9 +++
>  samples/bpf/bpf_load.c       |   9 ++-
>  samples/bpf/seccomp1_kern.c  |  17 ++++
>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>  15 files changed, 362 insertions(+), 30 deletions(-)
>  create mode 100644 samples/bpf/seccomp1_kern.c
>  create mode 100644 samples/bpf/seccomp1_user.c
>  create mode 100644 samples/bpf/seccomp2_kern.c
>  create mode 100644 samples/bpf/seccomp2_user.c
>
> --
> 2.14.1
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]   ` <CAGXu5jLiYh0rSRuJ_-2xLB03Wod5G07njpoESR4SnmsmiUnsEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 16:29     ` Sargun Dhillon
  2018-02-14 17:25     ` Andy Lutomirski
  1 sibling, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 16:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski

On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with  -- cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>
> -Kees
>
Three reasons:
1) The userspace tooling for eBPF is much better than the user space
tooling for cBPF. Our use case is specifically to optimize Docker
policies. This is roughly what their seccomp policy looks like:
https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
It would be much nicer to be able to leverage eBPF to write this in C,
or any other the other languages targetting eBPF. In addition, if we
have write-only maps, we can exfiltrate information from seccomp, like
arguments, and errors in a relatively cheap way compared to cBPF, and
then extract this via the bcc stack. Writing cBPF via C macros is a
pain, and the off the shelf cBPF libraries are getting no love. The
eBPF community is *exploding* with contributions.

2) In my testing, which thus so far has been very rudimentary, with
rewriting the policy that libseccomp generates from the Docker policy
to use eBPF, and eBPF maps performs much better than cBPF. The
specific case tested was to use a bpf array to lookup rules for a
particular syscall. In a super trivial test, this was about 5% low
latency than using traditional branches. If you need more evidence of
this, I can work a little bit more on the maps related patches, and
see if I can get some more benchmarking. From my understanding, we
would need to add "sealing" support for maps, in which they can be
marked as read-only, and only at that point should an eBPF seccomp
program be able to read from them.

3) Eventually, I'd like to use some more advanced capabilities of
eBPF, like being able to rewrite arguments safely (not things referred
to by pointers, but just plain old arguments).

>>
>>
>> Sargun Dhillon (3):
>>   bpf, seccomp: Add eBPF filter capabilities
>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>     filters
>>   bpf: Add eBPF seccomp sample programs
>>
>>  arch/Kconfig                 |   7 ++
>>  include/linux/bpf_types.h    |   3 +
>>  include/linux/seccomp.h      |  12 +++
>>  include/uapi/linux/bpf.h     |   2 +
>>  include/uapi/linux/ptrace.h  |   5 +-
>>  include/uapi/linux/seccomp.h |  15 ++--
>>  kernel/bpf/syscall.c         |   1 +
>>  kernel/ptrace.c              |   3 +
>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>  samples/bpf/Makefile         |   9 +++
>>  samples/bpf/bpf_load.c       |   9 ++-
>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>  create mode 100644 samples/bpf/seccomp1_user.c
>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>  create mode 100644 samples/bpf/seccomp2_user.c
>>
>> --
>> 2.14.1
>>
>
>
>
> --
> Kees Cook
> Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 15:47 ` Kees Cook
@ 2018-02-13 16:29   ` Sargun Dhillon
       [not found]     ` <CAMp4zn8VNurTjmrUtHnaK21A4hUQQz5tnarj15vmTU+TjY79XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-13 17:02     ` Jessie Frazelle
  2018-02-14 17:25   ` Andy Lutomirski
       [not found]   ` <CAGXu5jLiYh0rSRuJ_-2xLB03Wod5G07njpoESR4SnmsmiUnsEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 16:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Network Development, Alexei Starovoitov, Daniel Borkmann,
	Linux Containers, Andy Lutomirski, Will Drewry

On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with  -- cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>
> -Kees
>
Three reasons:
1) The userspace tooling for eBPF is much better than the user space
tooling for cBPF. Our use case is specifically to optimize Docker
policies. This is roughly what their seccomp policy looks like:
https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
It would be much nicer to be able to leverage eBPF to write this in C,
or any other the other languages targetting eBPF. In addition, if we
have write-only maps, we can exfiltrate information from seccomp, like
arguments, and errors in a relatively cheap way compared to cBPF, and
then extract this via the bcc stack. Writing cBPF via C macros is a
pain, and the off the shelf cBPF libraries are getting no love. The
eBPF community is *exploding* with contributions.

2) In my testing, which thus so far has been very rudimentary, with
rewriting the policy that libseccomp generates from the Docker policy
to use eBPF, and eBPF maps performs much better than cBPF. The
specific case tested was to use a bpf array to lookup rules for a
particular syscall. In a super trivial test, this was about 5% low
latency than using traditional branches. If you need more evidence of
this, I can work a little bit more on the maps related patches, and
see if I can get some more benchmarking. From my understanding, we
would need to add "sealing" support for maps, in which they can be
marked as read-only, and only at that point should an eBPF seccomp
program be able to read from them.

3) Eventually, I'd like to use some more advanced capabilities of
eBPF, like being able to rewrite arguments safely (not things referred
to by pointers, but just plain old arguments).

>>
>>
>> Sargun Dhillon (3):
>>   bpf, seccomp: Add eBPF filter capabilities
>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>     filters
>>   bpf: Add eBPF seccomp sample programs
>>
>>  arch/Kconfig                 |   7 ++
>>  include/linux/bpf_types.h    |   3 +
>>  include/linux/seccomp.h      |  12 +++
>>  include/uapi/linux/bpf.h     |   2 +
>>  include/uapi/linux/ptrace.h  |   5 +-
>>  include/uapi/linux/seccomp.h |  15 ++--
>>  kernel/bpf/syscall.c         |   1 +
>>  kernel/ptrace.c              |   3 +
>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>  samples/bpf/Makefile         |   9 +++
>>  samples/bpf/bpf_load.c       |   9 ++-
>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>  create mode 100644 samples/bpf/seccomp1_user.c
>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>  create mode 100644 samples/bpf/seccomp2_user.c
>>
>> --
>> 2.14.1
>>
>
>
>
> --
> Kees Cook
> Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]     ` <CAMp4zn8VNurTjmrUtHnaK21A4hUQQz5tnarj15vmTU+TjY79XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 17:02       ` Jessie Frazelle
  0 siblings, 0 replies; 34+ messages in thread
From: Jessie Frazelle @ 2018-02-13 17:02 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Will Drewry, Kees Cook, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski

On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>> this patchset doesn't introduce much of the functionality enabled by
>>> eBPF, it lays the ground work for it.
>>>
>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>> In the attached samples, there's an example of this. One can then use
>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>> and use that at reload time.
>>>
>>> The primary reason for not adding maps support in this patchset is
>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>> If we have a map that the BPF program can read, it can potentially
>>> "change" privileges after running. It seems like doing writes only
>>> is safe, because it can be pure, and side effect free, and therefore
>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>> to an agreement, this can be in a follow-up patchset.
>>
>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>> until we have an overwhelmingly good reason to use eBPF as a "native"
>> seccomp filter language.
>>
>> -Kees
>>
> Three reasons:
> 1) The userspace tooling for eBPF is much better than the user space
> tooling for cBPF. Our use case is specifically to optimize Docker
> policies. This is roughly what their seccomp policy looks like:
> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
> It would be much nicer to be able to leverage eBPF to write this in C,
> or any other the other languages targetting eBPF. In addition, if we
> have write-only maps, we can exfiltrate information from seccomp, like
> arguments, and errors in a relatively cheap way compared to cBPF, and
> then extract this via the bcc stack. Writing cBPF via C macros is a
> pain, and the off the shelf cBPF libraries are getting no love. The
> eBPF community is *exploding* with contributions.

Is stage two of this getting runc to support eBPF and docker to change
the default to be written as eBPF, because I foresee that being a
problem mainly with the kernel versions people use. The point of that
patch was to help the most people and as your point in (2) is made
about performance, that is a trade-off I would be willing to make in
order to have this functionality on more kernel versions.

The other alternative would be to have docker translate to use eBPF if
the kernel supported it, but that amount of complexity seems a bit
unnecessary for a feature that was trying to also be "simple".

Or do you plan on wrapping filters onto processes tangentially from
the runtime, in which case, that should be totally fine :)

Anyways this is kinda a tangent from the main point of getting it in
the kernel, just I would hate to see someone having to maintain this
without there being a path to getting it upstream elsewhere.

>
> 2) In my testing, which thus so far has been very rudimentary, with
> rewriting the policy that libseccomp generates from the Docker policy
> to use eBPF, and eBPF maps performs much better than cBPF. The
> specific case tested was to use a bpf array to lookup rules for a
> particular syscall. In a super trivial test, this was about 5% low
> latency than using traditional branches. If you need more evidence of
> this, I can work a little bit more on the maps related patches, and
> see if I can get some more benchmarking. From my understanding, we
> would need to add "sealing" support for maps, in which they can be
> marked as read-only, and only at that point should an eBPF seccomp
> program be able to read from them.
>
> 3) Eventually, I'd like to use some more advanced capabilities of
> eBPF, like being able to rewrite arguments safely (not things referred
> to by pointers, but just plain old arguments).
>
>>>
>>>
>>> Sargun Dhillon (3):
>>>   bpf, seccomp: Add eBPF filter capabilities
>>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>>     filters
>>>   bpf: Add eBPF seccomp sample programs
>>>
>>>  arch/Kconfig                 |   7 ++
>>>  include/linux/bpf_types.h    |   3 +
>>>  include/linux/seccomp.h      |  12 +++
>>>  include/uapi/linux/bpf.h     |   2 +
>>>  include/uapi/linux/ptrace.h  |   5 +-
>>>  include/uapi/linux/seccomp.h |  15 ++--
>>>  kernel/bpf/syscall.c         |   1 +
>>>  kernel/ptrace.c              |   3 +
>>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>>  samples/bpf/Makefile         |   9 +++
>>>  samples/bpf/bpf_load.c       |   9 ++-
>>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>>  create mode 100644 samples/bpf/seccomp1_user.c
>>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>>  create mode 100644 samples/bpf/seccomp2_user.c
>>>
>>> --
>>> 2.14.1
>>>
>>
>>
>>
>> --
>> Kees Cook
>> Pixel Security



-- 


Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC  511E 18F3 685C 0022 BFF3
pgp.mit.edu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 16:29   ` Sargun Dhillon
       [not found]     ` <CAMp4zn8VNurTjmrUtHnaK21A4hUQQz5tnarj15vmTU+TjY79XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 17:02     ` Jessie Frazelle
       [not found]       ` <CAEk6tEw3ty0kBH+06TYt4=Ywt-4_cHBa9f8p3ajMghtjRkHmMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-13 17:31       ` Sargun Dhillon
  1 sibling, 2 replies; 34+ messages in thread
From: Jessie Frazelle @ 2018-02-13 17:02 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Kees Cook, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Linux Containers, Andy Lutomirski, Will Drewry

On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>> this patchset doesn't introduce much of the functionality enabled by
>>> eBPF, it lays the ground work for it.
>>>
>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>> In the attached samples, there's an example of this. One can then use
>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>> and use that at reload time.
>>>
>>> The primary reason for not adding maps support in this patchset is
>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>> If we have a map that the BPF program can read, it can potentially
>>> "change" privileges after running. It seems like doing writes only
>>> is safe, because it can be pure, and side effect free, and therefore
>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>> to an agreement, this can be in a follow-up patchset.
>>
>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>> until we have an overwhelmingly good reason to use eBPF as a "native"
>> seccomp filter language.
>>
>> -Kees
>>
> Three reasons:
> 1) The userspace tooling for eBPF is much better than the user space
> tooling for cBPF. Our use case is specifically to optimize Docker
> policies. This is roughly what their seccomp policy looks like:
> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
> It would be much nicer to be able to leverage eBPF to write this in C,
> or any other the other languages targetting eBPF. In addition, if we
> have write-only maps, we can exfiltrate information from seccomp, like
> arguments, and errors in a relatively cheap way compared to cBPF, and
> then extract this via the bcc stack. Writing cBPF via C macros is a
> pain, and the off the shelf cBPF libraries are getting no love. The
> eBPF community is *exploding* with contributions.

Is stage two of this getting runc to support eBPF and docker to change
the default to be written as eBPF, because I foresee that being a
problem mainly with the kernel versions people use. The point of that
patch was to help the most people and as your point in (2) is made
about performance, that is a trade-off I would be willing to make in
order to have this functionality on more kernel versions.

The other alternative would be to have docker translate to use eBPF if
the kernel supported it, but that amount of complexity seems a bit
unnecessary for a feature that was trying to also be "simple".

Or do you plan on wrapping filters onto processes tangentially from
the runtime, in which case, that should be totally fine :)

Anyways this is kinda a tangent from the main point of getting it in
the kernel, just I would hate to see someone having to maintain this
without there being a path to getting it upstream elsewhere.

>
> 2) In my testing, which thus so far has been very rudimentary, with
> rewriting the policy that libseccomp generates from the Docker policy
> to use eBPF, and eBPF maps performs much better than cBPF. The
> specific case tested was to use a bpf array to lookup rules for a
> particular syscall. In a super trivial test, this was about 5% low
> latency than using traditional branches. If you need more evidence of
> this, I can work a little bit more on the maps related patches, and
> see if I can get some more benchmarking. From my understanding, we
> would need to add "sealing" support for maps, in which they can be
> marked as read-only, and only at that point should an eBPF seccomp
> program be able to read from them.
>
> 3) Eventually, I'd like to use some more advanced capabilities of
> eBPF, like being able to rewrite arguments safely (not things referred
> to by pointers, but just plain old arguments).
>
>>>
>>>
>>> Sargun Dhillon (3):
>>>   bpf, seccomp: Add eBPF filter capabilities
>>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>>     filters
>>>   bpf: Add eBPF seccomp sample programs
>>>
>>>  arch/Kconfig                 |   7 ++
>>>  include/linux/bpf_types.h    |   3 +
>>>  include/linux/seccomp.h      |  12 +++
>>>  include/uapi/linux/bpf.h     |   2 +
>>>  include/uapi/linux/ptrace.h  |   5 +-
>>>  include/uapi/linux/seccomp.h |  15 ++--
>>>  kernel/bpf/syscall.c         |   1 +
>>>  kernel/ptrace.c              |   3 +
>>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>>  samples/bpf/Makefile         |   9 +++
>>>  samples/bpf/bpf_load.c       |   9 ++-
>>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>>  create mode 100644 samples/bpf/seccomp1_user.c
>>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>>  create mode 100644 samples/bpf/seccomp2_user.c
>>>
>>> --
>>> 2.14.1
>>>
>>
>>
>>
>> --
>> Kees Cook
>> Pixel Security



-- 


Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC  511E 18F3 685C 0022 BFF3
pgp.mit.edu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]       ` <CAEk6tEw3ty0kBH+06TYt4=Ywt-4_cHBa9f8p3ajMghtjRkHmMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 17:07         ` Brian Goff
  2018-02-13 17:31         ` Sargun Dhillon
  1 sibling, 0 replies; 34+ messages in thread
From: Brian Goff @ 2018-02-13 17:07 UTC (permalink / raw)
  To: Jessie Frazelle
  Cc: Will Drewry, Kees Cook, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski,
	Sargun Dhillon

Agreed. I like the idea, but we'll have to maintain backwards compat at the
docker/runc level... but doesn't mean it shouldn't be added.
It may just take a long time to add support.


On Tue, Feb 13, 2018 at 12:02 PM, Jessie Frazelle <me-XvZkT8t+Da5Wk0Htik3J/w@public.gmane.org> wrote:

> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> > On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> wrote:
> >> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org>
> wrote:
> >>> This patchset enables seccomp filters to be written in eBPF. Although,
> >>> this patchset doesn't introduce much of the functionality enabled by
> >>> eBPF, it lays the ground work for it.
> >>>
> >>> It also introduces the capability to dump eBPF filters via the PTRACE
> >>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> >>> In the attached samples, there's an example of this. One can then use
> >>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> >>> and use that at reload time.
> >>>
> >>> The primary reason for not adding maps support in this patchset is
> >>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >>> If we have a map that the BPF program can read, it can potentially
> >>> "change" privileges after running. It seems like doing writes only
> >>> is safe, because it can be pure, and side effect free, and therefore
> >>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >>> to an agreement, this can be in a follow-up patchset.
> >>
> >> What's the reason for adding eBPF support? seccomp shouldn't need it,
> >> and it only makes the code more complex. I'd rather stick with  -- cBPF
> >> until we have an overwhelmingly good reason to use eBPF as a "native"
> >> seccomp filter language.
> >>
> >> -Kees
> >>
> > Three reasons:
> > 1) The userspace tooling for eBPF is much better than the user space
> > tooling for cBPF. Our use case is specifically to optimize Docker
> > policies. This is roughly what their seccomp policy looks like:
> > https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
> > It would be much nicer to be able to leverage eBPF to write this in C,
> > or any other the other languages targetting eBPF. In addition, if we
> > have write-only maps, we can exfiltrate information from seccomp, like
> > arguments, and errors in a relatively cheap way compared to cBPF, and
> > then extract this via the bcc stack. Writing cBPF via C macros is a
> > pain, and the off the shelf cBPF libraries are getting no love. The
> > eBPF community is *exploding* with contributions.
>
> Is stage two of this getting runc to support eBPF and docker to change
> the default to be written as eBPF, because I foresee that being a
> problem mainly with the kernel versions people use. The point of that
> patch was to help the most people and as your point in (2) is made
> about performance, that is a trade-off I would be willing to make in
> order to have this functionality on more kernel versions.
>
> The other alternative would be to have docker translate to use eBPF if
> the kernel supported it, but that amount of complexity seems a bit
> unnecessary for a feature that was trying to also be "simple".
>
> Or do you plan on wrapping filters onto processes tangentially from
> the runtime, in which case, that should be totally fine :)
>
> Anyways this is kinda a tangent from the main point of getting it in
> the kernel, just I would hate to see someone having to maintain this
> without there being a path to getting it upstream elsewhere.
>
> >
> > 2) In my testing, which thus so far has been very rudimentary, with
> > rewriting the policy that libseccomp generates from the Docker policy
> > to use eBPF, and eBPF maps performs much better than cBPF. The
> > specific case tested was to use a bpf array to lookup rules for a
> > particular syscall. In a super trivial test, this was about 5% low
> > latency than using traditional branches. If you need more evidence of
> > this, I can work a little bit more on the maps related patches, and
> > see if I can get some more benchmarking. From my understanding, we
> > would need to add "sealing" support for maps, in which they can be
> > marked as read-only, and only at that point should an eBPF seccomp
> > program be able to read from them.
> >
> > 3) Eventually, I'd like to use some more advanced capabilities of
> > eBPF, like being able to rewrite arguments safely (not things referred
> > to by pointers, but just plain old arguments).
> >
> >>>
> >>>
> >>> Sargun Dhillon (3):
> >>>   bpf, seccomp: Add eBPF filter capabilities
> >>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
> >>>     filters
> >>>   bpf: Add eBPF seccomp sample programs
> >>>
> >>>  arch/Kconfig                 |   7 ++
> >>>  include/linux/bpf_types.h    |   3 +
> >>>  include/linux/seccomp.h      |  12 +++
> >>>  include/uapi/linux/bpf.h     |   2 +
> >>>  include/uapi/linux/ptrace.h  |   5 +-
> >>>  include/uapi/linux/seccomp.h |  15 ++--
> >>>  kernel/bpf/syscall.c         |   1 +
> >>>  kernel/ptrace.c              |   3 +
> >>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++
> ++++++++-----
> >>>  samples/bpf/Makefile         |   9 +++
> >>>  samples/bpf/bpf_load.c       |   9 ++-
> >>>  samples/bpf/seccomp1_kern.c  |  17 ++++
> >>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
> >>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
> >>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
> >>>  15 files changed, 362 insertions(+), 30 deletions(-)
> >>>  create mode 100644 samples/bpf/seccomp1_kern.c
> >>>  create mode 100644 samples/bpf/seccomp1_user.c
> >>>  create mode 100644 samples/bpf/seccomp2_kern.c
> >>>  create mode 100644 samples/bpf/seccomp2_user.c
> >>>
> >>> --
> >>> 2.14.1
> >>>
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Pixel Security
>
>
>
> --
>
>
> Jessie Frazelle
> 4096R / D4C4 DD60 0D66 F65A 8EFC  511E 18F3 685C 0022 BFF3
> pgp.mit.edu
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>



-- 


- Brian Goff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]       ` <CAEk6tEw3ty0kBH+06TYt4=Ywt-4_cHBa9f8p3ajMghtjRkHmMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-13 17:07         ` Brian Goff
@ 2018-02-13 17:31         ` Sargun Dhillon
  1 sibling, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 17:31 UTC (permalink / raw)
  To: Jessie Frazelle
  Cc: Will Drewry, Kees Cook, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski

On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me-XvZkT8t+Da5Wk0Htik3J/w@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>>> this patchset doesn't introduce much of the functionality enabled by
>>>> eBPF, it lays the ground work for it.
>>>>
>>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>>> In the attached samples, there's an example of this. One can then use
>>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>>> and use that at reload time.
>>>>
>>>> The primary reason for not adding maps support in this patchset is
>>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>>> If we have a map that the BPF program can read, it can potentially
>>>> "change" privileges after running. It seems like doing writes only
>>>> is safe, because it can be pure, and side effect free, and therefore
>>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>>> to an agreement, this can be in a follow-up patchset.
>>>
>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>> seccomp filter language.
>>>
>>> -Kees
>>>
>> Three reasons:
>> 1) The userspace tooling for eBPF is much better than the user space
>> tooling for cBPF. Our use case is specifically to optimize Docker
>> policies. This is roughly what their seccomp policy looks like:
>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>> It would be much nicer to be able to leverage eBPF to write this in C,
>> or any other the other languages targetting eBPF. In addition, if we
>> have write-only maps, we can exfiltrate information from seccomp, like
>> arguments, and errors in a relatively cheap way compared to cBPF, and
>> then extract this via the bcc stack. Writing cBPF via C macros is a
>> pain, and the off the shelf cBPF libraries are getting no love. The
>> eBPF community is *exploding* with contributions.
>
> Is stage two of this getting runc to support eBPF and docker to change
> the default to be written as eBPF, because I foresee that being a
> problem mainly with the kernel versions people use. The point of that
> patch was to help the most people and as your point in (2) is made
> about performance, that is a trade-off I would be willing to make in
> order to have this functionality on more kernel versions.
>
> The other alternative would be to have docker translate to use eBPF if
).> the kernel supported it, but that amount of complexity seems a bit
> unnecessary for a feature that was trying to also be "simple".
>
> Or do you plan on wrapping filters onto processes tangentially from
> the runtime, in which case, that should be totally fine :)
>
> Anyways this is kinda a tangent from the main point of getting it in
> the kernel, just I would hate to see someone having to maintain this
> without there being a path to getting it upstream elsewhere.
>
We (me) intend to do the work to get it into Docker / Moby /
Containerd / Runc / Whatever the kids call it these days. It already
has the idea of multiple security modules, like seccomp, apparmor,
etc.. I can imagine that the first approach would be just to let
people pass eBPF filters as code, in the same way. Afterwards, there
could be more sophisticated approaches in order to transparently
upgrade people's filters, and give them performance upgrades.

A really naive approach is to take the JSON seccomp policy document
and converting it to plain old C with switch / case statements. Then
we can just push that through LLVM and we're in business. Although,
for some reason, I don't think the folks will want to take a hard dep
on llvm at runtime, so maybe there's some mechanism where it first
tries llvm, then tries to create a eBPF application naively, and then
falls back to cBPF. My primary fear with the first two approaches is
that given how the policies are written today, it's not conducive to
the eBPF instruction limit.

Our initial approach for this internally, since we use Docker 1.13.1,
and backporting this can be a bit of a pain. Docker has the ability to
spawn a pid 1 in the container, and we can use that to install the
seccomp filter, while leaving seccomp in the daemon off. Whenever this
is ready for public consumption, we'll share. Anyway, a 5% performance
gain across our fleet is an exciting proposition, and we use Docker,
so it's a problem that we have to figure out anyway.

>>
>> 2) In my testing, which thus so far has been very rudimentary, with
>> rewriting the policy that libseccomp generates from the Docker policy
>> to use eBPF, and eBPF maps performs much better than cBPF. The
>> specific case tested was to use a bpf array to lookup rules for a
>> particular syscall. In a super trivial test, this was about 5% low
>> latency than using traditional branches. If you need more evidence of
>> this, I can work a little bit more on the maps related patches, and
>> see if I can get some more benchmarking. From my understanding, we
>> would need to add "sealing" support for maps, in which they can be
>> marked as read-only, and only at that point should an eBPF seccomp
>> program be able to read from them.
>>
>> 3) Eventually, I'd like to use some more advanced capabilities of
>> eBPF, like being able to rewrite arguments safely (not things referred
>> to by pointers, but just plain old arguments).
>>
>>>>
>>>>
>>>> Sargun Dhillon (3):
>>>>   bpf, seccomp: Add eBPF filter capabilities
>>>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>>>     filters
>>>>   bpf: Add eBPF seccomp sample programs
>>>>
>>>>  arch/Kconfig                 |   7 ++
>>>>  include/linux/bpf_types.h    |   3 +
>>>>  include/linux/seccomp.h      |  12 +++
>>>>  include/uapi/linux/bpf.h     |   2 +
>>>>  include/uapi/linux/ptrace.h  |   5 +-
>>>>  include/uapi/linux/seccomp.h |  15 ++--
>>>>  kernel/bpf/syscall.c         |   1 +
>>>>  kernel/ptrace.c              |   3 +
>>>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>>>  samples/bpf/Makefile         |   9 +++
>>>>  samples/bpf/bpf_load.c       |   9 ++-
>>>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>>>  create mode 100644 samples/bpf/seccomp1_user.c
>>>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>>>  create mode 100644 samples/bpf/seccomp2_user.c
>>>>
>>>> --
>>>> 2.14.1
>>>>
>>>
>>>
>>>
>>> --
>>> Kees Cook
>>> Pixel Security
>
>
>
> --
>
>
> Jessie Frazelle
> 4096R / D4C4 DD60 0D66 F65A 8EFC  511E 18F3 685C 0022 BFF3
> pgp.mit.edu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 17:02     ` Jessie Frazelle
       [not found]       ` <CAEk6tEw3ty0kBH+06TYt4=Ywt-4_cHBa9f8p3ajMghtjRkHmMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 17:31       ` Sargun Dhillon
  2018-02-13 20:16         ` Kees Cook
       [not found]         ` <CAMp4zn-Lw0grNrCyjHJZUje1Aznaj03iAUWZ86ki68MZMN1-zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 17:31 UTC (permalink / raw)
  To: Jessie Frazelle
  Cc: Kees Cook, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Linux Containers, Andy Lutomirski, Will Drewry

On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me@jessfraz.com> wrote:
> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>>> this patchset doesn't introduce much of the functionality enabled by
>>>> eBPF, it lays the ground work for it.
>>>>
>>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>>> In the attached samples, there's an example of this. One can then use
>>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>>> and use that at reload time.
>>>>
>>>> The primary reason for not adding maps support in this patchset is
>>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>>> If we have a map that the BPF program can read, it can potentially
>>>> "change" privileges after running. It seems like doing writes only
>>>> is safe, because it can be pure, and side effect free, and therefore
>>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>>> to an agreement, this can be in a follow-up patchset.
>>>
>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>> seccomp filter language.
>>>
>>> -Kees
>>>
>> Three reasons:
>> 1) The userspace tooling for eBPF is much better than the user space
>> tooling for cBPF. Our use case is specifically to optimize Docker
>> policies. This is roughly what their seccomp policy looks like:
>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>> It would be much nicer to be able to leverage eBPF to write this in C,
>> or any other the other languages targetting eBPF. In addition, if we
>> have write-only maps, we can exfiltrate information from seccomp, like
>> arguments, and errors in a relatively cheap way compared to cBPF, and
>> then extract this via the bcc stack. Writing cBPF via C macros is a
>> pain, and the off the shelf cBPF libraries are getting no love. The
>> eBPF community is *exploding* with contributions.
>
> Is stage two of this getting runc to support eBPF and docker to change
> the default to be written as eBPF, because I foresee that being a
> problem mainly with the kernel versions people use. The point of that
> patch was to help the most people and as your point in (2) is made
> about performance, that is a trade-off I would be willing to make in
> order to have this functionality on more kernel versions.
>
> The other alternative would be to have docker translate to use eBPF if
).> the kernel supported it, but that amount of complexity seems a bit
> unnecessary for a feature that was trying to also be "simple".
>
> Or do you plan on wrapping filters onto processes tangentially from
> the runtime, in which case, that should be totally fine :)
>
> Anyways this is kinda a tangent from the main point of getting it in
> the kernel, just I would hate to see someone having to maintain this
> without there being a path to getting it upstream elsewhere.
>
We (me) intend to do the work to get it into Docker / Moby /
Containerd / Runc / Whatever the kids call it these days. It already
has the idea of multiple security modules, like seccomp, apparmor,
etc.. I can imagine that the first approach would be just to let
people pass eBPF filters as code, in the same way. Afterwards, there
could be more sophisticated approaches in order to transparently
upgrade people's filters, and give them performance upgrades.

A really naive approach is to take the JSON seccomp policy document
and converting it to plain old C with switch / case statements. Then
we can just push that through LLVM and we're in business. Although,
for some reason, I don't think the folks will want to take a hard dep
on llvm at runtime, so maybe there's some mechanism where it first
tries llvm, then tries to create a eBPF application naively, and then
falls back to cBPF. My primary fear with the first two approaches is
that given how the policies are written today, it's not conducive to
the eBPF instruction limit.

Our initial approach for this internally, since we use Docker 1.13.1,
and backporting this can be a bit of a pain. Docker has the ability to
spawn a pid 1 in the container, and we can use that to install the
seccomp filter, while leaving seccomp in the daemon off. Whenever this
is ready for public consumption, we'll share. Anyway, a 5% performance
gain across our fleet is an exciting proposition, and we use Docker,
so it's a problem that we have to figure out anyway.

>>
>> 2) In my testing, which thus so far has been very rudimentary, with
>> rewriting the policy that libseccomp generates from the Docker policy
>> to use eBPF, and eBPF maps performs much better than cBPF. The
>> specific case tested was to use a bpf array to lookup rules for a
>> particular syscall. In a super trivial test, this was about 5% low
>> latency than using traditional branches. If you need more evidence of
>> this, I can work a little bit more on the maps related patches, and
>> see if I can get some more benchmarking. From my understanding, we
>> would need to add "sealing" support for maps, in which they can be
>> marked as read-only, and only at that point should an eBPF seccomp
>> program be able to read from them.
>>
>> 3) Eventually, I'd like to use some more advanced capabilities of
>> eBPF, like being able to rewrite arguments safely (not things referred
>> to by pointers, but just plain old arguments).
>>
>>>>
>>>>
>>>> Sargun Dhillon (3):
>>>>   bpf, seccomp: Add eBPF filter capabilities
>>>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>>>     filters
>>>>   bpf: Add eBPF seccomp sample programs
>>>>
>>>>  arch/Kconfig                 |   7 ++
>>>>  include/linux/bpf_types.h    |   3 +
>>>>  include/linux/seccomp.h      |  12 +++
>>>>  include/uapi/linux/bpf.h     |   2 +
>>>>  include/uapi/linux/ptrace.h  |   5 +-
>>>>  include/uapi/linux/seccomp.h |  15 ++--
>>>>  kernel/bpf/syscall.c         |   1 +
>>>>  kernel/ptrace.c              |   3 +
>>>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>>>  samples/bpf/Makefile         |   9 +++
>>>>  samples/bpf/bpf_load.c       |   9 ++-
>>>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>>>  create mode 100644 samples/bpf/seccomp1_user.c
>>>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>>>  create mode 100644 samples/bpf/seccomp2_user.c
>>>>
>>>> --
>>>> 2.14.1
>>>>
>>>
>>>
>>>
>>> --
>>> Kees Cook
>>> Pixel Security
>
>
>
> --
>
>
> Jessie Frazelle
> 4096R / D4C4 DD60 0D66 F65A 8EFC  511E 18F3 685C 0022 BFF3
> pgp.mit.edu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]         ` <CAMp4zn-Lw0grNrCyjHJZUje1Aznaj03iAUWZ86ki68MZMN1-zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 20:16           ` Kees Cook
  0 siblings, 0 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 20:16 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski,
	Paul Moore

On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me-XvZkT8t+Da5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
>>>>
>>> Three reasons:
>>> 1) The userspace tooling for eBPF is much better than the user space
>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>> policies. This is roughly what their seccomp policy looks like:
>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>> or any other the other languages targetting eBPF. In addition, if we
>>> have write-only maps, we can exfiltrate information from seccomp, like
>>> arguments, and errors in a relatively cheap way compared to cBPF, and
>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>> pain, and the off the shelf cBPF libraries are getting no love. The
>>> eBPF community is *exploding* with contributions.

eBPF moving quickly is a disincentive from my perspective, as I want
absolutely zero surprises when it comes to seccomp. :) Given the
steady stream of exploitable flaws in eBPF, I don't want seccomp
anywhere near it. :( Many distros ship with the bpf() syscall
disabled, for example (or entirely compiled out, as in Chrome OS and
Android).

The convenience of writing C for eBPF output is certainly nice, but it
seems like either LLVM could grow a cBPF backend, or libseccomp could
be improved to provide the needed features.

Can you explain the exfiltration piece? Do you mean it would be
"cheap" in the sense that the results can be stored and studied
without needing a ptrace manager to catch the failures?

I remain unconvinced that seccomp needs a more descriptive language,
given its limited usage.

> A really naive approach is to take the JSON seccomp policy document
> and converting it to plain old C with switch / case statements. Then
> we can just push that through LLVM and we're in business. Although,
> for some reason, I don't think the folks will want to take a hard dep
> on llvm at runtime, so maybe there's some mechanism where it first
> tries llvm, then tries to create a eBPF application naively, and then
> falls back to cBPF. My primary fear with the first two approaches is
> that given how the policies are written today, it's not conducive to
> the eBPF instruction limit.

How about having libseccomp grow a JSON parser?

>>> 2) In my testing, which thus so far has been very rudimentary, with
>>> rewriting the policy that libseccomp generates from the Docker policy
>>> to use eBPF, and eBPF maps performs much better than cBPF. The
>>> specific case tested was to use a bpf array to lookup rules for a
>>> particular syscall. In a super trivial test, this was about 5% low
>>> latency than using traditional branches. If you need more evidence of
>>> this, I can work a little bit more on the maps related patches, and
>>> see if I can get some more benchmarking. From my understanding, we
>>> would need to add "sealing" support for maps, in which they can be
>>> marked as read-only, and only at that point should an eBPF seccomp
>>> program be able to read from them.

This came up recently on the libseccomp mailing list. The map lookup
is faster than a linear search, but for large filters, the filter can
be written as a balanced tree (as Chrome does), or reordered by
syscall frequency (as is recommended by minijail), and that appears to
get a much larger improvement than even the map lookup.

>>> 3) Eventually, I'd like to use some more advanced capabilities of
>>> eBPF, like being able to rewrite arguments safely (not things referred
>>> to by pointers, but just plain old arguments).

Much like 1), I don't find this an incentive, as the interactions
become much harder to reason about, and I am concerned we'll open
seccomp up to attack for a relatively small benefit. However,
rewriting arguments has come up in very narrow cases, and Tycho was
working on a method of doing userspace notifications (i.e. without a
ptrace manager) to get us closer.

If the needs Tycho outlined[1] could be addressed fully with eBPF, and
we can very narrowly scope the use of the "extra" eBPF features, I
might be more inclined to merge something like this, but I want to
take it very carefully. Besides creating a dependency on the bpf()
syscall, this would create side channels (via maps) that make me very
uncomfortable when dealing with process isolation. (Though, in theory,
this is already correctly constrained by no-new-privs...)

Tycho, could you get what you needed from eBPF? My impression would be
that you'd still need a user notification mechanism to stop the
process, as the decisions about how to rewrite arguments likely cannot
be fully characterized by the internal eBPF filter.

-Kees

[1] https://patchwork.kernel.org/patch/10199295/

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 17:31       ` Sargun Dhillon
@ 2018-02-13 20:16         ` Kees Cook
  2018-02-13 21:08           ` Paul Moore
       [not found]           ` <CAGXu5jKv3QFVKLhok1JWiPamE0b4CqLTO-hx8sP0KWED921=6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]         ` <CAMp4zn-Lw0grNrCyjHJZUje1Aznaj03iAUWZ86ki68MZMN1-zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 20:16 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jessie Frazelle, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Linux Containers, Andy Lutomirski, Will Drewry,
	Tycho Andersen, Paul Moore

On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me@jessfraz.com> wrote:
>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
>>>>
>>> Three reasons:
>>> 1) The userspace tooling for eBPF is much better than the user space
>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>> policies. This is roughly what their seccomp policy looks like:
>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>> or any other the other languages targetting eBPF. In addition, if we
>>> have write-only maps, we can exfiltrate information from seccomp, like
>>> arguments, and errors in a relatively cheap way compared to cBPF, and
>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>> pain, and the off the shelf cBPF libraries are getting no love. The
>>> eBPF community is *exploding* with contributions.

eBPF moving quickly is a disincentive from my perspective, as I want
absolutely zero surprises when it comes to seccomp. :) Given the
steady stream of exploitable flaws in eBPF, I don't want seccomp
anywhere near it. :( Many distros ship with the bpf() syscall
disabled, for example (or entirely compiled out, as in Chrome OS and
Android).

The convenience of writing C for eBPF output is certainly nice, but it
seems like either LLVM could grow a cBPF backend, or libseccomp could
be improved to provide the needed features.

Can you explain the exfiltration piece? Do you mean it would be
"cheap" in the sense that the results can be stored and studied
without needing a ptrace manager to catch the failures?

I remain unconvinced that seccomp needs a more descriptive language,
given its limited usage.

> A really naive approach is to take the JSON seccomp policy document
> and converting it to plain old C with switch / case statements. Then
> we can just push that through LLVM and we're in business. Although,
> for some reason, I don't think the folks will want to take a hard dep
> on llvm at runtime, so maybe there's some mechanism where it first
> tries llvm, then tries to create a eBPF application naively, and then
> falls back to cBPF. My primary fear with the first two approaches is
> that given how the policies are written today, it's not conducive to
> the eBPF instruction limit.

How about having libseccomp grow a JSON parser?

>>> 2) In my testing, which thus so far has been very rudimentary, with
>>> rewriting the policy that libseccomp generates from the Docker policy
>>> to use eBPF, and eBPF maps performs much better than cBPF. The
>>> specific case tested was to use a bpf array to lookup rules for a
>>> particular syscall. In a super trivial test, this was about 5% low
>>> latency than using traditional branches. If you need more evidence of
>>> this, I can work a little bit more on the maps related patches, and
>>> see if I can get some more benchmarking. From my understanding, we
>>> would need to add "sealing" support for maps, in which they can be
>>> marked as read-only, and only at that point should an eBPF seccomp
>>> program be able to read from them.

This came up recently on the libseccomp mailing list. The map lookup
is faster than a linear search, but for large filters, the filter can
be written as a balanced tree (as Chrome does), or reordered by
syscall frequency (as is recommended by minijail), and that appears to
get a much larger improvement than even the map lookup.

>>> 3) Eventually, I'd like to use some more advanced capabilities of
>>> eBPF, like being able to rewrite arguments safely (not things referred
>>> to by pointers, but just plain old arguments).

Much like 1), I don't find this an incentive, as the interactions
become much harder to reason about, and I am concerned we'll open
seccomp up to attack for a relatively small benefit. However,
rewriting arguments has come up in very narrow cases, and Tycho was
working on a method of doing userspace notifications (i.e. without a
ptrace manager) to get us closer.

If the needs Tycho outlined[1] could be addressed fully with eBPF, and
we can very narrowly scope the use of the "extra" eBPF features, I
might be more inclined to merge something like this, but I want to
take it very carefully. Besides creating a dependency on the bpf()
syscall, this would create side channels (via maps) that make me very
uncomfortable when dealing with process isolation. (Though, in theory,
this is already correctly constrained by no-new-privs...)

Tycho, could you get what you needed from eBPF? My impression would be
that you'd still need a user notification mechanism to stop the
process, as the decisions about how to rewrite arguments likely cannot
be fully characterized by the internal eBPF filter.

-Kees

[1] https://patchwork.kernel.org/patch/10199295/

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]           ` <CAGXu5jKv3QFVKLhok1JWiPamE0b4CqLTO-hx8sP0KWED921=6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 20:50             ` Tycho Andersen
  2018-02-13 21:08             ` Paul Moore
  1 sibling, 0 replies; 34+ messages in thread
From: Tycho Andersen @ 2018-02-13 20:50 UTC (permalink / raw)
  To: Kees Cook
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski,
	Paul Moore, Sargun Dhillon

On Tue, Feb 13, 2018 at 12:16:42PM -0800, Kees Cook wrote:
> If the needs Tycho outlined[1] could be addressed fully with eBPF, and
> we can very narrowly scope the use of the "extra" eBPF features, I
> might be more inclined to merge something like this, but I want to
> take it very carefully. Besides creating a dependency on the bpf()
> syscall, this would create side channels (via maps) that make me very
> uncomfortable when dealing with process isolation. (Though, in theory,
> this is already correctly constrained by no-new-privs...)
> 
> Tycho, could you get what you needed from eBPF?

We could get almost all the way there, I think. We could pass the
event via a bpf map, and then have a userspace daemon do:

    while (1) {
        bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
        if (!syscall_queued(&attr))
            continue;

        do_stuff(&attr);
        set_done(&attr);
        bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    }

but as you say,

> My impression would be that you'd still need a user notification
> mechanism to stop the process, as the decisions about how to rewrite
> arguments likely cannot be fully characterized by the internal eBPF
> filter.

...there's no way to stop the seccomp'd task until userspace is
finished with whatever thing it needs to do on behalf of the seccomp'd
task (at least, IIUC).

That's of course ignoring the ergonomics from userspace: bpf_map_fops
doesn't implement poll() or anything, so we really do have to use a
while(1), if we want to allow more than one syscall queuing at a time,
we need to poll multiple map elements.

One of the extensions I had been considering floating for v2 of my set
was allowing users to pass fds back across (again, to make userspace
ergonomics a little better), which would be impossible via ebpf.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]           ` <CAGXu5jKv3QFVKLhok1JWiPamE0b4CqLTO-hx8sP0KWED921=6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-13 20:50             ` Tycho Andersen
@ 2018-02-13 21:08             ` Paul Moore
  1 sibling, 0 replies; 34+ messages in thread
From: Paul Moore @ 2018-02-13 21:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Will Drewry, paul-r2n+y4ga6xFZroRs9YW3xA, Daniel Borkmann,
	Network Development, Linux Containers, Alexei Starovoitov,
	Andy Lutomirski, Sargun Dhillon

On Tue, Feb 13, 2018 at 3:16 PM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me-XvZkT8t+Da5Wk0Htik3J/w@public.gmane.org> wrote:
>>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>>> seccomp filter language.
>>>>>
>>>> Three reasons:
>>>> 1) The userspace tooling for eBPF is much better than the user space
>>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>>> policies. This is roughly what their seccomp policy looks like:
>>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>>> or any other the other languages targetting eBPF. In addition, if we
>>>> have write-only maps, we can exfiltrate information from seccomp, like
>>>> arguments, and errors in a relatively cheap way compared to cBPF, and
>>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>>> pain, and the off the shelf cBPF libraries are getting no love.

What do you mean "no love"?  I would consider libseccomp is a cBPF
library and it is actively maintained/developed.

>>>> The eBPF community is *exploding* with contributions.
>
> eBPF moving quickly is a disincentive from my perspective, as I want
> absolutely zero surprises when it comes to seccomp. :) Given the
> steady stream of exploitable flaws in eBPF, I don't want seccomp
> anywhere near it. :( Many distros ship with the bpf() syscall
> disabled, for example (or entirely compiled out, as in Chrome OS and
> Android).
>
> The convenience of writing C for eBPF output is certainly nice, but it
> seems like either LLVM could grow a cBPF backend, or libseccomp could
> be improved to provide the needed features.

I'm always happy to discuss adding new functionality to libseccomp;
feel free to use the GH issue tracker or the libseccomp mailing list.

> Can you explain the exfiltration piece? Do you mean it would be
> "cheap" in the sense that the results can be stored and studied
> without needing a ptrace manager to catch the failures?

I'm a little confused about this piece too.

> I remain unconvinced that seccomp needs a more descriptive language,
> given its limited usage.

FWIW, I haven't yet seen a functionality request for libseccomp that
couldn't be addressed with cBPF and some creativity.

>> A really naive approach is to take the JSON seccomp policy document
>> and converting it to plain old C with switch / case statements. Then
>> we can just push that through LLVM and we're in business. Although,
>> for some reason, I don't think the folks will want to take a hard dep
>> on llvm at runtime, so maybe there's some mechanism where it first
>> tries llvm, then tries to create a eBPF application naively, and then
>> falls back to cBPF. My primary fear with the first two approaches is
>> that given how the policies are written today, it's not conducive to
>> the eBPF instruction limit.
>
> How about having libseccomp grow a JSON parser?

Generally my opinion is that seccomp filter configuration file formats
are best left to the calling application, not libseccomp.  This way
the seccomp filter configuration can be consistent with the rest of
the application's configuration.

However, if someone really wants to work on this, I'm not sure I would say "no".

>>>> 2) In my testing, which thus so far has been very rudimentary, with
>>>> rewriting the policy that libseccomp generates from the Docker policy
>>>> to use eBPF, and eBPF maps performs much better than cBPF. The
>>>> specific case tested was to use a bpf array to lookup rules for a
>>>> particular syscall. In a super trivial test, this was about 5% low
>>>> latency than using traditional branches. If you need more evidence of
>>>> this, I can work a little bit more on the maps related patches, and
>>>> see if I can get some more benchmarking. From my understanding, we
>>>> would need to add "sealing" support for maps, in which they can be
>>>> marked as read-only, and only at that point should an eBPF seccomp
>>>> program be able to read from them.
>
> This came up recently on the libseccomp mailing list. The map lookup
> is faster than a linear search, but for large filters, the filter can
> be written as a balanced tree (as Chrome does), or reordered by
> syscall frequency (as is recommended by minijail), and that appears to
> get a much larger improvement than even the map lookup.

For reference, the current libseccomp approach is to put the shorter
rules near the top of the filter (e.g. syscall only) with the longer
rules (e.g. syscall + arguments) towards the end.  The libseccomp API
does allow for callers to influence the ordering via syscall priority
hints.

Someone is currently looking a tree-based ordering of syscalls for
libseccomp, and I'm always open to new/better ideas.

-- 
paul moore
security @ redhat

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 20:16         ` Kees Cook
@ 2018-02-13 21:08           ` Paul Moore
       [not found]           ` <CAGXu5jKv3QFVKLhok1JWiPamE0b4CqLTO-hx8sP0KWED921=6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 34+ messages in thread
From: Paul Moore @ 2018-02-13 21:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Sargun Dhillon, Jessie Frazelle, Network Development,
	Alexei Starovoitov, Daniel Borkmann, Linux Containers,
	Andy Lutomirski, Will Drewry, Tycho Andersen, paul

On Tue, Feb 13, 2018 at 3:16 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <me@jessfraz.com> wrote:
>>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>>> and it only makes the code more complex. I'd rather stick with  -- cBPF
>>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>>> seccomp filter language.
>>>>>
>>>> Three reasons:
>>>> 1) The userspace tooling for eBPF is much better than the user space
>>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>>> policies. This is roughly what their seccomp policy looks like:
>>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>>> or any other the other languages targetting eBPF. In addition, if we
>>>> have write-only maps, we can exfiltrate information from seccomp, like
>>>> arguments, and errors in a relatively cheap way compared to cBPF, and
>>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>>> pain, and the off the shelf cBPF libraries are getting no love.

What do you mean "no love"?  I would consider libseccomp is a cBPF
library and it is actively maintained/developed.

>>>> The eBPF community is *exploding* with contributions.
>
> eBPF moving quickly is a disincentive from my perspective, as I want
> absolutely zero surprises when it comes to seccomp. :) Given the
> steady stream of exploitable flaws in eBPF, I don't want seccomp
> anywhere near it. :( Many distros ship with the bpf() syscall
> disabled, for example (or entirely compiled out, as in Chrome OS and
> Android).
>
> The convenience of writing C for eBPF output is certainly nice, but it
> seems like either LLVM could grow a cBPF backend, or libseccomp could
> be improved to provide the needed features.

I'm always happy to discuss adding new functionality to libseccomp;
feel free to use the GH issue tracker or the libseccomp mailing list.

> Can you explain the exfiltration piece? Do you mean it would be
> "cheap" in the sense that the results can be stored and studied
> without needing a ptrace manager to catch the failures?

I'm a little confused about this piece too.

> I remain unconvinced that seccomp needs a more descriptive language,
> given its limited usage.

FWIW, I haven't yet seen a functionality request for libseccomp that
couldn't be addressed with cBPF and some creativity.

>> A really naive approach is to take the JSON seccomp policy document
>> and converting it to plain old C with switch / case statements. Then
>> we can just push that through LLVM and we're in business. Although,
>> for some reason, I don't think the folks will want to take a hard dep
>> on llvm at runtime, so maybe there's some mechanism where it first
>> tries llvm, then tries to create a eBPF application naively, and then
>> falls back to cBPF. My primary fear with the first two approaches is
>> that given how the policies are written today, it's not conducive to
>> the eBPF instruction limit.
>
> How about having libseccomp grow a JSON parser?

Generally my opinion is that seccomp filter configuration file formats
are best left to the calling application, not libseccomp.  This way
the seccomp filter configuration can be consistent with the rest of
the application's configuration.

However, if someone really wants to work on this, I'm not sure I would say "no".

>>>> 2) In my testing, which thus so far has been very rudimentary, with
>>>> rewriting the policy that libseccomp generates from the Docker policy
>>>> to use eBPF, and eBPF maps performs much better than cBPF. The
>>>> specific case tested was to use a bpf array to lookup rules for a
>>>> particular syscall. In a super trivial test, this was about 5% low
>>>> latency than using traditional branches. If you need more evidence of
>>>> this, I can work a little bit more on the maps related patches, and
>>>> see if I can get some more benchmarking. From my understanding, we
>>>> would need to add "sealing" support for maps, in which they can be
>>>> marked as read-only, and only at that point should an eBPF seccomp
>>>> program be able to read from them.
>
> This came up recently on the libseccomp mailing list. The map lookup
> is faster than a linear search, but for large filters, the filter can
> be written as a balanced tree (as Chrome does), or reordered by
> syscall frequency (as is recommended by minijail), and that appears to
> get a much larger improvement than even the map lookup.

For reference, the current libseccomp approach is to put the shorter
rules near the top of the filter (e.g. syscall only) with the longer
rules (e.g. syscall + arguments) towards the end.  The libseccomp API
does allow for callers to influence the ordering via syscall priority
hints.

Someone is currently looking a tree-based ordering of syscalls for
libseccomp, and I'm always open to new/better ideas.

-- 
paul moore
security @ redhat

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found] ` <20180213154244.GA3292-du9IEJ8oIxHXYT48pCVpJ3c7ZZ+wIVaZYkHkVr5ML8kVGlcevz2xqA@public.gmane.org>
  2018-02-13 15:47   ` Kees Cook
@ 2018-02-14  0:47   ` Mickaël Salaün
  1 sibling, 0 replies; 34+ messages in thread
From: Mickaël Salaün @ 2018-02-14  0:47 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: wad-F7+t8E8rja9g9hUCZPvPmw, keescook-F7+t8E8rja9g9hUCZPvPmw,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ast-DgEjT+Ai2ygdnm+yROfE0A, luto-kltTT9wpgjJwATOyAt5JVQ


[-- Attachment #1.1.1: Type: text/plain, Size: 3743 bytes --]

seccomp-bpf does not use cBPF but a subset of it. The reason is that it
is meant to reduce the attack surface of the kernel. By limiting the
number of instructions allowed by seccomp-bpf, it really reduce the
possibilities for an attacker to use seccomp-bpf as an entry point to
attack the kernel. Moreover, this subset of cBPF is just fine to filter
simple things as syscall numbers and arguments. Additional return codes
may be added to extend seccomp features.

FYI, I'm tweaking a new version of Landlock, which is not an extension
of seccomp-bpf (as it was at first) but a standalone LSM leveraging eBPF
to create security sandboxes (what seccomp-bpf does not do). I'll send
this version soon but you can get a sneak peek here (the documentation
will come with the final version):
https://github.com/landlock-lsm/linux/commit/6c9131a5ccdf7aa599999b23f3a9ae2b73008f41
(please, do not comment this code now)

I think the current seccomp-bpf bytecode is excellent for what it is
meant to do. Landlock leverage eBPF to tackle a more complex problem
(e.g. control access to files, and much more). It is not a seccomp
replacement but a complementary layer of security.

About the verbosity of seccomp filters, you may want to try other ways
to write policies (e.g. https://github.com/google/kafel/ or
https://android.googlesource.com/platform/external/minijail/+/master/tools/generate_seccomp_policy.py
or https://github.com/servo/gaol/blob/master/platform/linux/seccomp.rs).

Regards,
 Mickaël

On 13/02/2018 16:42, Sargun Dhillon wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
> 
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
> 
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.
> 
> 
> Sargun Dhillon (3):
>   bpf, seccomp: Add eBPF filter capabilities
>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>     filters
>   bpf: Add eBPF seccomp sample programs
> 
>  arch/Kconfig                 |   7 ++
>  include/linux/bpf_types.h    |   3 +
>  include/linux/seccomp.h      |  12 +++
>  include/uapi/linux/bpf.h     |   2 +
>  include/uapi/linux/ptrace.h  |   5 +-
>  include/uapi/linux/seccomp.h |  15 ++--
>  kernel/bpf/syscall.c         |   1 +
>  kernel/ptrace.c              |   3 +
>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>  samples/bpf/Makefile         |   9 +++
>  samples/bpf/bpf_load.c       |   9 ++-
>  samples/bpf/seccomp1_kern.c  |  17 ++++
>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>  15 files changed, 362 insertions(+), 30 deletions(-)
>  create mode 100644 samples/bpf/seccomp1_kern.c
>  create mode 100644 samples/bpf/seccomp1_user.c
>  create mode 100644 samples/bpf/seccomp2_kern.c
>  create mode 100644 samples/bpf/seccomp2_user.c
> 


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 205 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 15:42 [PATCH net-next 0/3] eBPF Seccomp filters Sargun Dhillon
  2018-02-13 15:47 ` Kees Cook
@ 2018-02-14  0:47 ` Mickaël Salaün
       [not found] ` <20180213154244.GA3292-du9IEJ8oIxHXYT48pCVpJ3c7ZZ+wIVaZYkHkVr5ML8kVGlcevz2xqA@public.gmane.org>
  2 siblings, 0 replies; 34+ messages in thread
From: Mickaël Salaün @ 2018-02-14  0:47 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: netdev, wad, keescook, daniel, containers, ast, luto


[-- Attachment #1.1: Type: text/plain, Size: 3743 bytes --]

seccomp-bpf does not use cBPF but a subset of it. The reason is that it
is meant to reduce the attack surface of the kernel. By limiting the
number of instructions allowed by seccomp-bpf, it really reduce the
possibilities for an attacker to use seccomp-bpf as an entry point to
attack the kernel. Moreover, this subset of cBPF is just fine to filter
simple things as syscall numbers and arguments. Additional return codes
may be added to extend seccomp features.

FYI, I'm tweaking a new version of Landlock, which is not an extension
of seccomp-bpf (as it was at first) but a standalone LSM leveraging eBPF
to create security sandboxes (what seccomp-bpf does not do). I'll send
this version soon but you can get a sneak peek here (the documentation
will come with the final version):
https://github.com/landlock-lsm/linux/commit/6c9131a5ccdf7aa599999b23f3a9ae2b73008f41
(please, do not comment this code now)

I think the current seccomp-bpf bytecode is excellent for what it is
meant to do. Landlock leverage eBPF to tackle a more complex problem
(e.g. control access to files, and much more). It is not a seccomp
replacement but a complementary layer of security.

About the verbosity of seccomp filters, you may want to try other ways
to write policies (e.g. https://github.com/google/kafel/ or
https://android.googlesource.com/platform/external/minijail/+/master/tools/generate_seccomp_policy.py
or https://github.com/servo/gaol/blob/master/platform/linux/seccomp.rs).

Regards,
 Mickaël

On 13/02/2018 16:42, Sargun Dhillon wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
> 
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
> 
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.
> 
> 
> Sargun Dhillon (3):
>   bpf, seccomp: Add eBPF filter capabilities
>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>     filters
>   bpf: Add eBPF seccomp sample programs
> 
>  arch/Kconfig                 |   7 ++
>  include/linux/bpf_types.h    |   3 +
>  include/linux/seccomp.h      |  12 +++
>  include/uapi/linux/bpf.h     |   2 +
>  include/uapi/linux/ptrace.h  |   5 +-
>  include/uapi/linux/seccomp.h |  15 ++--
>  kernel/bpf/syscall.c         |   1 +
>  kernel/ptrace.c              |   3 +
>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>  samples/bpf/Makefile         |   9 +++
>  samples/bpf/bpf_load.c       |   9 ++-
>  samples/bpf/seccomp1_kern.c  |  17 ++++
>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>  15 files changed, 362 insertions(+), 30 deletions(-)
>  create mode 100644 samples/bpf/seccomp1_kern.c
>  create mode 100644 samples/bpf/seccomp1_user.c
>  create mode 100644 samples/bpf/seccomp2_kern.c
>  create mode 100644 samples/bpf/seccomp2_user.c
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]   ` <CAGXu5jLiYh0rSRuJ_-2xLB03Wod5G07njpoESR4SnmsmiUnsEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-13 16:29     ` Sargun Dhillon
@ 2018-02-14 17:25     ` Andy Lutomirski
  1 sibling, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-02-14 17:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Sargun Dhillon

On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>

I can think of two fairly strong use cases for eBPF's ability to call
functions: logging and Tycho's user notifier thing.  They let seccomp
filters *do* something synchronously, which is a better match for both
use cases than the current approach of "hey, I'd like to log this
syscall, but it's really awkward to attach other information or to
track exactly *which* filter logged what or to stack any of it".

Also, eBPF's stronger arithmetic support would allow bitops (I think),
which would make "is the nr in this list" quite a bit faster in some
cases.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 15:47 ` Kees Cook
  2018-02-13 16:29   ` Sargun Dhillon
@ 2018-02-14 17:25   ` Andy Lutomirski
  2018-02-14 17:32     ` Tycho Andersen
       [not found]     ` <CALCETrV9xUd3XRgobTDgVNRFY_+o=pEDkfjvuxQ7w_UyH324zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]   ` <CAGXu5jLiYh0rSRuJ_-2xLB03Wod5G07njpoESR4SnmsmiUnsEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-02-14 17:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Sargun Dhillon, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Linux Containers, Will Drewry

On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> What's the reason for adding eBPF support? seccomp shouldn't need it,
> and it only makes the code more complex. I'd rather stick with cBPF
> until we have an overwhelmingly good reason to use eBPF as a "native"
> seccomp filter language.
>

I can think of two fairly strong use cases for eBPF's ability to call
functions: logging and Tycho's user notifier thing.  They let seccomp
filters *do* something synchronously, which is a better match for both
use cases than the current approach of "hey, I'd like to log this
syscall, but it's really awkward to attach other information or to
track exactly *which* filter logged what or to stack any of it".

Also, eBPF's stronger arithmetic support would allow bitops (I think),
which would make "is the nr in this list" quite a bit faster in some
cases.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]     ` <CALCETrV9xUd3XRgobTDgVNRFY_+o=pEDkfjvuxQ7w_UyH324zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-14 17:32       ` Tycho Andersen
  0 siblings, 0 replies; 34+ messages in thread
From: Tycho Andersen @ 2018-02-14 17:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Will Drewry, Kees Cook, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Sargun Dhillon

On Wed, Feb 14, 2018 at 05:25:00PM +0000, Andy Lutomirski wrote:
> On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> >> This patchset enables seccomp filters to be written in eBPF. Although,
> >> this patchset doesn't introduce much of the functionality enabled by
> >> eBPF, it lays the ground work for it.
> >>
> >> It also introduces the capability to dump eBPF filters via the PTRACE
> >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> >> In the attached samples, there's an example of this. One can then use
> >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> >> and use that at reload time.
> >>
> >> The primary reason for not adding maps support in this patchset is
> >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >> If we have a map that the BPF program can read, it can potentially
> >> "change" privileges after running. It seems like doing writes only
> >> is safe, because it can be pure, and side effect free, and therefore
> >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >> to an agreement, this can be in a follow-up patchset.
> >
> > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > and it only makes the code more complex. I'd rather stick with cBPF
> > until we have an overwhelmingly good reason to use eBPF as a "native"
> > seccomp filter language.
> >
> 
> I can think of two fairly strong use cases for eBPF's ability to call
> functions: logging and Tycho's user notifier thing.

Worth noting that there is one additional thing that I didn't
implement, but which would be nice and is probably not possible with
eBPF (at least, not without a bunch of additional infrastructure):
passing fds back to the tracee from the manager if you intercept
socket(), or accept() or something.

This could again be accomplished via other means, though it would be a
lot nicer to have a primitive for it.

That said, I think it's more important that something like this gets
in, vs. that it gets in with some approach like I've posted. So if we
go with eBPF and some wait functions and acknowledge that you have to
do some ptrace surgery, that is better than nothing.

Tycho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-14 17:25   ` Andy Lutomirski
@ 2018-02-14 17:32     ` Tycho Andersen
  2018-02-15  4:30       ` Alexei Starovoitov
  2018-02-15  4:30       ` Alexei Starovoitov
       [not found]     ` <CALCETrV9xUd3XRgobTDgVNRFY_+o=pEDkfjvuxQ7w_UyH324zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 34+ messages in thread
From: Tycho Andersen @ 2018-02-14 17:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Sargun Dhillon

On Wed, Feb 14, 2018 at 05:25:00PM +0000, Andy Lutomirski wrote:
> On Tue, Feb 13, 2018 at 3:47 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
> >> This patchset enables seccomp filters to be written in eBPF. Although,
> >> this patchset doesn't introduce much of the functionality enabled by
> >> eBPF, it lays the ground work for it.
> >>
> >> It also introduces the capability to dump eBPF filters via the PTRACE
> >> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> >> In the attached samples, there's an example of this. One can then use
> >> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> >> and use that at reload time.
> >>
> >> The primary reason for not adding maps support in this patchset is
> >> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> >> If we have a map that the BPF program can read, it can potentially
> >> "change" privileges after running. It seems like doing writes only
> >> is safe, because it can be pure, and side effect free, and therefore
> >> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> >> to an agreement, this can be in a follow-up patchset.
> >
> > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > and it only makes the code more complex. I'd rather stick with cBPF
> > until we have an overwhelmingly good reason to use eBPF as a "native"
> > seccomp filter language.
> >
> 
> I can think of two fairly strong use cases for eBPF's ability to call
> functions: logging and Tycho's user notifier thing.

Worth noting that there is one additional thing that I didn't
implement, but which would be nice and is probably not possible with
eBPF (at least, not without a bunch of additional infrastructure):
passing fds back to the tracee from the manager if you intercept
socket(), or accept() or something.

This could again be accomplished via other means, though it would be a
lot nicer to have a primitive for it.

That said, I think it's more important that something like this gets
in, vs. that it gets in with some approach like I've posted. So if we
go with eBPF and some wait functions and acknowledge that you have to
do some ptrace surgery, that is better than nothing.

Tycho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-14 17:32     ` Tycho Andersen
@ 2018-02-15  4:30       ` Alexei Starovoitov
  2018-02-15  4:30       ` Alexei Starovoitov
  1 sibling, 0 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2018-02-15  4:30 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Will Drewry, Kees Cook, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, Andy Lutomirski,
	Sargun Dhillon, David S. Miller, Lorenzo Colitti

On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
> > >
> > > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > > and it only makes the code more complex. I'd rather stick with cBPF
> > > until we have an overwhelmingly good reason to use eBPF as a "native"
> > > seccomp filter language.
> > >
> > 
> > I can think of two fairly strong use cases for eBPF's ability to call
> > functions: logging and Tycho's user notifier thing.
> 
> Worth noting that there is one additional thing that I didn't
> implement, but which would be nice and is probably not possible with
> eBPF (at least, not without a bunch of additional infrastructure):
> passing fds back to the tracee from the manager if you intercept
> socket(), or accept() or something.
> 
> This could again be accomplished via other means, though it would be a
> lot nicer to have a primitive for it.

there is bpf_perf_event_output() interface that allows to stream
arbitrary data from kernel into user space via perf ring buffer.
User space can epoll on it. We use this in both tracing and networking
for notifications and streaming data transfers.
I suspect this can be used for 'logging' too, since it's cheap and fast.

Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
and read-only maps.
Lorenzo,
there was a claim in this thread that bpf is disabled on android.
Can you please clarify ?
If it's actually disabled and there is no intent to enable it,
I'd rather not add any more android specific features to bpf.

What I think is important to understand is that BPF goes through
very active development. The verifier is constantly getting smarter.
There is work to add bounded loops, lock/unlock, get/put tracking,
global/percpu variables, dynamic linking and so on.
Most of the features are available to root only and unpriv
has very limited set. Like getting bpf_perf_event_output() to work
for unpriv will likely require additional verifier work.

So all cool bits will not be usable by seccomp+eBPF and unpriv
on day one. It's not a lot of work either, but once it's done
I'd hate to see arguments against adding more verifier features
just because eBPF is used by seccomp/landlock/other_security_thing.

Also I think the argument that seccomp+eBPF will be faster than
seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
under seccomp, since _all_ syscalls are already slow for sandboxed app.
Instead of making seccomp 5% faster with eBPF, I think it's
worth looking into extending LSM hooks to cover all syscalls and
have programmable (bpf or whatever) filtering applied per syscall.
Like we can have a white list syscall table covered by lsm hooks
and any other syscall will get into old seccomp-style
filtering category automatically.
lsm+bpf would need to follow process hierarchy. It shouldn't be
a runtime check at syscall entry either, but compile time
extra branch in SYSCALL_DEFINE for non-whitelisted syscalls.
There are bunch of other things to figure out, but I think
the perf win will be bigger than replacing cBPF with eBPF in
existing seccomp.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-14 17:32     ` Tycho Andersen
  2018-02-15  4:30       ` Alexei Starovoitov
@ 2018-02-15  4:30       ` Alexei Starovoitov
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2018-02-15  4:30 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Kees Cook, Will Drewry, daniel, netdev,
	Linux Containers, Sargun Dhillon, David S. Miller,
	Lorenzo Colitti

On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
> > >
> > > What's the reason for adding eBPF support? seccomp shouldn't need it,
> > > and it only makes the code more complex. I'd rather stick with cBPF
> > > until we have an overwhelmingly good reason to use eBPF as a "native"
> > > seccomp filter language.
> > >
> > 
> > I can think of two fairly strong use cases for eBPF's ability to call
> > functions: logging and Tycho's user notifier thing.
> 
> Worth noting that there is one additional thing that I didn't
> implement, but which would be nice and is probably not possible with
> eBPF (at least, not without a bunch of additional infrastructure):
> passing fds back to the tracee from the manager if you intercept
> socket(), or accept() or something.
> 
> This could again be accomplished via other means, though it would be a
> lot nicer to have a primitive for it.

there is bpf_perf_event_output() interface that allows to stream
arbitrary data from kernel into user space via perf ring buffer.
User space can epoll on it. We use this in both tracing and networking
for notifications and streaming data transfers.
I suspect this can be used for 'logging' too, since it's cheap and fast.

Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
and read-only maps.
Lorenzo,
there was a claim in this thread that bpf is disabled on android.
Can you please clarify ?
If it's actually disabled and there is no intent to enable it,
I'd rather not add any more android specific features to bpf.

What I think is important to understand is that BPF goes through
very active development. The verifier is constantly getting smarter.
There is work to add bounded loops, lock/unlock, get/put tracking,
global/percpu variables, dynamic linking and so on.
Most of the features are available to root only and unpriv
has very limited set. Like getting bpf_perf_event_output() to work
for unpriv will likely require additional verifier work.

So all cool bits will not be usable by seccomp+eBPF and unpriv
on day one. It's not a lot of work either, but once it's done
I'd hate to see arguments against adding more verifier features
just because eBPF is used by seccomp/landlock/other_security_thing.

Also I think the argument that seccomp+eBPF will be faster than
seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
under seccomp, since _all_ syscalls are already slow for sandboxed app.
Instead of making seccomp 5% faster with eBPF, I think it's
worth looking into extending LSM hooks to cover all syscalls and
have programmable (bpf or whatever) filtering applied per syscall.
Like we can have a white list syscall table covered by lsm hooks
and any other syscall will get into old seccomp-style
filtering category automatically.
lsm+bpf would need to follow process hierarchy. It shouldn't be
a runtime check at syscall entry either, but compile time
extra branch in SYSCALL_DEFINE for non-whitelisted syscalls.
There are bunch of other things to figure out, but I think
the perf win will be bigger than replacing cBPF with eBPF in
existing seccomp.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
@ 2018-02-15  8:35           ` Lorenzo Colitti via Containers
  2018-02-15 16:05           ` Andy Lutomirski
  2018-02-16 18:39           ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Colitti via Containers @ 2018-02-15  8:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Will Drewry, Kees Cook, Daniel Borkmann,
	netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, Andy Lutomirski,
	Sargun Dhillon, David S. Miller

On Thu, Feb 15, 2018 at 1:30 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
> and read-only maps.
> Lorenzo,
> there was a claim in this thread that bpf is disabled on android.
> Can you please clarify ?

It's not compiled out, at least at the moment.
https://android.googlesource.com/kernel/configs/+/master/android-4.9/android-base.cfg
has CONFIG_BPF_SYSCALL=y. As with many things on Android, use of EBPF
is (heavily) restricted via selinux, and I'm not aware of any plans to
allow unprivileged applications to use EBPF, or even or any usecases
other than network accounting. Even for this use case, we're looking
at having the program being completely read-only and baked into the
system image.

I definitely don't have a complete view of things though. Also, bear
in mind that none of this code has been released yet, so things could
change.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
  2018-02-15  8:35           ` Lorenzo Colitti via Containers
@ 2018-02-15 16:05           ` Andy Lutomirski
  2018-02-16 18:39           ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-02-15 16:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Will Drewry, Kees Cook, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers, Sargun Dhillon,
	David S. Miller, Lorenzo Colitti


> On Feb 14, 2018, at 8:30 PM, Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
>>>> 
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
>>>> 
>>> 
>>> I can think of two fairly strong use cases for eBPF's ability to call
>>> functions: logging and Tycho's user notifier thing.
>> 
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>> 
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
> 
> there is bpf_perf_event_output() interface that allows to stream
> arbitrary data from kernel into user space via perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.

I think this is the right idea but we'd want to tweak it.  We don't want the log messages to go to some systemwide buffer (seccomp can already so this and its annoying) -- we want them to go to the filter's creator.  In fact, the seccomp listener fd concept could easily be extended to do exactly this.

> 
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for sandboxed app.

It's been a while since I benchmarked it, but I suspect that a simple seccomp filter is quite a bit faster than a PTI transition.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-15  4:30       ` Alexei Starovoitov
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
@ 2018-02-15 16:05         ` Andy Lutomirski
  2018-02-16 18:39         ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-02-15 16:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tycho Andersen, Kees Cook, Will Drewry, daniel, netdev,
	Linux Containers, Sargun Dhillon, David S. Miller,
	Lorenzo Colitti


> On Feb 14, 2018, at 8:30 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
>>>> 
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
>>>> 
>>> 
>>> I can think of two fairly strong use cases for eBPF's ability to call
>>> functions: logging and Tycho's user notifier thing.
>> 
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>> 
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
> 
> there is bpf_perf_event_output() interface that allows to stream
> arbitrary data from kernel into user space via perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.

I think this is the right idea but we'd want to tweak it.  We don't want the log messages to go to some systemwide buffer (seccomp can already so this and its annoying) -- we want them to go to the filter's creator.  In fact, the seccomp listener fd concept could easily be extended to do exactly this.

> 
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for sandboxed app.

It's been a while since I benchmarked it, but I suspect that a simple seccomp filter is quite a bit faster than a PTI transition.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
  2018-02-15  8:35           ` Lorenzo Colitti via Containers
  2018-02-15 16:05           ` Andy Lutomirski
@ 2018-02-16 18:39           ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-16 18:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Will Drewry, Kees Cook, Daniel Borkmann, netdev,
	Linux Containers, Andy Lutomirski, David S. Miller,
	Lorenzo Colitti

On Wed, Feb 14, 2018 at 8:30 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
>> > >
>> > > What's the reason for adding eBPF support? seccomp shouldn't need it,
>> > > and it only makes the code more complex. I'd rather stick with cBPF
>> > > until we have an overwhelmingly good reason to use eBPF as a "native"
>> > > seccomp filter language.
>> > >
>> >
>> > I can think of two fairly strong use cases for eBPF's ability to call
>> > functions: logging and Tycho's user notifier thing.
>>
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>>
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
>
> there is bpf_perf_event_output() interface that allows to stream
> arbitrary data from kernel into user space via perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.
>
> Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
> and read-only maps.
> Lorenzo,
> there was a claim in this thread that bpf is disabled on android.
> Can you please clarify ?
> If it's actually disabled and there is no intent to enable it,
> I'd rather not add any more android specific features to bpf.
>
> What I think is important to understand is that BPF goes through
> very active development. The verifier is constantly getting smarter.
> There is work to add bounded loops, lock/unlock, get/put tracking,
> global/percpu variables, dynamic linking and so on.
> Most of the features are available to root only and unpriv
> has very limited set. Like getting bpf_perf_event_output() to work
> for unpriv will likely require additional verifier work.
>
> So all cool bits will not be usable by seccomp+eBPF and unpriv
> on day one. It's not a lot of work either, but once it's done
> I'd hate to see arguments against adding more verifier features
> just because eBPF is used by seccomp/landlock/other_security_thing.
>
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for sandboxed app.
> Instead of making seccomp 5% faster with eBPF, I think it's
> worth looking into extending LSM hooks to cover all syscalls and
> have programmable (bpf or whatever) filtering applied per syscall.
> Like we can have a white list syscall table covered by lsm hooks
> and any other syscall will get into old seccomp-style
> filtering category automatically.
> lsm+bpf would need to follow process hierarchy. It shouldn't be
> a runtime check at syscall entry either, but compile time
> extra branch in SYSCALL_DEFINE for non-whitelisted syscalls.
> There are bunch of other things to figure out, but I think
> the perf win will be bigger than replacing cBPF with eBPF in
> existing seccomp.
>
Given this test program:
for (i = 10; i < 99999999; i++) syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs, and tail call, the
numbers are such:
ebpf JIT 12.3% slower than native
ebpf no JIT 13.6% slower than native
seccomp JIT 17.6% slower than native
seccomp no JIT 37% slower than native

This is using libseccomp for the standard seccomp BPF program. There's
no reasonable way for our workload to know which syscalls come
"earlier", so we can't take that optimization. Potentially, libseccomp
can be smarter about ordering cases (using ranges), and use an
O(log(n)) search algorithm, but both of these are microptimizations
that scale with the number of syscalls and per-syscall rules. The
nicety of using a PROG_ARRAY means that adding additional filters
(syscalls) comes at no cost, whereas there's a tradeoff any time you
add another rule in traditional seccomp filters.

This was tested on an Amazon M4.16XL running with pcid, and KPTI.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-15  4:30       ` Alexei Starovoitov
       [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
  2018-02-15 16:05         ` Andy Lutomirski
@ 2018-02-16 18:39         ` Sargun Dhillon
  2 siblings, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-16 18:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tycho Andersen, Andy Lutomirski, Kees Cook, Will Drewry,
	Daniel Borkmann, netdev, Linux Containers, David S. Miller,
	Lorenzo Colitti

On Wed, Feb 14, 2018 at 8:30 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Feb 14, 2018 at 10:32:22AM -0700, Tycho Andersen wrote:
>> > >
>> > > What's the reason for adding eBPF support? seccomp shouldn't need it,
>> > > and it only makes the code more complex. I'd rather stick with cBPF
>> > > until we have an overwhelmingly good reason to use eBPF as a "native"
>> > > seccomp filter language.
>> > >
>> >
>> > I can think of two fairly strong use cases for eBPF's ability to call
>> > functions: logging and Tycho's user notifier thing.
>>
>> Worth noting that there is one additional thing that I didn't
>> implement, but which would be nice and is probably not possible with
>> eBPF (at least, not without a bunch of additional infrastructure):
>> passing fds back to the tracee from the manager if you intercept
>> socket(), or accept() or something.
>>
>> This could again be accomplished via other means, though it would be a
>> lot nicer to have a primitive for it.
>
> there is bpf_perf_event_output() interface that allows to stream
> arbitrary data from kernel into user space via perf ring buffer.
> User space can epoll on it. We use this in both tracing and networking
> for notifications and streaming data transfers.
> I suspect this can be used for 'logging' too, since it's cheap and fast.
>
> Specifically for android we added bpf_lsm hooks, cookie/uid helpers,
> and read-only maps.
> Lorenzo,
> there was a claim in this thread that bpf is disabled on android.
> Can you please clarify ?
> If it's actually disabled and there is no intent to enable it,
> I'd rather not add any more android specific features to bpf.
>
> What I think is important to understand is that BPF goes through
> very active development. The verifier is constantly getting smarter.
> There is work to add bounded loops, lock/unlock, get/put tracking,
> global/percpu variables, dynamic linking and so on.
> Most of the features are available to root only and unpriv
> has very limited set. Like getting bpf_perf_event_output() to work
> for unpriv will likely require additional verifier work.
>
> So all cool bits will not be usable by seccomp+eBPF and unpriv
> on day one. It's not a lot of work either, but once it's done
> I'd hate to see arguments against adding more verifier features
> just because eBPF is used by seccomp/landlock/other_security_thing.
>
> Also I think the argument that seccomp+eBPF will be faster than
> seccomp+cBPF is a weak one. I bet kpti on/off makes no difference
> under seccomp, since _all_ syscalls are already slow for sandboxed app.
> Instead of making seccomp 5% faster with eBPF, I think it's
> worth looking into extending LSM hooks to cover all syscalls and
> have programmable (bpf or whatever) filtering applied per syscall.
> Like we can have a white list syscall table covered by lsm hooks
> and any other syscall will get into old seccomp-style
> filtering category automatically.
> lsm+bpf would need to follow process hierarchy. It shouldn't be
> a runtime check at syscall entry either, but compile time
> extra branch in SYSCALL_DEFINE for non-whitelisted syscalls.
> There are bunch of other things to figure out, but I think
> the perf win will be bigger than replacing cBPF with eBPF in
> existing seccomp.
>
Given this test program:
for (i = 10; i < 99999999; i++) syscall(__NR_getpid);

If I implement an eBPF filter with PROG_ARRAYs, and tail call, the
numbers are such:
ebpf JIT 12.3% slower than native
ebpf no JIT 13.6% slower than native
seccomp JIT 17.6% slower than native
seccomp no JIT 37% slower than native

This is using libseccomp for the standard seccomp BPF program. There's
no reasonable way for our workload to know which syscalls come
"earlier", so we can't take that optimization. Potentially, libseccomp
can be smarter about ordering cases (using ranges), and use an
O(log(n)) search algorithm, but both of these are microptimizations
that scale with the number of syscalls and per-syscall rules. The
nicety of using a PROG_ARRAY means that adding additional filters
(syscalls) comes at no cost, whereas there's a tradeoff any time you
add another rule in traditional seccomp filters.

This was tested on an Amazon M4.16XL running with pcid, and KPTI.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found]   ` <CAGXu5jJZgrgLrhkZO33RNdOds8zwnnOZh+rqwguxJM+zm=EJ7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-13 20:38     ` Tom Hromatka
  0 siblings, 0 replies; 34+ messages in thread
From: Tom Hromatka @ 2018-02-13 20:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski,
	Sargun Dhillon



On 02/13/2018 01:35 PM, Kees Cook wrote:
> On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka <tom.hromatka@oracle.com> wrote:
>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>> this patchset doesn't introduce much of the functionality enabled by
>>> eBPF, it lays the ground work for it.
>>>
>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>> In the attached samples, there's an example of this. One can then use
>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>> and use that at reload time.
>>>
>>> The primary reason for not adding maps support in this patchset is
>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>> If we have a map that the BPF program can read, it can potentially
>>> "change" privileges after running. It seems like doing writes only
>>> is safe, because it can be pure, and side effect free, and therefore
>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>> to an agreement, this can be in a follow-up patchset.
>>
>>
>> Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp
>> userspace mailing list just last week:
>> https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74
>>
>> The kernel changes I proposed are in this email:
>> https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ
>>
>> In that email thread, Kees requested that I try out a binary tree in cBPF
>> and evaluate its performance.  I just got a rough prototype working, and
>> while not as fast as an eBPF hash map, the cBPF binary tree was a
>> significant
>> improvement over the linear list of ifs that are currently generated.  Also,
>> it only required changing a single function within the libseccomp libary
>> itself.
>>
>> https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b
>>
>> Here are the results I am currently seeing using an in-house customer's
>> seccomp filter and a simplistic test program that runs getppid() thousands
>> of times.
>>
>> Test Case                      minimum TSC ticks to make syscall
>> ----------------------------------------------------------------
>> seccomp disabled                                             620
>> getppid() at the front of 306-syscall seccomp filter         722
>> getppid() in middle of 306-syscall seccomp filter           1392
>> getppid() at the end of the 306-syscall filter              2452
>> seccomp using a 306-syscall-sized EBPF hash map              800
>> cBPF filter using a binary tree                              922
> I still think that's a crazy filter. :) It should be inverted to just
> check the 26 syscalls and a final "greater than" test. I would expect
> it to be faster still. :)
>
> -Kees

I completely agree it's a crazy filter, but it seems to be a
common "mistake" our users are making.  It would be nice to
help them out if we can.

Tom

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 20:35 ` Kees Cook
@ 2018-02-13 20:38   ` Tom Hromatka
       [not found]   ` <CAGXu5jJZgrgLrhkZO33RNdOds8zwnnOZh+rqwguxJM+zm=EJ7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 34+ messages in thread
From: Tom Hromatka @ 2018-02-13 20:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: Network Development, Sargun Dhillon, Will Drewry,
	Daniel Borkmann, Linux Containers, Alexei Starovoitov,
	Andy Lutomirski



On 02/13/2018 01:35 PM, Kees Cook wrote:
> On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka <tom.hromatka@oracle.com> wrote:
>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although,
>>> this patchset doesn't introduce much of the functionality enabled by
>>> eBPF, it lays the ground work for it.
>>>
>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>>> In the attached samples, there's an example of this. One can then use
>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>> and use that at reload time.
>>>
>>> The primary reason for not adding maps support in this patchset is
>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>> If we have a map that the BPF program can read, it can potentially
>>> "change" privileges after running. It seems like doing writes only
>>> is safe, because it can be pure, and side effect free, and therefore
>>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>> to an agreement, this can be in a follow-up patchset.
>>
>>
>> Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp
>> userspace mailing list just last week:
>> https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74
>>
>> The kernel changes I proposed are in this email:
>> https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ
>>
>> In that email thread, Kees requested that I try out a binary tree in cBPF
>> and evaluate its performance.  I just got a rough prototype working, and
>> while not as fast as an eBPF hash map, the cBPF binary tree was a
>> significant
>> improvement over the linear list of ifs that are currently generated.  Also,
>> it only required changing a single function within the libseccomp libary
>> itself.
>>
>> https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b
>>
>> Here are the results I am currently seeing using an in-house customer's
>> seccomp filter and a simplistic test program that runs getppid() thousands
>> of times.
>>
>> Test Case                      minimum TSC ticks to make syscall
>> ----------------------------------------------------------------
>> seccomp disabled                                             620
>> getppid() at the front of 306-syscall seccomp filter         722
>> getppid() in middle of 306-syscall seccomp filter           1392
>> getppid() at the end of the 306-syscall filter              2452
>> seccomp using a 306-syscall-sized EBPF hash map              800
>> cBPF filter using a binary tree                              922
> I still think that's a crazy filter. :) It should be inverted to just
> check the 26 syscalls and a final "greater than" test. I would expect
> it to be faster still. :)
>
> -Kees

I completely agree it's a crazy filter, but it seems to be a
common "mistake" our users are making.  It would be nice to
help them out if we can.

Tom

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
       [not found] ` <7eb1497e-e5f3-c5ba-e255-7f510795b51d-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2018-02-13 20:35   ` Kees Cook
  0 siblings, 0 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 20:35 UTC (permalink / raw)
  To: Tom Hromatka
  Cc: Will Drewry, Daniel Borkmann, Network Development,
	Linux Containers, Alexei Starovoitov, Andy Lutomirski,
	Sargun Dhillon

On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka <tom.hromatka-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
>>
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
>
>
> Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp
> userspace mailing list just last week:
> https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74
>
> The kernel changes I proposed are in this email:
> https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ
>
> In that email thread, Kees requested that I try out a binary tree in cBPF
> and evaluate its performance.  I just got a rough prototype working, and
> while not as fast as an eBPF hash map, the cBPF binary tree was a
> significant
> improvement over the linear list of ifs that are currently generated.  Also,
> it only required changing a single function within the libseccomp libary
> itself.
>
> https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b
>
> Here are the results I am currently seeing using an in-house customer's
> seccomp filter and a simplistic test program that runs getppid() thousands
> of times.
>
> Test Case                      minimum TSC ticks to make syscall
> ----------------------------------------------------------------
> seccomp disabled                                             620
> getppid() at the front of 306-syscall seccomp filter         722
> getppid() in middle of 306-syscall seccomp filter           1392
> getppid() at the end of the 306-syscall filter              2452
> seccomp using a 306-syscall-sized EBPF hash map              800
> cBPF filter using a binary tree                              922

I still think that's a crazy filter. :) It should be inverted to just
check the 26 syscalls and a final "greater than" test. I would expect
it to be faster still. :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
  2018-02-13 20:33 Tom Hromatka
@ 2018-02-13 20:35 ` Kees Cook
  2018-02-13 20:38   ` Tom Hromatka
       [not found]   ` <CAGXu5jJZgrgLrhkZO33RNdOds8zwnnOZh+rqwguxJM+zm=EJ7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found] ` <7eb1497e-e5f3-c5ba-e255-7f510795b51d-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 34+ messages in thread
From: Kees Cook @ 2018-02-13 20:35 UTC (permalink / raw)
  To: Tom Hromatka
  Cc: Network Development, Sargun Dhillon, Will Drewry,
	Daniel Borkmann, Linux Containers, Alexei Starovoitov,
	Andy Lutomirski

On Tue, Feb 13, 2018 at 12:33 PM, Tom Hromatka <tom.hromatka@oracle.com> wrote:
> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun@sargun.me> wrote:
>>
>> This patchset enables seccomp filters to be written in eBPF. Although,
>> this patchset doesn't introduce much of the functionality enabled by
>> eBPF, it lays the ground work for it.
>>
>> It also introduces the capability to dump eBPF filters via the PTRACE
>> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
>> In the attached samples, there's an example of this. One can then use
>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>> and use that at reload time.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
>
>
> Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp
> userspace mailing list just last week:
> https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74
>
> The kernel changes I proposed are in this email:
> https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ
>
> In that email thread, Kees requested that I try out a binary tree in cBPF
> and evaluate its performance.  I just got a rough prototype working, and
> while not as fast as an eBPF hash map, the cBPF binary tree was a
> significant
> improvement over the linear list of ifs that are currently generated.  Also,
> it only required changing a single function within the libseccomp libary
> itself.
>
> https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b
>
> Here are the results I am currently seeing using an in-house customer's
> seccomp filter and a simplistic test program that runs getppid() thousands
> of times.
>
> Test Case                      minimum TSC ticks to make syscall
> ----------------------------------------------------------------
> seccomp disabled                                             620
> getppid() at the front of 306-syscall seccomp filter         722
> getppid() in middle of 306-syscall seccomp filter           1392
> getppid() at the end of the 306-syscall filter              2452
> seccomp using a 306-syscall-sized EBPF hash map              800
> cBPF filter using a binary tree                              922

I still think that's a crazy filter. :) It should be inverted to just
check the 26 syscalls and a final "greater than" test. I would expect
it to be faster still. :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net-next 0/3] eBPF Seccomp filters
@ 2018-02-13 20:33 Tom Hromatka
  2018-02-13 20:35 ` Kees Cook
       [not found] ` <7eb1497e-e5f3-c5ba-e255-7f510795b51d-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 34+ messages in thread
From: Tom Hromatka @ 2018-02-13 20:33 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ
  Cc: wad-F7+t8E8rja9g9hUCZPvPmw, Kees Cook,
	daniel-FeC+5ew28dpmcu3hnIyYJQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ast-DgEjT+Ai2ygdnm+yROfE0A, luto-kltTT9wpgjJwATOyAt5JVQ

On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sargun-GaZTRHToo+CzQB+pC5nmwQ@public.gmane.org> wrote:
> This patchset enables seccomp filters to be written in eBPF. Although,
> this patchset doesn't introduce much of the functionality enabled by
> eBPF, it lays the ground work for it.
>
> It also introduces the capability to dump eBPF filters via the PTRACE
> API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
> In the attached samples, there's an example of this. One can then use
> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
> and use that at reload time.
>
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.


Coincidentally I also sent an RFC for adding eBPF hash maps to the seccomp
userspace mailing list just last week:
https://groups.google.com/forum/#!topic/libseccomp/pX6QkVF0F74

The kernel changes I proposed are in this email:
https://groups.google.com/d/msg/libseccomp/pX6QkVF0F74/ZUJlwI5qAwAJ

In that email thread, Kees requested that I try out a binary tree in cBPF
and evaluate its performance.  I just got a rough prototype working, and
while not as fast as an eBPF hash map, the cBPF binary tree was a significant
improvement over the linear list of ifs that are currently generated.  Also,
it only required changing a single function within the libseccomp libary
itself.

https://github.com/drakenclimber/libseccomp/commit/87b36369f17385f5a7a4d95101185577fbf6203b

Here are the results I am currently seeing using an in-house customer's
seccomp filter and a simplistic test program that runs getppid() thousands
of times.

Test Case                      minimum TSC ticks to make syscall
----------------------------------------------------------------
seccomp disabled                                             620
getppid() at the front of 306-syscall seccomp filter         722
getppid() in middle of 306-syscall seccomp filter           1392
getppid() at the end of the 306-syscall filter              2452
seccomp using a 306-syscall-sized EBPF hash map              800
cBPF filter using a binary tree                              922

Thanks.

Tom

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH net-next 0/3] eBPF Seccomp filters
@ 2018-02-13 15:42 Sargun Dhillon
  0 siblings, 0 replies; 34+ messages in thread
From: Sargun Dhillon @ 2018-02-13 15:42 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: wad-F7+t8E8rja9g9hUCZPvPmw, keescook-F7+t8E8rja9g9hUCZPvPmw,
	daniel-FeC+5ew28dpmcu3hnIyYJQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ast-DgEjT+Ai2ygdnm+yROfE0A, luto-kltTT9wpgjJwATOyAt5JVQ

This patchset enables seccomp filters to be written in eBPF. Although,
this patchset doesn't introduce much of the functionality enabled by
eBPF, it lays the ground work for it.

It also introduces the capability to dump eBPF filters via the PTRACE
API in order to make it so that CHECKPOINT_RESTORE will be satisifed.
In the attached samples, there's an example of this. One can then use
BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
and use that at reload time.

The primary reason for not adding maps support in this patchset is
to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
If we have a map that the BPF program can read, it can potentially
"change" privileges after running. It seems like doing writes only
is safe, because it can be pure, and side effect free, and therefore
not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
to an agreement, this can be in a follow-up patchset.


Sargun Dhillon (3):
  bpf, seccomp: Add eBPF filter capabilities
  seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
    filters
  bpf: Add eBPF seccomp sample programs

 arch/Kconfig                 |   7 ++
 include/linux/bpf_types.h    |   3 +
 include/linux/seccomp.h      |  12 +++
 include/uapi/linux/bpf.h     |   2 +
 include/uapi/linux/ptrace.h  |   5 +-
 include/uapi/linux/seccomp.h |  15 ++--
 kernel/bpf/syscall.c         |   1 +
 kernel/ptrace.c              |   3 +
 kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
 samples/bpf/Makefile         |   9 +++
 samples/bpf/bpf_load.c       |   9 ++-
 samples/bpf/seccomp1_kern.c  |  17 ++++
 samples/bpf/seccomp1_user.c  |  34 ++++++++
 samples/bpf/seccomp2_kern.c  |  24 ++++++
 samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
 15 files changed, 362 insertions(+), 30 deletions(-)
 create mode 100644 samples/bpf/seccomp1_kern.c
 create mode 100644 samples/bpf/seccomp1_user.c
 create mode 100644 samples/bpf/seccomp2_kern.c
 create mode 100644 samples/bpf/seccomp2_user.c

-- 
2.14.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-02-16 18:40 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-13 15:42 [PATCH net-next 0/3] eBPF Seccomp filters Sargun Dhillon
2018-02-13 15:47 ` Kees Cook
2018-02-13 16:29   ` Sargun Dhillon
     [not found]     ` <CAMp4zn8VNurTjmrUtHnaK21A4hUQQz5tnarj15vmTU+TjY79XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 17:02       ` Jessie Frazelle
2018-02-13 17:02     ` Jessie Frazelle
     [not found]       ` <CAEk6tEw3ty0kBH+06TYt4=Ywt-4_cHBa9f8p3ajMghtjRkHmMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 17:07         ` Brian Goff
2018-02-13 17:31         ` Sargun Dhillon
2018-02-13 17:31       ` Sargun Dhillon
2018-02-13 20:16         ` Kees Cook
2018-02-13 21:08           ` Paul Moore
     [not found]           ` <CAGXu5jKv3QFVKLhok1JWiPamE0b4CqLTO-hx8sP0KWED921=6w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 20:50             ` Tycho Andersen
2018-02-13 21:08             ` Paul Moore
     [not found]         ` <CAMp4zn-Lw0grNrCyjHJZUje1Aznaj03iAUWZ86ki68MZMN1-zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 20:16           ` Kees Cook
2018-02-14 17:25   ` Andy Lutomirski
2018-02-14 17:32     ` Tycho Andersen
2018-02-15  4:30       ` Alexei Starovoitov
2018-02-15  4:30       ` Alexei Starovoitov
     [not found]         ` <20180215043027.zssmhvfdn7iz3rlz-+o4/htvd0TCa6kscz5V53/3mLCh9rsb+VpNB7YpNyf8@public.gmane.org>
2018-02-15  8:35           ` Lorenzo Colitti via Containers
2018-02-15 16:05           ` Andy Lutomirski
2018-02-16 18:39           ` Sargun Dhillon
2018-02-15 16:05         ` Andy Lutomirski
2018-02-16 18:39         ` Sargun Dhillon
     [not found]     ` <CALCETrV9xUd3XRgobTDgVNRFY_+o=pEDkfjvuxQ7w_UyH324zA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-14 17:32       ` Tycho Andersen
     [not found]   ` <CAGXu5jLiYh0rSRuJ_-2xLB03Wod5G07njpoESR4SnmsmiUnsEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 16:29     ` Sargun Dhillon
2018-02-14 17:25     ` Andy Lutomirski
2018-02-14  0:47 ` Mickaël Salaün
     [not found] ` <20180213154244.GA3292-du9IEJ8oIxHXYT48pCVpJ3c7ZZ+wIVaZYkHkVr5ML8kVGlcevz2xqA@public.gmane.org>
2018-02-13 15:47   ` Kees Cook
2018-02-14  0:47   ` Mickaël Salaün
  -- strict thread matches above, loose matches on Subject: below --
2018-02-13 20:33 Tom Hromatka
2018-02-13 20:35 ` Kees Cook
2018-02-13 20:38   ` Tom Hromatka
     [not found]   ` <CAGXu5jJZgrgLrhkZO33RNdOds8zwnnOZh+rqwguxJM+zm=EJ7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 20:38     ` Tom Hromatka
     [not found] ` <7eb1497e-e5f3-c5ba-e255-7f510795b51d-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2018-02-13 20:35   ` Kees Cook
2018-02-13 15:42 Sargun Dhillon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.