linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* eBPF / seccomp globals?
@ 2015-09-04  1:01 Michael Tirado
  2015-09-04  3:17 ` Alexei Starovoitov
  2015-09-04  4:01 ` Kees Cook
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Tirado @ 2015-09-04  1:01 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hiyall,

I have created a seccomp white list filter for a program that launches
other less trustworthy programs.  It's working great so far, but I
have run into a little roadblock.  the launcher program needs to call
execve as it's final step, but that may not be present in the white
list.  I am wondering if there is any way to use some sort of global
variable that will be preserved between syscall filter calls so that I
can allow only one execve, if not present in white list by
incrementing a counter variable.

I see that in Documentation/networking/filter.txt one of the registers
is documented as being a pointer to struct sk_buff, in the seccomp
context this is a pointer to struct seccomp_data  instead, right?  and
the line about callee saved registers R6-R9  probably refers to them
being saved across calls within that filter, and not calls between
filters?

My apologies if this is not the appropriate place to ask for help, but
it is difficult to find useful information on how eBPF works, and is a
bit confusing trying to figure out the differences between seccomp and
net filters, and the old bpf code kicking around short of spending
countless hours reading through all of it.  If anybody has a some
links to share I would be very grateful.  the only way I can think to
make this work otherwise is to mount everything as MS_NOEXEC in the
new namespace, but that just feels wrong.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04  1:01 eBPF / seccomp globals? Michael Tirado
@ 2015-09-04  3:17 ` Alexei Starovoitov
  2015-09-04 14:03   ` Tycho Andersen
  2015-09-04  4:01 ` Kees Cook
  1 sibling, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2015-09-04  3:17 UTC (permalink / raw)
  To: Michael Tirado
  Cc: netdev, linux-kernel, Tycho Andersen, Serge E. Hallyn, Kees Cook

On Fri, Sep 04, 2015 at 01:01:20AM +0000, Michael Tirado wrote:
> Hiyall,
> 
> I have created a seccomp white list filter for a program that launches
> other less trustworthy programs.  It's working great so far, but I
> have run into a little roadblock.  the launcher program needs to call
> execve as it's final step, but that may not be present in the white
> list.  I am wondering if there is any way to use some sort of global
> variable that will be preserved between syscall filter calls so that I
> can allow only one execve, if not present in white list by
> incrementing a counter variable.
> 
> I see that in Documentation/networking/filter.txt one of the registers
> is documented as being a pointer to struct sk_buff, in the seccomp
> context this is a pointer to struct seccomp_data  instead, right?  and
> the line about callee saved registers R6-R9  probably refers to them
> being saved across calls within that filter, and not calls between
> filters?

R6-R9 are the registered preserved across calls to helper functions
within single program. They are not preserved across invocations
of the same program. At the start of the program only R1 (pointer
to context) is valid.
The eBPF programs used for kprobes, sockets and TC can simulate
global state via maps. Like a map of one element can have some
'struct globals { ... }' as a value in such map. Then programs
can keep global state in there. If a key into such map is cpu_id,
then such state becomes per-cpu global. Other tricks possible too.
Unfortunately seccomp doesn't have access to eBPF yet
(only classic BPF is supported), but, I believe, Tycho is
working on adding eBPF to seccomp and criu of eBPF programs...


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04  1:01 eBPF / seccomp globals? Michael Tirado
  2015-09-04  3:17 ` Alexei Starovoitov
@ 2015-09-04  4:01 ` Kees Cook
  2015-09-04 20:29   ` Michael Tirado
  1 sibling, 1 reply; 11+ messages in thread
From: Kees Cook @ 2015-09-04  4:01 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Network Development, LKML

On Thu, Sep 3, 2015 at 6:01 PM, Michael Tirado <mtirado418@gmail.com> wrote:
> Hiyall,
>
> I have created a seccomp white list filter for a program that launches
> other less trustworthy programs.  It's working great so far, but I
> have run into a little roadblock.  the launcher program needs to call
> execve as it's final step, but that may not be present in the white
> list.  I am wondering if there is any way to use some sort of global
> variable that will be preserved between syscall filter calls so that I
> can allow only one execve, if not present in white list by
> incrementing a counter variable.
>
> I see that in Documentation/networking/filter.txt one of the registers
> is documented as being a pointer to struct sk_buff, in the seccomp
> context this is a pointer to struct seccomp_data  instead, right?  and
> the line about callee saved registers R6-R9  probably refers to them
> being saved across calls within that filter, and not calls between
> filters?
>
> My apologies if this is not the appropriate place to ask for help, but
> it is difficult to find useful information on how eBPF works, and is a
> bit confusing trying to figure out the differences between seccomp and
> net filters, and the old bpf code kicking around short of spending
> countless hours reading through all of it.  If anybody has a some
> links to share I would be very grateful.  the only way I can think to
> make this work otherwise is to mount everything as MS_NOEXEC in the
> new namespace, but that just feels wrong.

For documentation, there's some great slides on seccomp from Plumber's
this year[1].

At present, there is no variable state beyond the syscall context (PC,
args) available to seccomp filters. The no_new_privs prctl was added
to reduce the risk of including execve in a filter's whitelist, but
that isn't as strong as the "exec once" feature you want.

What we did in Chrome OS was to use the "minijail" tool[2] to
LD_PRELOAD a .so that sets up the seccomp filter after the exec. It's
a bit of a hack, but works in well-defined environments. You are
talking about namespaces, though, so maybe minijail is worth a look?
It does that too and a whole lot more.

As for using maps via eBPF in seccomp, it's on the horizon, but it
comes with a lot exposure that I haven't finished pondering, so I
don't think those features will be added soon.

-Kees

[1] http://man7.org/conf/lpc2015/limiting_kernel_attack_surface_with_seccomp-LPC_2015-Kerrisk.pdf
[2] see subdirectory "minijail" after "git clone
https://chromium.googlesource.com/chromiumos/platform2/"


-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04  3:17 ` Alexei Starovoitov
@ 2015-09-04 14:03   ` Tycho Andersen
  0 siblings, 0 replies; 11+ messages in thread
From: Tycho Andersen @ 2015-09-04 14:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Michael Tirado, netdev, linux-kernel, Serge E. Hallyn, Kees Cook

Hi all,

On Thu, Sep 03, 2015 at 08:17:05PM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 04, 2015 at 01:01:20AM +0000, Michael Tirado wrote:
> > Hiyall,
> > 
> > I have created a seccomp white list filter for a program that launches
> > other less trustworthy programs.  It's working great so far, but I
> > have run into a little roadblock.  the launcher program needs to call
> > execve as it's final step, but that may not be present in the white
> > list.  I am wondering if there is any way to use some sort of global
> > variable that will be preserved between syscall filter calls so that I
> > can allow only one execve, if not present in white list by
> > incrementing a counter variable.
> > 
> > I see that in Documentation/networking/filter.txt one of the registers
> > is documented as being a pointer to struct sk_buff, in the seccomp
> > context this is a pointer to struct seccomp_data  instead, right?  and
> > the line about callee saved registers R6-R9  probably refers to them
> > being saved across calls within that filter, and not calls between
> > filters?
> 
> R6-R9 are the registered preserved across calls to helper functions
> within single program. They are not preserved across invocations
> of the same program. At the start of the program only R1 (pointer
> to context) is valid.
> The eBPF programs used for kprobes, sockets and TC can simulate
> global state via maps. Like a map of one element can have some
> 'struct globals { ... }' as a value in such map. Then programs
> can keep global state in there. If a key into such map is cpu_id,
> then such state becomes per-cpu global. Other tricks possible too.
> Unfortunately seccomp doesn't have access to eBPF yet
> (only classic BPF is supported), but, I believe, Tycho is
> working on adding eBPF to seccomp and criu of eBPF programs...

Indeed I am, however my patches don't have support for seccomp
programs using eBPF maps. I'm intending to post them later today, so
stay tuned.

Tycho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04  4:01 ` Kees Cook
@ 2015-09-04 20:29   ` Michael Tirado
  2015-09-04 20:37     ` Kees Cook
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Tirado @ 2015-09-04 20:29 UTC (permalink / raw)
  To: Kees Cook; +Cc: Network Development, LKML, alexei.starovoitov

> What we did in Chrome OS was to use the "minijail" tool[2] to
> LD_PRELOAD a .so that sets up the seccomp filter after the exec. It's
> a bit of a hack, but works in well-defined environments. You are
> talking about namespaces, though, so maybe minijail is worth a look?
> It does that too and a whole lot more.

Minijail is pretty similar to what I have been working on the past few
months,  unfortunately I have already written it, doh!  Those slides
are a good resource,  definitely helpful as introduction to seccomp.

So it seems there are no easy solutions to this problem. Using
LD_PRELOAD to defer seccomp filter application scares me a little bit,
and won't work with file capabilities IIRC, though it is a damn clever
solution.  I think for now I will explore the possibility of
validating argument 1 of exec to allow only the program I am launching
to be exec'd, so if somehow by Thor's hammer that program escapes it's
sandbox, it will only be able to exec itself.  I suppose it will have
to now be restricted to absolute paths only.

Thanks everyone for the clarification!

On Fri, Sep 4, 2015 at 4:01 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 3, 2015 at 6:01 PM, Michael Tirado <mtirado418@gmail.com> wrote:
>> Hiyall,
>>
>> I have created a seccomp white list filter for a program that launches
>> other less trustworthy programs.  It's working great so far, but I
>> have run into a little roadblock.  the launcher program needs to call
>> execve as it's final step, but that may not be present in the white
>> list.  I am wondering if there is any way to use some sort of global
>> variable that will be preserved between syscall filter calls so that I
>> can allow only one execve, if not present in white list by
>> incrementing a counter variable.
>>
>> I see that in Documentation/networking/filter.txt one of the registers
>> is documented as being a pointer to struct sk_buff, in the seccomp
>> context this is a pointer to struct seccomp_data  instead, right?  and
>> the line about callee saved registers R6-R9  probably refers to them
>> being saved across calls within that filter, and not calls between
>> filters?
>>
>> My apologies if this is not the appropriate place to ask for help, but
>> it is difficult to find useful information on how eBPF works, and is a
>> bit confusing trying to figure out the differences between seccomp and
>> net filters, and the old bpf code kicking around short of spending
>> countless hours reading through all of it.  If anybody has a some
>> links to share I would be very grateful.  the only way I can think to
>> make this work otherwise is to mount everything as MS_NOEXEC in the
>> new namespace, but that just feels wrong.
>
> For documentation, there's some great slides on seccomp from Plumber's
> this year[1].
>
> At present, there is no variable state beyond the syscall context (PC,
> args) available to seccomp filters. The no_new_privs prctl was added
> to reduce the risk of including execve in a filter's whitelist, but
> that isn't as strong as the "exec once" feature you want.
>
> What we did in Chrome OS was to use the "minijail" tool[2] to
> LD_PRELOAD a .so that sets up the seccomp filter after the exec. It's
> a bit of a hack, but works in well-defined environments. You are
> talking about namespaces, though, so maybe minijail is worth a look?
> It does that too and a whole lot more.
>
> As for using maps via eBPF in seccomp, it's on the horizon, but it
> comes with a lot exposure that I haven't finished pondering, so I
> don't think those features will be added soon.
>
> -Kees
>
> [1] http://man7.org/conf/lpc2015/limiting_kernel_attack_surface_with_seccomp-LPC_2015-Kerrisk.pdf
> [2] see subdirectory "minijail" after "git clone
> https://chromium.googlesource.com/chromiumos/platform2/"
>
>
> --
> Kees Cook
> Chrome OS Security

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04 20:29   ` Michael Tirado
@ 2015-09-04 20:37     ` Kees Cook
  2015-09-10 21:55       ` Michael Tirado
  0 siblings, 1 reply; 11+ messages in thread
From: Kees Cook @ 2015-09-04 20:37 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Network Development, LKML, Alexei Starovoitov

On Fri, Sep 4, 2015 at 1:29 PM, Michael Tirado <mtirado418@gmail.com> wrote:
>> What we did in Chrome OS was to use the "minijail" tool[2] to
>> LD_PRELOAD a .so that sets up the seccomp filter after the exec. It's
>> a bit of a hack, but works in well-defined environments. You are
>> talking about namespaces, though, so maybe minijail is worth a look?
>> It does that too and a whole lot more.
>
> Minijail is pretty similar to what I have been working on the past few
> months,  unfortunately I have already written it, doh!  Those slides
> are a good resource,  definitely helpful as introduction to seccomp.
>
> So it seems there are no easy solutions to this problem. Using
> LD_PRELOAD to defer seccomp filter application scares me a little bit,
> and won't work with file capabilities IIRC, though it is a damn clever

Do you still need file capabilities with the availability of the new
ambient capabilities?

https://s3hh.wordpress.com/2015/07/25/ambient-capabilities/
http://thread.gmane.org/gmane.linux.kernel.lsm/24034

> solution.  I think for now I will explore the possibility of
> validating argument 1 of exec to allow only the program I am launching
> to be exec'd, so if somehow by Thor's hammer that program escapes it's
> sandbox, it will only be able to exec itself.  I suppose it will have
> to now be restricted to absolute paths only.

Well, you can only examine the memory address and not what's pointed
to, so you may be out of luck there too. Sorry! On the TODO list is
doing deep argument inspection, but it is not an easy thing to get
right. :)

-Kees

>
> Thanks everyone for the clarification!
>
> On Fri, Sep 4, 2015 at 4:01 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Sep 3, 2015 at 6:01 PM, Michael Tirado <mtirado418@gmail.com> wrote:
>>> Hiyall,
>>>
>>> I have created a seccomp white list filter for a program that launches
>>> other less trustworthy programs.  It's working great so far, but I
>>> have run into a little roadblock.  the launcher program needs to call
>>> execve as it's final step, but that may not be present in the white
>>> list.  I am wondering if there is any way to use some sort of global
>>> variable that will be preserved between syscall filter calls so that I
>>> can allow only one execve, if not present in white list by
>>> incrementing a counter variable.
>>>
>>> I see that in Documentation/networking/filter.txt one of the registers
>>> is documented as being a pointer to struct sk_buff, in the seccomp
>>> context this is a pointer to struct seccomp_data  instead, right?  and
>>> the line about callee saved registers R6-R9  probably refers to them
>>> being saved across calls within that filter, and not calls between
>>> filters?
>>>
>>> My apologies if this is not the appropriate place to ask for help, but
>>> it is difficult to find useful information on how eBPF works, and is a
>>> bit confusing trying to figure out the differences between seccomp and
>>> net filters, and the old bpf code kicking around short of spending
>>> countless hours reading through all of it.  If anybody has a some
>>> links to share I would be very grateful.  the only way I can think to
>>> make this work otherwise is to mount everything as MS_NOEXEC in the
>>> new namespace, but that just feels wrong.
>>
>> For documentation, there's some great slides on seccomp from Plumber's
>> this year[1].
>>
>> At present, there is no variable state beyond the syscall context (PC,
>> args) available to seccomp filters. The no_new_privs prctl was added
>> to reduce the risk of including execve in a filter's whitelist, but
>> that isn't as strong as the "exec once" feature you want.
>>
>> What we did in Chrome OS was to use the "minijail" tool[2] to
>> LD_PRELOAD a .so that sets up the seccomp filter after the exec. It's
>> a bit of a hack, but works in well-defined environments. You are
>> talking about namespaces, though, so maybe minijail is worth a look?
>> It does that too and a whole lot more.
>>
>> As for using maps via eBPF in seccomp, it's on the horizon, but it
>> comes with a lot exposure that I haven't finished pondering, so I
>> don't think those features will be added soon.
>>
>> -Kees
>>
>> [1] http://man7.org/conf/lpc2015/limiting_kernel_attack_surface_with_seccomp-LPC_2015-Kerrisk.pdf
>> [2] see subdirectory "minijail" after "git clone
>> https://chromium.googlesource.com/chromiumos/platform2/"
>>
>>
>> --
>> Kees Cook
>> Chrome OS Security



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-04 20:37     ` Kees Cook
@ 2015-09-10 21:55       ` Michael Tirado
  2015-09-10 23:22         ` Michael Tirado
  2015-09-29 23:44         ` Kees Cook
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Tirado @ 2015-09-10 21:55 UTC (permalink / raw)
  To: Kees Cook; +Cc: LKML

On Fri, Sep 4, 2015 at 8:37 PM, Kees Cook <keescook@chromium.org> wrote:
>
> Do you still need file capabilities with the availability of the new
> ambient capabilities?
>
> https://s3hh.wordpress.com/2015/07/25/ambient-capabilities/
> http://thread.gmane.org/gmane.linux.kernel.lsm/24034

Ah.. thanks for the info on this, my launcher program could use ambient
capabilities if whoever invoked it already has that capability. I am trying
to have the new environment explicitly defined as a white list, and avoid
any type of privilege escalation not already granted by root user either
by filesystem mechanisms (setuid / file caps) or inheritable caps.

I would still like to be able to launch programs with file capabilities since we
can lock those down with capability bounding set, and maybe even setuid
binaries too (with a hefty warning message). This rules out LD_PRELOAD
for me, and also some linkers may not support it at all.



> On the TODO list is
> doing deep argument inspection, but it is not an easy thing to get
> right. :)

Yes, please do not rush such a thing!!  It might even be a can of worms
not worth opening.



In case anyone is wondering what I am doing for-now(tm) while waiting for
eBPF map support, or some other way to deal with this problem: I have crafted
a very hacky patch to work around the issue that will allow 2 system calls to
pass through before the filter program is run. I'm lazily using google
webmail so,
sorry if the tabs are missing :(



From: Michael R. Tirado <mtirado418@gmail.com>
Date: Thu, 10 Sep 2015 08:28:41 +0000
Subject: [PATCH] Add new seccomp filter mode + flag to allow two syscalls to
 pass before the filter is run. allows a launcher program to setuid(drop caps)
 and exec if those two privileges are not granted in seccomp filter whitelist.

 DISCLAIMER:
 I am doing this as a quick temporary workaround to this complex problem.
 Also, there may be a more efficient way to implement it instead of
 branching in the filter loop.
---
 include/linux/seccomp.h      |  2 +-
 include/uapi/linux/seccomp.h |  2 ++
 kernel/seccomp.c             | 23 ++++++++++++++++++++---
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index a19ddac..5547448c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,7 +3,7 @@

 #include <uapi/linux/seccomp.h>

-#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC)
+#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC |
SECCOMP_FILTER_FLAG_DEFERRED)

 #ifdef CONFIG_SECCOMP

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..43a8fb8 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -9,6 +9,7 @@
 #define SECCOMP_MODE_DISABLED    0 /* seccomp is not in use. */
 #define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */
+#define SECCOMP_MODE_FILTER_DEFERRED 3 /* sets filter mode + deferred flag */

 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT    0
@@ -16,6 +17,7 @@

 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC    1
+#define SECCOMP_FILTER_FLAG_DEFERRED    2 /* grant two unfiltered syscalls */

 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..dc2a5af 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -58,6 +58,7 @@ struct seccomp_filter {
     atomic_t usage;
     struct seccomp_filter *prev;
     struct bpf_prog *prog;
+    unsigned int deferred;
 };

 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -196,7 +197,12 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
      * value always takes priority (ignoring the DATA).
      */
     for (; f; f = f->prev) {
-        u32 cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
+        u32 cur_ret;
+        if (unlikely(f->deferred)) {
+            --f->deferred;
+            continue;
+        }
+        cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);

         if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
             ret = cur_ret;
@@ -444,6 +450,14 @@ static long seccomp_attach_filter(unsigned int flags,
     }

     /*
+     * in certain cases we may wish to defer filtering, and allow some
+     * syscalls. eg, a launcher program will setuid(drop caps) then exec.
+     */
+    if (flags & SECCOMP_FILTER_FLAG_DEFERRED) {
+        filter->deferred = 2;
+    }
+
+    /*
      * If there is an existing filter, make it the prev and don't drop its
      * task reference.
      */
@@ -838,6 +852,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
char __user *filter)
 {
     unsigned int op;
     char __user *uargs;
+    unsigned int flags = 0;

     switch (seccomp_mode) {
     case SECCOMP_MODE_STRICT:
@@ -849,6 +864,9 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
char __user *filter)
          */
         uargs = NULL;
         break;
+        /* set flag, older kernels lack seccomp syscall */
+    case SECCOMP_MODE_FILTER_DEFERRED:
+        flags = SECCOMP_FILTER_FLAG_DEFERRED;
     case SECCOMP_MODE_FILTER:
         op = SECCOMP_SET_MODE_FILTER;
         uargs = filter;
@@ -857,6 +875,5 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
char __user *filter)
         return -EINVAL;
     }

-    /* prctl interface doesn't have flags, so they are always zero. */
-    return do_seccomp(op, 0, uargs);
+    return do_seccomp(op, flags, uargs);
 }
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-10 21:55       ` Michael Tirado
@ 2015-09-10 23:22         ` Michael Tirado
  2015-09-29 23:44         ` Kees Cook
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Tirado @ 2015-09-10 23:22 UTC (permalink / raw)
  To: Kees Cook; +Cc: LKML

Sorry for spamming you guys with my ugly, poorly formatted hack.
I had to change something due to my shallow knowledge of how
multiprocessors / multiple threads sharing a filter would behave.
There may have been a possibility for the deferred integer to rollover
and ruin everything?  just in case someone tries to use it....



From: Michael R Tirado <mtirado418@gmail.com>
Date: Thu, 10 Sep 2015 23:06:55 +0000
Subject: [PATCH 2/2] changed to signed integer, due to multiprocesor
 uncertainty.

---
 kernel/seccomp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dc2a5af..e39e8c2 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -58,7 +58,7 @@ struct seccomp_filter {
     atomic_t usage;
     struct seccomp_filter *prev;
     struct bpf_prog *prog;
-    unsigned int deferred;
+    int deferred;
 };

 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -198,7 +198,7 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
      */
     for (; f; f = f->prev) {
         u32 cur_ret;
-        if (unlikely(f->deferred)) {
+        if (unlikely(f->deferred > 0)) {
             --f->deferred;
             continue;
         }
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-10 21:55       ` Michael Tirado
  2015-09-10 23:22         ` Michael Tirado
@ 2015-09-29 23:44         ` Kees Cook
  2015-09-30  0:07           ` Andy Lutomirski
  2015-10-06 16:00           ` Michael Tirado
  1 sibling, 2 replies; 11+ messages in thread
From: Kees Cook @ 2015-09-29 23:44 UTC (permalink / raw)
  To: Michael Tirado, Andy Lutomirski; +Cc: LKML, Will Drewry

On Thu, Sep 10, 2015 at 2:55 PM, Michael Tirado <mtirado418@gmail.com> wrote:
> On Fri, Sep 4, 2015 at 8:37 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> Do you still need file capabilities with the availability of the new
>> ambient capabilities?
>>
>> https://s3hh.wordpress.com/2015/07/25/ambient-capabilities/
>> http://thread.gmane.org/gmane.linux.kernel.lsm/24034
>
> Ah.. thanks for the info on this, my launcher program could use ambient
> capabilities if whoever invoked it already has that capability. I am trying
> to have the new environment explicitly defined as a white list, and avoid
> any type of privilege escalation not already granted by root user either
> by filesystem mechanisms (setuid / file caps) or inheritable caps.
>
> I would still like to be able to launch programs with file capabilities since we
> can lock those down with capability bounding set, and maybe even setuid
> binaries too (with a hefty warning message). This rules out LD_PRELOAD
> for me, and also some linkers may not support it at all.
>
>
>
>> On the TODO list is
>> doing deep argument inspection, but it is not an easy thing to get
>> right. :)
>
> Yes, please do not rush such a thing!!  It might even be a can of worms
> not worth opening.
>
>
>
> In case anyone is wondering what I am doing for-now(tm) while waiting for
> eBPF map support, or some other way to deal with this problem: I have crafted
> a very hacky patch to work around the issue that will allow 2 system calls to
> pass through before the filter program is run. I'm lazily using google
> webmail so,
> sorry if the tabs are missing :(

I

>
>
>
> From: Michael R. Tirado <mtirado418@gmail.com>
> Date: Thu, 10 Sep 2015 08:28:41 +0000
> Subject: [PATCH] Add new seccomp filter mode + flag to allow two syscalls to
>  pass before the filter is run. allows a launcher program to setuid(drop caps)
>  and exec if those two privileges are not granted in seccomp filter whitelist.
>
>  DISCLAIMER:
>  I am doing this as a quick temporary workaround to this complex problem.
>  Also, there may be a more efficient way to implement it instead of
>  branching in the filter loop.
> ---
>  include/linux/seccomp.h      |  2 +-
>  include/uapi/linux/seccomp.h |  2 ++
>  kernel/seccomp.c             | 23 ++++++++++++++++++++---
>  3 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index a19ddac..5547448c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -3,7 +3,7 @@
>
>  #include <uapi/linux/seccomp.h>
>
> -#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC)
> +#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC |
> SECCOMP_FILTER_FLAG_DEFERRED)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0f238a4..43a8fb8 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -9,6 +9,7 @@
>  #define SECCOMP_MODE_DISABLED    0 /* seccomp is not in use. */
>  #define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
>  #define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */
> +#define SECCOMP_MODE_FILTER_DEFERRED 3 /* sets filter mode + deferred flag */
>
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT    0
> @@ -16,6 +17,7 @@
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC    1
> +#define SECCOMP_FILTER_FLAG_DEFERRED    2 /* grant two unfiltered syscalls */
>
>  /*
>   * All BPF programs must return a 32-bit value.
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 245df6b..dc2a5af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -58,6 +58,7 @@ struct seccomp_filter {
>      atomic_t usage;
>      struct seccomp_filter *prev;
>      struct bpf_prog *prog;
> +    unsigned int deferred;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -196,7 +197,12 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
>       * value always takes priority (ignoring the DATA).
>       */
>      for (; f; f = f->prev) {
> -        u32 cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
> +        u32 cur_ret;
> +        if (unlikely(f->deferred)) {
> +            --f->deferred;
> +            continue;
> +        }
> +        cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);

I do like the idea of a deferred filters, but I wouldn't want them
checked this way (it adds a fixed cost to all filters). I would rather
that a separate list of filters exist, waiting for something like
"exec" to trigger them getting added to the actual filter list. In
fact, probably you'd want something like "prepare" and "cancel" too,
then you could "prepare", fork, exec. Then the child would get the
prepared filters added, and the parent could call "cancel" to drop the
prepared filters from itself.

Andy, Will, do you see issues with a class of "deferred" filters that
would be attached after exec?

>
>          if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>              ret = cur_ret;
> @@ -444,6 +450,14 @@ static long seccomp_attach_filter(unsigned int flags,
>      }
>
>      /*
> +     * in certain cases we may wish to defer filtering, and allow some
> +     * syscalls. eg, a launcher program will setuid(drop caps) then exec.
> +     */
> +    if (flags & SECCOMP_FILTER_FLAG_DEFERRED) {
> +        filter->deferred = 2;
> +    }
> +
> +    /*
>       * If there is an existing filter, make it the prev and don't drop its
>       * task reference.
>       */
> @@ -838,6 +852,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>  {
>      unsigned int op;
>      char __user *uargs;
> +    unsigned int flags = 0;
>
>      switch (seccomp_mode) {
>      case SECCOMP_MODE_STRICT:
> @@ -849,6 +864,9 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>           */
>          uargs = NULL;
>          break;
> +        /* set flag, older kernels lack seccomp syscall */
> +    case SECCOMP_MODE_FILTER_DEFERRED:
> +        flags = SECCOMP_FILTER_FLAG_DEFERRED;
>      case SECCOMP_MODE_FILTER:
>          op = SECCOMP_SET_MODE_FILTER;
>          uargs = filter;
> @@ -857,6 +875,5 @@ long prctl_set_seccomp(unsigned long seccomp_mode,
> char __user *filter)
>          return -EINVAL;
>      }
>
> -    /* prctl interface doesn't have flags, so they are always zero. */
> -    return do_seccomp(op, 0, uargs);
> +    return do_seccomp(op, flags, uargs);

Sorry, no, this isn't allowed, as the comment above it says. New flags
must happen only via the seccomp syscall.

>  }
> --
> 1.8.4

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-29 23:44         ` Kees Cook
@ 2015-09-30  0:07           ` Andy Lutomirski
  2015-10-06 16:00           ` Michael Tirado
  1 sibling, 0 replies; 11+ messages in thread
From: Andy Lutomirski @ 2015-09-30  0:07 UTC (permalink / raw)
  To: Kees Cook; +Cc: Michael Tirado, LKML, Will Drewry

On Tue, Sep 29, 2015 at 4:44 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 10, 2015 at 2:55 PM, Michael Tirado <mtirado418@gmail.com> wrote:
>> On Fri, Sep 4, 2015 at 8:37 PM, Kees Cook <keescook@chromium.org> wrote:
>>>
>>> Do you still need file capabilities with the availability of the new
>>> ambient capabilities?
>>>
>>> https://s3hh.wordpress.com/2015/07/25/ambient-capabilities/
>>> http://thread.gmane.org/gmane.linux.kernel.lsm/24034
>>
>> Ah.. thanks for the info on this, my launcher program could use ambient
>> capabilities if whoever invoked it already has that capability. I am trying
>> to have the new environment explicitly defined as a white list, and avoid
>> any type of privilege escalation not already granted by root user either
>> by filesystem mechanisms (setuid / file caps) or inheritable caps.
>>
>> I would still like to be able to launch programs with file capabilities since we
>> can lock those down with capability bounding set, and maybe even setuid
>> binaries too (with a hefty warning message). This rules out LD_PRELOAD
>> for me, and also some linkers may not support it at all.
>>
>>
>>
>>> On the TODO list is
>>> doing deep argument inspection, but it is not an easy thing to get
>>> right. :)
>>
>> Yes, please do not rush such a thing!!  It might even be a can of worms
>> not worth opening.
>>
>>
>>
>> In case anyone is wondering what I am doing for-now(tm) while waiting for
>> eBPF map support, or some other way to deal with this problem: I have crafted
>> a very hacky patch to work around the issue that will allow 2 system calls to
>> pass through before the filter program is run. I'm lazily using google
>> webmail so,
>> sorry if the tabs are missing :(
>
> I
>
>>
>>
>>
>> From: Michael R. Tirado <mtirado418@gmail.com>
>> Date: Thu, 10 Sep 2015 08:28:41 +0000
>> Subject: [PATCH] Add new seccomp filter mode + flag to allow two syscalls to
>>  pass before the filter is run. allows a launcher program to setuid(drop caps)
>>  and exec if those two privileges are not granted in seccomp filter whitelist.
>>
>>  DISCLAIMER:
>>  I am doing this as a quick temporary workaround to this complex problem.
>>  Also, there may be a more efficient way to implement it instead of
>>  branching in the filter loop.
>> ---
>>  include/linux/seccomp.h      |  2 +-
>>  include/uapi/linux/seccomp.h |  2 ++
>>  kernel/seccomp.c             | 23 ++++++++++++++++++++---
>>  3 files changed, 23 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index a19ddac..5547448c 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -3,7 +3,7 @@
>>
>>  #include <uapi/linux/seccomp.h>
>>
>> -#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC)
>> +#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC |
>> SECCOMP_FILTER_FLAG_DEFERRED)
>>
>>  #ifdef CONFIG_SECCOMP
>>
>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index 0f238a4..43a8fb8 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -9,6 +9,7 @@
>>  #define SECCOMP_MODE_DISABLED    0 /* seccomp is not in use. */
>>  #define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
>>  #define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */
>> +#define SECCOMP_MODE_FILTER_DEFERRED 3 /* sets filter mode + deferred flag */
>>
>>  /* Valid operations for seccomp syscall. */
>>  #define SECCOMP_SET_MODE_STRICT    0
>> @@ -16,6 +17,7 @@
>>
>>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>>  #define SECCOMP_FILTER_FLAG_TSYNC    1
>> +#define SECCOMP_FILTER_FLAG_DEFERRED    2 /* grant two unfiltered syscalls */
>>
>>  /*
>>   * All BPF programs must return a 32-bit value.
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 245df6b..dc2a5af 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -58,6 +58,7 @@ struct seccomp_filter {
>>      atomic_t usage;
>>      struct seccomp_filter *prev;
>>      struct bpf_prog *prog;
>> +    unsigned int deferred;
>>  };
>>
>>  /* Limit any path through the tree to 256KB worth of instructions. */
>> @@ -196,7 +197,12 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
>>       * value always takes priority (ignoring the DATA).
>>       */
>>      for (; f; f = f->prev) {
>> -        u32 cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
>> +        u32 cur_ret;
>> +        if (unlikely(f->deferred)) {
>> +            --f->deferred;
>> +            continue;
>> +        }
>> +        cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
>
> I do like the idea of a deferred filters, but I wouldn't want them
> checked this way (it adds a fixed cost to all filters). I would rather
> that a separate list of filters exist, waiting for something like
> "exec" to trigger them getting added to the actual filter list. In
> fact, probably you'd want something like "prepare" and "cancel" too,
> then you could "prepare", fork, exec. Then the child would get the
> prepared filters added, and the parent could call "cancel" to drop the
> prepared filters from itself.
>
> Andy, Will, do you see issues with a class of "deferred" filters that
> would be attached after exec?

No security issue, but it's not beautiful.  What prevents nasty races
if you allow fork()?

We could consider more generally allowing limited stateful filters
using eBPF.  Then we'd change the state in exec using eBPF code.
Before we do any of the above, I think we need to nail down just what
the identity of a filter is, though.  With eBPF, it's possible to
attach the same program to two tasks.

--Andy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: eBPF / seccomp globals?
  2015-09-29 23:44         ` Kees Cook
  2015-09-30  0:07           ` Andy Lutomirski
@ 2015-10-06 16:00           ` Michael Tirado
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Tirado @ 2015-10-06 16:00 UTC (permalink / raw)
  To: Kees Cook; +Cc: Andy Lutomirski, LKML, Will Drewry

On Tue, Sep 29, 2015 at 11:44 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 10, 2015 at 2:55 PM, Michael Tirado <mtirado418@gmail.com> wrote:
>> On Fri, Sep 4, 2015 at 8:37 PM, Kees Cook <keescook@chromium.org> wrote:
>>>
>> @@ -196,7 +197,12 @@ static u32 seccomp_run_filters(struct seccomp_data *sd)
>>       * value always takes priority (ignoring the DATA).
>>       */
>>      for (; f; f = f->prev) {
>> -        u32 cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
>> +        u32 cur_ret;
>> +        if (unlikely(f->deferred)) {
>> +            --f->deferred;
>> +            continue;
>> +        }
>> +        cur_ret = BPF_PROG_RUN(f->prog, (void *)sd);
>
> I do like the idea of a deferred filters, but I wouldn't want them
> checked this way (it adds a fixed cost to all filters). I would rather
> that a separate list of filters exist, waiting for something like
> "exec" to trigger them getting added to the actual filter list. In
> fact, probably you'd want something like "prepare" and "cancel" too,
> then you could "prepare", fork, exec. Then the child would get the
> prepared filters added, and the parent could call "cancel" to drop the
> prepared filters from itself.
>
> Andy, Will, do you see issues with a class of "deferred" filters that
> would be attached after exec?
>


Thanks for the feedback.
Like Andy said, it's definitely not beautiful, but I decided anyway
to spend some time writing a more legitimate version. It seems that unless
we can make a new temporary mode for detecting execve, the check
on SECCOMP_RET_ALLOW return is unavoidable. The comments seem to forbid mode
changes, so I did not test if that approach would be more optimized. This new
version only checks once unlike the previous that checked for each filter.

Again, sorry about the formatting,  I can re-send this the proper way if
interested, or let me know of any other ideas for solving this problem.



Subject: [PATCH] seccomp: Add deferred filter flag

---
 include/asm-generic/seccomp.h                      |   2 +
 include/linux/seccomp.h                            |   4 +-
 include/uapi/linux/seccomp.h                       |   2 +-
 kernel/seccomp.c                                   |  52 +++++++++-
 tools/testing/selftests/seccomp/Makefile           |   1 +
 tools/testing/selftests/seccomp/seccomp_bpf.c      | 105 +++++++++++++++++++++
 .../testing/selftests/seccomp/seccomp_defer_test.c |  15 +++
 7 files changed, 176 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/seccomp/seccomp_defer_test.c

diff --git a/include/asm-generic/seccomp.h b/include/asm-generic/seccomp.h
index c9ccafa..5dd7e7d 100644
--- a/include/asm-generic/seccomp.h
+++ b/include/asm-generic/seccomp.h
@@ -29,4 +29,6 @@
 #define __NR_seccomp_sigreturn        __NR_rt_sigreturn
 #endif

+#define __NR_seccomp_execve        __NR_execve
+
 #endif /* _ASM_GENERIC_SECCOMP_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index a19ddac..8ab7a68 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,7 +3,8 @@

 #include <uapi/linux/seccomp.h>

-#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC)
+#define SECCOMP_FILTER_FLAG_MASK    (SECCOMP_FILTER_FLAG_TSYNC |    \
+                     SECCOMP_FILTER_FLAG_DEFER)

 #ifdef CONFIG_SECCOMP

@@ -25,6 +26,7 @@ struct seccomp_filter;
 struct seccomp {
     int mode;
     struct seccomp_filter *filter;
+    struct seccomp_filter *deferred;
 };

 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..01fab07 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,7 +16,7 @@

 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC    1
-
+#define SECCOMP_FILTER_FLAG_DEFER    2
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..6d54066 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -176,11 +176,12 @@ static int seccomp_check_filter(struct
sock_filter *filter, unsigned int flen)
 static u32 seccomp_run_filters(struct seccomp_data *sd)
 {
     struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
+    struct seccomp_filter *d = ACCESS_ONCE(current->seccomp.deferred);
     struct seccomp_data sd_local;
     u32 ret = SECCOMP_RET_ALLOW;

     /* Ensure unexpected behavior doesn't result in failing open. */
-    if (unlikely(WARN_ON(f == NULL)))
+    if (unlikely(WARN_ON(f == NULL && d == NULL)))
         return SECCOMP_RET_KILL;

     /* Make sure cross-thread synced filter points somewhere sane. */
@@ -447,8 +448,14 @@ static long seccomp_attach_filter(unsigned int flags,
      * If there is an existing filter, make it the prev and don't drop its
      * task reference.
      */
-    filter->prev = current->seccomp.filter;
-    current->seccomp.filter = filter;
+    if (flags & SECCOMP_FILTER_FLAG_DEFER) {
+        filter->prev = current->seccomp.deferred;
+        current->seccomp.deferred = filter;
+    }
+    else {
+        filter->prev = current->seccomp.filter;
+        current->seccomp.filter = filter;
+    }

     /* Now that the new filter is in place, synchronize to all threads. */
     if (flags & SECCOMP_FILTER_FLAG_TSYNC)
@@ -487,6 +494,35 @@ void put_seccomp_filter(struct task_struct *tsk)
     }
 }

+/* attach any filters waiting in deferred list */
+static long seccomp_attach_deferred(void)
+{
+    struct seccomp_filter *tmp;
+    int ret;
+
+    spin_lock_irq(&current->sighand->siglock);
+
+    while(current->seccomp.deferred) {
+        tmp = current->seccomp.deferred->prev;
+        ret = seccomp_attach_filter(0, current->seccomp.deferred);
+        if (ret)
+            goto out_free;
+        current->seccomp.deferred = tmp;
+    }
+
+    spin_unlock_irq(&current->sighand->siglock);
+    return 0;
+
+out_free:
+    while(current->seccomp.deferred) {
+        tmp = current->seccomp.deferred->prev;
+        seccomp_filter_free(current->seccomp.deferred);
+        current->seccomp.deferred = tmp;
+    }
+    spin_unlock_irq(&current->sighand->siglock);
+    return ret;
+}
+
 /**
  * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
  * @syscall: syscall number to send to userland
@@ -605,6 +641,12 @@ static u32 __seccomp_phase1_filter(int
this_syscall, struct seccomp_data *sd)
         return filter_ret;  /* Save the rest for phase 2. */

     case SECCOMP_RET_ALLOW:
+        if (unlikely(current->seccomp.deferred) &&
+                this_syscall == __NR_seccomp_execve) {
+            if (seccomp_attach_deferred()) {
+                do_exit(SIGKILL);
+            }
+        }
         return SECCOMP_PHASE1_OK;

     case SECCOMP_RET_KILL:
@@ -815,6 +857,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
             return -EINVAL;
         return seccomp_set_mode_strict();
     case SECCOMP_SET_MODE_FILTER:
+        if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
+            flags & SECCOMP_FILTER_FLAG_DEFER) {
+            return -EINVAL;
+        }
         return seccomp_set_mode_filter(flags, uargs);
     default:
         return -EINVAL;
diff --git a/tools/testing/selftests/seccomp/Makefile
b/tools/testing/selftests/seccomp/Makefile
index 8401e87..69bf673 100644
--- a/tools/testing/selftests/seccomp/Makefile
+++ b/tools/testing/selftests/seccomp/Makefile
@@ -1,4 +1,5 @@
 TEST_PROGS := seccomp_bpf
+TEST_PROGS += seccomp_defer_test
 CFLAGS += -Wl,-no-as-needed -Wall
 LDFLAGS += -lpthread

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c5abe7f..946060a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -2095,6 +2095,111 @@ TEST(syscall_restart)
         _metadata->passed = 0;
 }

+TEST(seccomp_deferred)
+{
+    struct sock_filter filter[] = {
+        BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
+            offsetof(struct seccomp_data, nr)),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_exit_group, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_mmap2, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_mprotect, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_brk, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_open, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_close, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_munmap, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_rt_sigaction, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_rt_sigprocmask, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_fstat64, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_set_thread_area, 0, 1),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+        BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ENOSYS),
+    };
+    struct sock_fprog prog = {
+        .len = (unsigned short)ARRAY_SIZE(filter),
+        .filter = filter,
+    };
+    long ret;
+    uid_t uid;
+    int status;
+    pid_t pid;
+    int i;
+
+    ret = prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0);
+    ASSERT_EQ(0, ret) {
+        TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+    }
+
+    /* fail if tsync is set */
+    ret = seccomp(SECCOMP_SET_MODE_FILTER,
+              SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_DEFER,
+              &prog);
+    EXPECT_NE(0, ret) {
+        TH_LOG("tsync + defer not implemented");
+    }
+
+    pid = fork();
+    if (pid == 0) { /* exec thread */
+        char *env[] = {NULL};
+        char *arg[] = {NULL};
+
+        ret = seccomp(SECCOMP_SET_MODE_FILTER,
+                  SECCOMP_FILTER_FLAG_DEFER,
+                  &prog);
+        EXPECT_EQ(0, ret) {
+            TH_LOG("seccomp syscall failed %s", strerror(errno));
+        }
+
+        /* should allow any system call, try one! */
+        EXPECT_NE(-1, (uid = getuid())) {
+            TH_LOG("getuid error: %s\n", strerror(errno));
+        }
+
+        /*exec'd program should return 57 when it fails write syscall*/
+        if (execve("seccomp_defer_test", arg, env)) {
+            ASSERT_EQ(0, 1) {
+                printf("exec failed: %s\n", strerror(errno));
+            }
+        }
+        abort();
+    }
+    EXPECT_NE(-1, pid) {
+        printf("fork error: %s\n", strerror(errno));
+    }
+
+    for (i = 0; i < 20; ++i) {
+        ret = waitpid(pid, &status, WNOHANG);
+        EXPECT_NE(-1, ret) {
+            TH_LOG("waitpid error: %s\n", strerror(errno));
+        }
+        if (ret == pid) {
+            EXPECT_NE(0,  WIFEXITED(status)) {
+                printf("process did not exit normally\n");
+            }
+            EXPECT_EQ(57, WEXITSTATUS(status)) {
+                printf("unexpected exit status: %d\n",
+                        WEXITSTATUS(status));
+            }
+            break;
+        }
+        usleep(25000);
+    }
+    EXPECT_NE(i, 20) {
+        printf("process timed out\n");
+    }
+}
+
 /*
  * TODO:
  * - add microbenchmarks
diff --git a/tools/testing/selftests/seccomp/seccomp_defer_test.c
b/tools/testing/selftests/seccomp/seccomp_defer_test.c
new file mode 100644
index 0000000..1c62581
--- /dev/null
+++ b/tools/testing/selftests/seccomp/seccomp_defer_test.c
@@ -0,0 +1,15 @@
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+int main(int argc, char *argv[])
+{
+    char str[] = "this should fail\n";
+    if (write(STDOUT_FILENO, str, sizeof(str)) == -1) {
+        if (errno == ENOSYS) {
+            return 57; /* expected behavior */
+        }
+    }
+    return -1;
+}
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-10-06 16:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-04  1:01 eBPF / seccomp globals? Michael Tirado
2015-09-04  3:17 ` Alexei Starovoitov
2015-09-04 14:03   ` Tycho Andersen
2015-09-04  4:01 ` Kees Cook
2015-09-04 20:29   ` Michael Tirado
2015-09-04 20:37     ` Kees Cook
2015-09-10 21:55       ` Michael Tirado
2015-09-10 23:22         ` Michael Tirado
2015-09-29 23:44         ` Kees Cook
2015-09-30  0:07           ` Andy Lutomirski
2015-10-06 16:00           ` Michael Tirado

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).