* [Ksummit-discuss]  [TECH TOPIC] seccomp
@ 2019-07-19  9:35 Christian Brauner
  2019-07-19 12:32 ` Andy Lutomirski
  2019-07-20  7:23 ` James Morris
  0 siblings, 2 replies; 11+ messages in thread
From: Christian Brauner @ 2019-07-19  9:35 UTC (permalink / raw)
  To: ksummit-discuss

Hey everyone,

I would like to discuss approaches to enabling deep argument inspection
with seccomp and if we reach an agreement am also happy to do the work
and implement it.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF which
enables a process (watchee) to retrieve a fd for its seccomp filter.
This fd can then be handed to another (usually more privileged)
process (watcher).
The watcher is then able to receive seccomp messages about the
syscalls performed by the watchee.

I have integrated this feature into userspace. We currently make heavy
use of this to intercept mknod() syscalls in user namespaces aka in
containers.
If the mknod() syscall matches a device in a pre-determined whitelist
the privileged watcher will perform the mknod syscall in lieu of the
unprivileged watchee and report back to the watchee on the success or
failure of its attempt. If the syscall does not match a device in a
whitelist we simply report an error.

We recently also started to intercept the setxattr() syscall to allow
the creation of various, well-known xattrs including
trusted.overlay.opaque.

The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces at all, so
accidentally intercepting and denying syscalls that are not in the
whitelist is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), setxattr() and many other syscalls that we would
like to intercept suffer from two major problems:
1. they are not easily filterable like mknod() because they have pointer
   arguments
2. some of them might actually succeed in user namespaces already (e.g.
   fscaps etc.)

The first problem is not specific to SECCOMP_RET_USER_NOTIF; it
apparently also affects future system call design.
We recently merged the clone3() syscall into mainline. It moves the
flags from a register argument into a dedicated extensible struct
clone_args, which lifts the flag limit of legacy clone() and allows for
extensions while supporting all legacy workloads.
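
To make the difference concrete, here is a small userspace model of
what a register-only filter gets to look at in each case. It is a
sketch: struct demo_clone_args is a stand-in for struct clone_args,
struct demo_syscall_regs stands in for the six argument slots a seccomp
filter sees, and the helper names are made up. The flags-in-first-
argument layout for legacy clone() is assumed as on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct clone_args. */
struct demo_clone_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

#define DEMO_CLONE_NEWUSER 0x10000000ULL /* same value as CLONE_NEWUSER */

/* What a seccomp filter sees: six raw register values. */
struct demo_syscall_regs {
	uint64_t args[6];
};

/* Legacy clone(): the flags travel in a register. */
static struct demo_syscall_regs regs_for_clone(uint64_t flags)
{
	struct demo_syscall_regs r = { { flags, 0, 0, 0, 0, 0 } };
	return r;
}

/* clone3(): the registers hold only a pointer and a size. */
static struct demo_syscall_regs regs_for_clone3(const struct demo_clone_args *a)
{
	struct demo_syscall_regs r = { { 0, 0, 0, 0, 0, 0 } };

	assert(a != NULL);
	r.args[0] = (uint64_t)(uintptr_t)a;	/* an address, not the flags */
	r.args[1] = sizeof(*a);
	return r;
}

/* A register-only check, which is all a classic filter can do. */
static int filter_sees_newuser(const struct demo_syscall_regs *r)
{
	return (r->args[0] & DEMO_CLONE_NEWUSER) != 0;
}
```

For clone3() the same check would test bits of an address, so its
result says nothing about the flags stored behind the pointer.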

One of the counter arguments leveled against my design early on was
that this means clone3() cannot be easily filtered by seccomp due to 1.
This argument was fortunately not seen as defeating.
I would argue that there sure is value in trying to design syscalls that
can be handled by seccomp nicely but that seccomp can't become a burden
on designing extensible syscalls.
The currently proposed openat2() syscall also uses a dedicated
argument struct which contains flags, and the seccomp argument popped
back up again.

In light of all this, I would argue that we should seriously look into
extending seccomp to allow filtering on pointer arguments.

There is a close connection between 1. and 2. When a watcher intercepts
a syscall from a watchee and starts to inspect its arguments it can -
depending on the syscall rather often actually - determine whether or
not the syscall would succeed or fail. If it knows that the syscall will
succeed it currently still has to perform it in lieu of the watchee
since there is no way to tell the kernel to "resume" or actually perform
the syscall. It would be nice if we could discuss approaches to enabling
this feature as well.

I'm happy to lead this session and can also illustrate how this feature
is heavily used and how we run into its limitations.

Thanks!
Christian


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-19  9:35 [Ksummit-discuss] [TECH TOPIC] seccomp Christian Brauner
@ 2019-07-19 12:32 ` Andy Lutomirski
  2019-07-20  3:18   ` Kees Cook
  2019-07-20  7:23 ` James Morris
  1 sibling, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2019-07-19 12:32 UTC (permalink / raw)
  To: Christian Brauner; +Cc: ksummit

On Fri, Jul 19, 2019 at 2:35 AM Christian Brauner <christian@brauner.io> wrote:
>
> In light of all this, I would argue that we should seriously look into
> extending seccomp to allow filtering on pointer arguments.

I won't be at LPC this year, but I was thinking about this anyway.  I
have the following suggestion that might be a bit unorthodox: have
syscalls opt into this filtering.  Specifically, a syscall that
supports pointer filtering would be refactored the way a bunch of our
syscalls are already refactored.  The baseline situation is:

SYSCALL_DEFINE1(syscallname, struct foo __user *, buf) { ... }

Instead, we would do:

SYSCALL_FILTERABLE(syscallname, struct foo __user *, buf)
{
  int ret;
  struct foo kbuf;

  /* copy_from_user() returns the number of bytes NOT copied */
  if (copy_from_user(&kbuf, buf, sizeof(kbuf)))
    return -EFAULT;

  ret = seccomp_deep_filter(syscallname, 0, &kbuf);
  if (ret)
    return ret;

  return do_syscallname(&kbuf);
}

In principle, if we know we're doing a FILTERABLE syscall, we could
skip the initial seccomp invocation and just defer it until
seccomp_deep_filter(), although this might interact badly with any
SECCOMP_RET_TRACE handlers that change nr.

To make this robust, it might help a lot if the generation of these
stubs was mostly automated.
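
The point of the stub is that the filter and the syscall body both see
the same one-time snapshot of the arguments. A rough userspace model of
that flow follows; it is a sketch only. memcpy() stands in for
copy_from_user(), and struct foo's fields, deep_filter_fn,
filterable_syscall() and deny_flag0() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical argument struct, standing in for "struct foo". */
struct foo {
	unsigned int flags;
	char name[16];
};

/* Hypothetical deep filter: 0 to allow, a negative errno to deny. */
typedef int (*deep_filter_fn)(const struct foo *kbuf);

static int calls_performed;	/* how often the "syscall body" ran */

/* Stand-in for do_syscallname(): works only on the private copy. */
static int do_syscallname(const struct foo *kbuf)
{
	(void)kbuf;
	calls_performed++;
	return 0;
}

/*
 * Userspace model of the stub: snapshot the caller's buffer once,
 * filter the snapshot, then run the body on that same snapshot.
 */
static int filterable_syscall(const struct foo *ubuf, deep_filter_fn filter)
{
	struct foo kbuf;
	int ret;

	assert(filter != NULL);
	if (!ubuf)
		return -EFAULT;	/* models a faulting copy_from_user() */
	memcpy(&kbuf, ubuf, sizeof(kbuf));

	ret = filter(&kbuf);
	if (ret)
		return ret;

	return do_syscallname(&kbuf);
}

/* Example policy: refuse any request with flag bit 0 set. */
static int deny_flag0(const struct foo *kbuf)
{
	return (kbuf->flags & 1u) ? -EPERM : 0;
}
```

Because the body never re-reads the caller's buffer, changing it after
the filter ran cannot change what the syscall acts on.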


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-19 12:32 ` Andy Lutomirski
@ 2019-07-20  3:18   ` Kees Cook
  2019-08-14 17:54     ` Andy Lutomirski
  0 siblings, 1 reply; 11+ messages in thread
From: Kees Cook @ 2019-07-20  3:18 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: ksummit

On Fri, Jul 19, 2019 at 05:32:59AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 19, 2019 at 2:35 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > In light of all this, I would argue that we should seriously look into
> > extending seccomp to allow filtering on pointer arguments.

I would be all for this. :) I've struggled for a long while trying to
find a sane design for this.

> I won't be at LPC this year, but I was thinking about this anyway.  I
> have the following suggestion that might be a bit unorthodox: have
> syscalls opt into this filtering.  Specifically, a syscall that
> supports pointer filtering would be refactored the way a bunch of our
> syscalls are already refactored.  The baseline situation is:
> 
> SYSCALL_DEFINE1(syscallname, struct foo __user *, buf) { ... }
> 
> Instead, we would do:
> 
> SYSCALL_FILTERABLE(syscallname, struct foo __user *, buf)
> {
>   int ret;
>   struct foo kbuf;
> 
>   /* copy_from_user() returns the number of bytes NOT copied */
>   if (copy_from_user(&kbuf, buf, sizeof(kbuf)))
>     return -EFAULT;
> 
>   ret = seccomp_deep_filter(syscallname, 0, &kbuf);
>   if (ret)
>     return ret;
> 
>   return do_syscallname(&kbuf);
> }
> 
> In principle, if we know we're doing a FILTERABLE syscall, we could
> skip the initial seccomp invocation and just defer it until
> seccomp_deep_filter(), although this might interact badly with any
> SECCOMP_RET_TRACE handlers that change nr.

I don't like splitting the logic on seccomp invocation (we end up needing
to solve ordering issues maybe again), but I do like this explicit
opt-in feature. How you have it does make the "where do we store a cached
copy?" problem go away, too.

With a solution looming, now my mind turns to "how do we write filters
that check argument data?" Can this be done sanely with cBPF or are we
finally at the point of requiring eBPF?

The placement of the seccomp hook looks rather like an LSM, which gets
me back to earlier LSM hooking designs I'd considered:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=seccomp/lsm&id=10c1e4d2b51ad61ad516fa44c2e007f3f5f6edfb
Which also didn't solve the split-location of seccomp rules and wasn't
creating a dynamic way to do, say, string matching.

> To make this robust, it might help a lot if the generation of these
> stubs was mostly automated.

Agreed.

-- 
Kees Cook


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-19  9:35 [Ksummit-discuss] [TECH TOPIC] seccomp Christian Brauner
  2019-07-19 12:32 ` Andy Lutomirski
@ 2019-07-20  7:23 ` James Morris
  2019-07-20  7:41   ` Christian Brauner
  1 sibling, 1 reply; 11+ messages in thread
From: James Morris @ 2019-07-20  7:23 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mic, ksummit-discuss

On Fri, 19 Jul 2019, Christian Brauner wrote:

> There is a close connection between 1. and 2. When a watcher intercepts
> a syscall from a watchee and starts to inspect its arguments it can -
> depending on the syscall rather often actually - determine whether or
> not the syscall would succeed or fail. If it knows that the syscall will
> succeed it currently still has to perform it in lieu of the watchee
> since there is no way to tell the kernel to "resume" or actually perform
> the syscall. It would be nice if we could discuss approaches to enabling
> this feature as well.

Landlock is exploring userspace access control via the seccomp 
syscall with ebpf, but from within the same process:

https://landlock.io/

It may be worth investigating whether Landlock could be extended to a 
split watcher/watchee model.


-- 
James Morris
<jmorris@namei.org>


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-20  7:23 ` James Morris
@ 2019-07-20  7:41   ` Christian Brauner
  2019-07-25 14:18     ` Serge E. Hallyn
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2019-07-20  7:41 UTC (permalink / raw)
  To: James Morris; +Cc: mic, ksummit-discuss

On July 20, 2019 9:23:33 AM GMT+02:00, James Morris <jmorris@namei.org> wrote:
>On Fri, 19 Jul 2019, Christian Brauner wrote:
>
>> There is a close connection between 1. and 2. When a watcher
>intercepts
>> a syscall from a watchee and starts to inspect its arguments it can -
>> depending on the syscall rather often actually - determine whether or
>> not the syscall would succeed or fail. If it knows that the syscall
>will
>> succeed it currently still has to perform it in lieu of the watchee
>> since there is no way to tell the kernel to "resume" or actually
>perform
>> the syscall. It would be nice if we could discuss approaches to
>enabling
>> this feature as well.
>
>Landlock is exploring userspace access control via the seccomp 
>syscall with ebpf, but from within the same process:
>
>https://landlock.io/
>
>It may be worth investigating whether Landlock could be extended to a 
>split watcher/watchee model.

Certainly a valid point but...
I don't want to rely on landlock for this.
First, no one knows if and when it will ever land.
Second, seccomp is the go-to sandboxing solution for a lot of userspace already.
Often used without a full LSM.
Third, syscall interception to me is seccomp territory. :)
That's to say I'd like seccomp to have this feature *natively* and ideally not tied to
a complete LSM that needs to be merged for this. :)

Christian


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-20  7:41   ` Christian Brauner
@ 2019-07-25 14:18     ` Serge E. Hallyn
  0 siblings, 0 replies; 11+ messages in thread
From: Serge E. Hallyn @ 2019-07-25 14:18 UTC (permalink / raw)
  To: Christian Brauner; +Cc: mic, ksummit-discuss

On Sat, Jul 20, 2019 at 09:41:11AM +0200, Christian Brauner wrote:
> On July 20, 2019 9:23:33 AM GMT+02:00, James Morris <jmorris@namei.org> wrote:
> >On Fri, 19 Jul 2019, Christian Brauner wrote:
> >
> >> There is a close connection between 1. and 2. When a watcher
> >intercepts
> >> a syscall from a watchee and starts to inspect its arguments it can -
> >> depending on the syscall rather often actually - determine whether or
> >> not the syscall would succeed or fail. If it knows that the syscall
> >will
> >> succeed it currently still has to perform it in lieu of the watchee
> >> since there is no way to tell the kernel to "resume" or actually
> >perform
> >> the syscall. It would be nice if we could discuss approaches to
> >enabling
> >> this feature as well.
> >
> >Landlock is exploring userspace access control via the seccomp 
> >syscall with ebpf, but from within the same process:
> >
> >https://landlock.io/
> >
> >It may be worth investigating whether Landlock could be extended to a 
> >split watcher/watchee model.
> 
> Certainly a valid point but...
> I don't want to rely on landlock for this.
> First, no one knows if and when it will ever land.
> Second, seccomp is the go-to sandboxing solution for a lot of userspace already.
> Often used without a full LSM.
> Third, syscall interception to me is seccomp territory. :)
> That's to say I'd like seccomp to have this feature *natively* and ideally not tied to
> a complete LSM that needs to be merged for this. :)

Sounds all the more like discussion is warranted :)


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-07-20  3:18   ` Kees Cook
@ 2019-08-14 17:54     ` Andy Lutomirski
  2019-08-15 17:48       ` Kees Cook
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2019-08-14 17:54 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit, Andy Lutomirski

On Fri, Jul 19, 2019 at 8:18 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Jul 19, 2019 at 05:32:59AM -0700, Andy Lutomirski wrote:
> > On Fri, Jul 19, 2019 at 2:35 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > In light of all this, I would argue that we should seriously look into
> > > extending seccomp to allow filtering on pointer arguments.
>
> I would be all for this. :) I've struggled for a long while trying to
> find a sane design for this.
>
> > I won't be at LPC this year, but I was thinking about this anyway.  I
> > have the following suggestion that might be a bit unorthodox: have
> > syscalls opt into this filtering.  Specifically, a syscall that
> > supports pointer filtering would be refactored the way a bunch of our
> > syscalls are already refactored.  The baseline situation is:
> >
> > SYSCALL_DEFINE1(syscallname, struct foo __user *, buf) { ... }
> >
> > Instead, we would do:
> >
> > SYSCALL_FILTERABLE(syscallname, struct foo __user *, buf)
> > {
> >   int ret;
> >   struct foo kbuf;
> >
> >   /* copy_from_user() returns the number of bytes NOT copied */
> >   if (copy_from_user(&kbuf, buf, sizeof(kbuf)))
> >     return -EFAULT;
> >
> >   ret = seccomp_deep_filter(syscallname, 0, &kbuf);
> >   if (ret)
> >     return ret;
> >
> >   return do_syscallname(&kbuf);
> > }
> >
> > In principle, if we know we're doing a FILTERABLE syscall, we could
> > skip the initial seccomp invocation and just defer it until
> > seccomp_deep_filter(), although this might interact badly with any
> > SECCOMP_RET_TRACE handlers that change nr.
>
> I don't like splitting the logic on seccomp invocation (we end up needing
> to solve ordering issues maybe again), but I do like this explicit
> opt-in feature. How you have it does make the "where do we store a cached
> copy?" problem go away, too.

After thinking about this a bit more, I think that deferring the main
seccomp filter invocation until arguments have been read is too
problematic.  It has the ordering issues you're thinking of, but it
also has unpleasant effects if one of the reads faults or if
SECCOMP_RET_TRACE or SECCOMP_RET_TRAP is used.  I'm thinking that this
type of deeper inspection filter should just be a totally separate
layer.  Once the main seccomp logic decides that a filterable syscall
will be issued then, assuming that no -EFAULT happens, a totally
different program should get run with access to arguments.  And there
should be a way for the main program to know that the syscall nr in
question is filterable on the running kernel.
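
Roughly, as a userspace model (a sketch under assumptions, not a
proposed kernel interface): the main filter sees only the syscall nr, a
lookup answers whether that nr is filterable, and a separate argument
filter runs afterwards on the copied arguments. The table contents and
the names demo_table, syscall_is_filterable() and dispatch() are all
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum stage1_verdict { S1_DENY, S1_ALLOW };

/* Hypothetical per-syscall entry: is nr filterable on this "kernel"? */
struct demo_syscall {
	int nr;
	int filterable;
};

static const struct demo_syscall demo_table[] = {
	{ 1, 0 },	/* plain syscall, register arguments only */
	{ 2, 1 },	/* opted in to deep argument filtering */
};

/* Lets the author of the main filter ask whether nr is filterable. */
static int syscall_is_filterable(int nr)
{
	size_t i;

	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
		if (demo_table[i].nr == nr)
			return demo_table[i].filterable;
	return 0;
}

typedef enum stage1_verdict (*main_filter_fn)(int nr);
typedef int (*arg_filter_fn)(int nr, const void *kargs);

/*
 * Two separate layers: the main filter decides on nr alone. The
 * argument filter runs only for filterable syscalls and only if the
 * argument copy did not fault (modelled here by kargs != NULL).
 */
static int dispatch(int nr, const void *kargs,
		    main_filter_fn main_filter, arg_filter_fn arg_filter)
{
	assert(main_filter != NULL && arg_filter != NULL);
	if (main_filter(nr) == S1_DENY)
		return -EPERM;
	if (!syscall_is_filterable(nr))
		return 0;
	if (!kargs)
		return -EFAULT;
	return arg_filter(nr, kargs);
}

/* Example layer-one policy: allow every syscall nr. */
static enum stage1_verdict allow_all(int nr)
{
	(void)nr;
	return S1_ALLOW;
}

/* Example layer-two policy: the argument is an int, reject odd values. */
static int reject_odd(int nr, const void *kargs)
{
	(void)nr;
	return (*(const int *)kargs & 1) ? -EACCES : 0;
}
```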

Does that make sense?


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-08-14 17:54     ` Andy Lutomirski
@ 2019-08-15 17:48       ` Kees Cook
  2019-08-15 18:26         ` Andy Lutomirski
  0 siblings, 1 reply; 11+ messages in thread
From: Kees Cook @ 2019-08-15 17:48 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: ksummit

On Wed, Aug 14, 2019 at 10:54:49AM -0700, Andy Lutomirski wrote:
> After thinking about this a bit more, I think that deferring the main
> seccomp filter invocation until arguments have been read is too
> problematic.  It has the ordering issues you're thinking of, but it
> also has unpleasant effects if one of the reads faults or if
> SECCOMP_RET_TRACE or SECCOMP_RET_TRAP is used.  I'm thinking that this

Right, I was actually thinking of the trace/trap as being the race.

> type of deeper inspection filter should just be a totally separate
> layer.  Once the main seccomp logic decides that a filterable syscall
> will be issued then, assuming that no -EFAULT happens, a totally
> different program should get run with access to arguments.  And there
> should be a way for the main program to know that the syscall nr in
> question is filterable on the running kernel.

Right -- this is how I designed the original prototype: it was
effectively an LSM that was triggered by seccomp (since LSMs don't know
anything about syscalls -- their hooks are more generalized). So seccomp
would set a flag to make the LSM hook pay attention.

Existing LSMs are system-owner defined, so really something like Landlock
is needed for a process-owned LSM to be defined. But I worry that LSM
hooks are still too "deep" in the kernel to have a process-oriented
filter author who is not a kernel developer make any sense of the
hooks. They're certainly oriented in a better position to gain the
intent of a filter. For example, if a filter says "you can't open(2)
/etc/foo", but it misses saying "you can't openat(2) /etc/foo", that's a
dumb exposure. The LSM hooks are positioned to say "you can't manipulate
/etc/foo through any means".
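
A toy model of that exposure, as a sketch: the syscall numbers are a
made-up enum, and paths are compared as plain strings, which ignores
path resolution entirely. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_nr { DEMO_OPEN, DEMO_OPENAT };

/* Syscall-oriented rule: the author only remembered open(2). */
static int syscall_rule_denies(enum demo_nr nr, const char *path)
{
	return nr == DEMO_OPEN && strcmp(path, "/etc/foo") == 0;
}

/* Object-oriented rule, in the spirit of an LSM hook: keyed on the file. */
static int object_rule_denies(const char *path)
{
	return strcmp(path, "/etc/foo") == 0;
}

/* Model of a file open reached through either syscall: 0 ok, -1 denied. */
static int demo_open(enum demo_nr nr, const char *path, int use_object_rule)
{
	int denied;

	assert(path != NULL);
	denied = use_object_rule ? object_rule_denies(path)
				 : syscall_rule_denies(nr, path);
	return denied ? -1 : 0;
}
```

With the syscall-oriented rule, demo_open(DEMO_OPENAT, "/etc/foo", 0)
succeeds; with the object-oriented rule both entry points are denied.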

So, I'm not entirely sure. It needs a clear design that chooses and
justifies the appropriate "depth" of filtering. And FWIW, the two most
frequent examples of argument parsing requests have been path-based
checking and network address checking. So any prototype needs to handle
these two cases sanely...

-- 
Kees Cook


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-08-15 17:48       ` Kees Cook
@ 2019-08-15 18:26         ` Andy Lutomirski
  2019-08-15 18:31           ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2019-08-15 18:26 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit, Andy Lutomirski

On Thu, Aug 15, 2019 at 10:48 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Aug 14, 2019 at 10:54:49AM -0700, Andy Lutomirski wrote:
> > After thinking about this a bit more, I think that deferring the main
> > seccomp filter invocation until arguments have been read is too
> > problematic.  It has the ordering issues you're thinking of, but it
> > also has unpleasant effects if one of the reads faults or if
> > SECCOMP_RET_TRACE or SECCOMP_RET_TRAP is used.  I'm thinking that this
>
> Right, I was actually thinking of the trace/trap as being the race.
>
> > type of deeper inspection filter should just be a totally separate
> > layer.  Once the main seccomp logic decides that a filterable syscall
> > will be issued then, assuming that no -EFAULT happens, a totally
> > different program should get run with access to arguments.  And there
> > should be a way for the main program to know that the syscall nr in
> > question is filterable on the running kernel.
>
> Right -- this is how I designed the original prototype: it was
> effectively an LSM that was triggered by seccomp (since LSMs don't know
> anything about syscalls -- their hooks are more generalized). So seccomp
> would set a flag to make the LSM hook pay attention.
>
> Existing LSMs are system-owner defined, so really something like Landlock
> is needed for a process-owned LSM to be defined. But I worry that LSM
> hooks are still too "deep" in the kernel to have a process-oriented
> filter author who is not a kernel developer make any sense of the
> hooks. They're certainly oriented in a better position to gain the
> intent of a filter. For example, if a filter says "you can't open(2)
> /etc/foo", but it misses saying "you can't openat(2) /etc/foo", that's a
> dumb exposure. The LSM hooks are positioned to say "you can't manipulate
> /etc/foo through any means".
>
> So, I'm not entirely sure. It needs a clear design that chooses and
> justifies the appropriate "depth" of filtering. And FWIW, the two most
> frequent examples of argument parsing requests have been path-based
> checking and network address checking. So any prototype needs to handle
> these two cases sanely...
>

But also clone() flag filtering, and new clone() proposals keep
wanting to add structs.  And filtering bpf().  /me runs.

But yes, doing this LSM-style could also make sense.


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-08-15 18:26         ` Andy Lutomirski
@ 2019-08-15 18:31           ` Christian Brauner
  2019-08-15 19:21             ` Andy Lutomirski
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2019-08-15 18:31 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: ksummit

On Thu, Aug 15, 2019 at 11:26:10AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2019 at 10:48 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Wed, Aug 14, 2019 at 10:54:49AM -0700, Andy Lutomirski wrote:
> > > After thinking about this a bit more, I think that deferring the main
> > > seccomp filter invocation until arguments have been read is too
> > > problematic.  It has the ordering issues you're thinking of, but it
> > > also has unpleasant effects if one of the reads faults or if
> > > SECCOMP_RET_TRACE or SECCOMP_RET_TRAP is used.  I'm thinking that this
> >
> > Right, I was actually thinking of the trace/trap as being the race.
> >
> > > type of deeper inspection filter should just be a totally separate
> > > layer.  Once the main seccomp logic decides that a filterable syscall
> > > will be issued then, assuming that no -EFAULT happens, a totally
> > > different program should get run with access to arguments.  And there
> > > should be a way for the main program to know that the syscall nr in
> > > question is filterable on the running kernel.
> >
> > Right -- this is how I designed the original prototype: it was
> > effectively an LSM that was triggered by seccomp (since LSMs don't know
> > anything about syscalls -- their hooks are more generalized). So seccomp
> > would set a flag to make the LSM hook pay attention.
> >
> > Existing LSMs are system-owner defined, so really something like Landlock
> > is needed for a process-owned LSM to be defined. But I worry that LSM
> > hooks are still too "deep" in the kernel to have a process-oriented
> > filter author who is not a kernel developer make any sense of the
> > hooks. They're certainly oriented in a better position to gain the
> > intent of a filter. For example, if a filter says "you can't open(2)
> > /etc/foo", but it misses saying "you can't openat(2) /etc/foo", that's a
> > dumb exposure. The LSM hooks are positioned to say "you can't manipulate
> > /etc/foo through any means".
> >
> > So, I'm not entirely sure. It needs a clear design that chooses and
> > justifies the appropriate "depth" of filtering. And FWIW, the two most
> > frequent examples of argument parsing requests have been path-based
> > checking and network address checking. So any prototype needs to handle
> > these two cases sanely...
> >
> 
> But also clone() flag filtering, and new clone() proposals keep
> wanting to add structs.  And filtering bpf().  /me runs.

Yeah, I've mentioned clone3() in my initial mail. And it is not a
proposal anymore it's in mainline since the 5.3 merge window. So the
evil has been done. /me (sorry-not-sorry) ducks :)


* Re: [Ksummit-discuss] [TECH TOPIC] seccomp
  2019-08-15 18:31           ` Christian Brauner
@ 2019-08-15 19:21             ` Andy Lutomirski
  0 siblings, 0 replies; 11+ messages in thread
From: Andy Lutomirski @ 2019-08-15 19:21 UTC (permalink / raw)
  To: Christian Brauner; +Cc: ksummit, Andy Lutomirski

On Thu, Aug 15, 2019 at 11:31 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Aug 15, 2019 at 11:26:10AM -0700, Andy Lutomirski wrote:
> > On Thu, Aug 15, 2019 at 10:48 AM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > On Wed, Aug 14, 2019 at 10:54:49AM -0700, Andy Lutomirski wrote:
> > > > After thinking about this a bit more, I think that deferring the main
> > > > seccomp filter invocation until arguments have been read is too
> > > > problematic.  It has the ordering issues you're thinking of, but it
> > > > also has unpleasant effects if one of the reads faults or if
> > > > SECCOMP_RET_TRACE or SECCOMP_RET_TRAP is used.  I'm thinking that this
> > >
> > > Right, I was actually thinking of the trace/trap as being the race.
> > >
> > > > type of deeper inspection filter should just be a totally separate
> > > > layer.  Once the main seccomp logic decides that a filterable syscall
> > > > will be issued then, assuming that no -EFAULT happens, a totally
> > > > different program should get run with access to arguments.  And there
> > > > should be a way for the main program to know that the syscall nr in
> > > > question is filterable on the running kernel.
> > >
> > > Right -- this is how I designed the original prototype: it was
> > > effectively an LSM that was triggered by seccomp (since LSMs don't know
> > > anything about syscalls -- their hooks are more generalized). So seccomp
> > > would set a flag to make the LSM hook pay attention.
> > >
> > > Existing LSMs are system-owner defined, so really something like Landlock
> > > is needed for a process-owned LSM to be defined. But I worry that LSM
> > > hooks are still too "deep" in the kernel to have a process-oriented
> > > filter author who is not a kernel developer make any sense of the
> > > hooks. They're certainly oriented in a better position to gain the
> > > intent of a filter. For example, if a filter says "you can't open(2)
> > > /etc/foo", but it misses saying "you can't openat(2) /etc/foo", that's a
> > > dumb exposure. The LSM hooks are positioned to say "you can't manipulate
> > > /etc/foo through any means".
> > >
> > > So, I'm not entirely sure. It needs a clear design that chooses and
> > > justifies the appropriate "depth" of filtering. And FWIW, the two most
> > > frequent examples of argument parsing requests have been path-based
> > > checking and network address checking. So any prototype needs to handle
> > > these two cases sanely...
> > >
> >
> > But also clone() flag filtering, and new clone() proposals keep
> > wanting to add structs.  And filtering bpf().  /me runs.
>
> Yeah, I've mentioned clone3() in my initial mail. And it is not a
> proposal anymore it's in mainline since the 5.3 merge window. So the
> evil has been done. /me (sorry-not-sorry) ducks :)

/me throws something squishy

So I guess we want some way for a seccomp filter to see clone3() being
called and determine that it or a related filter will be invoked again
with the arguments read before clone3() actually does anything.  Doing
this with Landlock would involve poking quite a few places to add a
syscall, whereas my FILTERABLE thing would do it more simply.

These approaches aren't necessarily mutually exclusive.  Maybe some
flags could be passed to the main seccomp filter so that it could
determine things like:

- This syscall is FILTERABLE and (optionally) these args will be filtered.
- Landlock will be called for filesystem access and the following
hooks are enabled.
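
One possible encoding of such flags, purely as a sketch: nothing like
this exists in seccomp today, and every name, bit position and helper
below is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* All of these bits are hypothetical. */
#define DEMO_HINT_FILTERABLE	(1u << 0)	/* deep filter runs later */
#define DEMO_HINT_LSM_FS	(1u << 1)	/* filesystem hooks enabled */
#define DEMO_HINT_ARG_SHIFT	8		/* bits 8..13: one per argument */

/* Build the hint word the main filter would be handed. */
static uint32_t demo_make_hint(int filterable, int lsm_fs,
			       unsigned int arg_mask)
{
	uint32_t hint = 0;

	assert(arg_mask < (1u << 6));	/* a syscall has at most six args */
	if (filterable)
		hint |= DEMO_HINT_FILTERABLE |
			(arg_mask << DEMO_HINT_ARG_SHIFT);
	if (lsm_fs)
		hint |= DEMO_HINT_LSM_FS;
	return hint;
}

/* Will argument number arg (0..5) be seen by the deep filter? */
static int demo_arg_will_be_filtered(uint32_t hint, unsigned int arg)
{
	if (!(hint & DEMO_HINT_FILTERABLE) || arg >= 6)
		return 0;
	return (hint >> (DEMO_HINT_ARG_SHIFT + arg)) & 1u;
}
```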

The idea is that we want the ability to make additional syscalls be
FILTERABLE and/or to add new seccompable LSM hooks in new kernels.

Doing this in a way that has an acceptably low risk of accidentally
opening security holes when LSM hooks change will require quite a bit
of care.
