Ksummit-Discuss Archive on lore.kernel.org
 help / color / Atom feed
* [Ksummit-discuss] [TECH TOPIC] seccomp feature development
@ 2020-05-20 16:17 Kees Cook
  2020-05-20 16:31 ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: Kees Cook @ 2020-05-20 16:17 UTC (permalink / raw)
  To: ksummit-discuss

As recently outlined[1], there are are a number of seccomp topics that
need discussion:

 - fd passing
 - deep argument inspection
 - changing structure sizes
 - syscall bitmasks

Specifically, seccomp needs to grow the ability to inspect Extensible
Argument syscalls, which requires that it inspect userspace memory
without Time-of-Check/Time-of-Use races and without double-copying.
Additionally, since the structures can grow and be nested, there needs
to be a way to deal with flattening the arguments into a linear buffer
that can be examined by seccomp's BPF dialect. All of this also needs to
be handled by the USER_NOTIF implementation. Finally, fd passing needs
to be finished, and there needs to be an exploration of syscall bitmasks
to augment the existing filters to gain back some performance.

-Kees

(This has been submitted to the LPC site as well[2].)

[1] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/
[2] https://linuxplumbersconf.org/event/7/abstracts/596/

-- 
Kees Cook
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 16:17 [Ksummit-discuss] [TECH TOPIC] seccomp feature development Kees Cook
@ 2020-05-20 16:31 ` Al Viro
  2020-05-20 18:05   ` Kees Cook
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2020-05-20 16:31 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit-discuss

On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote:
> As recently outlined[1], there are are a number of seccomp topics that
> need discussion:
> 
>  - fd passing
>  - deep argument inspection
>  - changing structure sizes
>  - syscall bitmasks
> 
> Specifically, seccomp needs to grow the ability to inspect Extensible
> Argument syscalls, which requires that it inspect userspace memory
> without Time-of-Check/Time-of-Use races and without double-copying.
> Additionally, since the structures can grow and be nested, there needs
> to be a way to

... catch and kill the bullshit ABI "enhancements" that would attempt that
kind of garbage.  I'm not sure why that is seccomp-related, though...
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 16:31 ` Al Viro
@ 2020-05-20 18:05   ` Kees Cook
  2020-05-20 18:16     ` Al Viro
  2020-05-20 18:27     ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Kees Cook @ 2020-05-20 18:05 UTC (permalink / raw)
  To: Al Viro; +Cc: ksummit-discuss

On Wed, May 20, 2020 at 05:31:02PM +0100, Al Viro wrote:
> On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote:
> > As recently outlined[1], there are are a number of seccomp topics that
> > need discussion:
> > 
> >  - fd passing
> >  - deep argument inspection
> >  - changing structure sizes
> >  - syscall bitmasks
> > 
> > Specifically, seccomp needs to grow the ability to inspect Extensible
> > Argument syscalls, which requires that it inspect userspace memory
> > without Time-of-Check/Time-of-Use races and without double-copying.
> > Additionally, since the structures can grow and be nested, there needs
> > to be a way to
> 
> ... catch and kill the bullshit ABI "enhancements" that would attempt that
> kind of garbage.  I'm not sure why that is seccomp-related, though...

We already have structs passed to syscalls that contain pointers to yet
more structs. Do you mean those should be disallowed? (Personally, I
would love that, but this doesn't seem to match the reality of the
design considerations of those syscalls.)

-- 
Kees Cook
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 18:05   ` Kees Cook
@ 2020-05-20 18:16     ` Al Viro
  2020-05-20 18:27     ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Al Viro @ 2020-05-20 18:16 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit-discuss

On Wed, May 20, 2020 at 11:05:58AM -0700, Kees Cook wrote:
> On Wed, May 20, 2020 at 05:31:02PM +0100, Al Viro wrote:
> > On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote:
> > > As recently outlined[1], there are are a number of seccomp topics that
> > > need discussion:
> > > 
> > >  - fd passing
> > >  - deep argument inspection
> > >  - changing structure sizes
> > >  - syscall bitmasks
> > > 
> > > Specifically, seccomp needs to grow the ability to inspect Extensible
> > > Argument syscalls, which requires that it inspect userspace memory
> > > without Time-of-Check/Time-of-Use races and without double-copying.
> > > Additionally, since the structures can grow and be nested, there needs
> > > to be a way to
> > 
> > ... catch and kill the bullshit ABI "enhancements" that would attempt that
> > kind of garbage.  I'm not sure why that is seccomp-related, though...
> 
> We already have structs passed to syscalls that contain pointers to yet
> more structs. Do you mean those should be disallowed? (Personally, I
> would love that, but this doesn't seem to match the reality of the
> design considerations of those syscalls.)

We have no chance to kill the existing ones off, but we certainly can stop
accepting new ones.  I would make an exception for struct iovec arrays
passed as an argument, provided that the data they refer to is opaque,
and while pselect6() kind of kludges are bloody revolting, they might
be inevitable in some cases - not without a very good explanation from
their authors, though, and "I wanna have 18 arguments; whaddya mean, don't?!"
is not one.
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 18:05   ` Kees Cook
  2020-05-20 18:16     ` Al Viro
@ 2020-05-20 18:27     ` Linus Torvalds
  2020-05-20 19:04       ` Kees Cook
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-05-20 18:27 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit

On Wed, May 20, 2020 at 11:06 AM Kees Cook <keescook@chromium.org> wrote:
> We already have structs passed to syscalls that contain pointers to yet
> more structs.

Give real examples of where this matters for security, please, and
where somebody would want to control this.

Yes, we have things like clone3() that pass a struct with pointers to
user space (eg the wait location etc).

Yes, we have tons of things like ioctl's that have random struct
pointer arguments with random contents.

Yes, we have iovec's etc that have arrays of pointers to user space.

But no, none of these seem to be things that seccomp should care about.

So I am not in the least interested in some kind of general discussion
about system calls with "pointers to pointers". They exist. Deal with
it. It's not in the least an interesting issue, and no, we shouldn't
make seccomp and friends incredibly more complicated for it.

If you want to do sandboxing, you disallow those things entirely if
you don't trust them, or make the case-by-case argument for why they
don't matter.

If you want to do something fancier (special compat emulation using
seccomp and a ptrace fallback? I dunno) you are going to just have to
deal with it. It's not simple, but it's not the kernels problem. You
may have to emulate it *entirely* in the ptracer (ie instead of "check
the arguments and let it continue" you _actually_ emulate it to avoid
any races)

And if you have some actual and imminent real security issue, you
mention _that_ and explain _that_, and accept that maybe you need to
do that expensive emulation (because the kernel people just don't care
about your private hang-ups) or you need to explain why it's a real
issue and why the kernel should help with your odd special case.

Don't make this some kind of abstract conceptual problem thing.
Because it's not.

Some computer scientists think that everythinig should be really
generic and solutions that solve some problem for every possible case
are the only good solutions.

But those people are wrong. The thing that _really_ matters is the
details. Not the nebulous theory.

So details, please.

             Linus
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 18:27     ` Linus Torvalds
@ 2020-05-20 19:04       ` Kees Cook
  2020-05-20 19:08         ` Linus Torvalds
  2020-05-20 22:12         ` Alexei Starovoitov
  0 siblings, 2 replies; 16+ messages in thread
From: Kees Cook @ 2020-05-20 19:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: ksummit

On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote:
> Don't make this some kind of abstract conceptual problem thing.
> Because it's not.

I have no intention of making this abstract (the requests for expanding
seccomp coverage have been for only a select class of syscalls, and
specifically clone3 and openat2) nor more complicated than it needs to be
(I regularly resist expanding the seccomp BPF dialect into eBPF).

> So details, please.

We've been discussing it all here:
https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/

The example given in the thread was dealing with things like clone3's
struct clone_args's set_tid member, which is a pointer to a dynamically
sized array.

Things seccomp is NOT expected to introspect due to complexity would be
stuff like the bpf() syscall.

Perhaps the question is "how deeply does seccomp need to inspect?"
and maybe it does not get to see anything beyond just the "top level"
struct (i.e. struct clone_args) and all pointers within THAT become
opaque? That certainly simplifies the design.

-- 
Kees Cook
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 19:04       ` Kees Cook
@ 2020-05-20 19:08         ` Linus Torvalds
  2020-05-20 20:24           ` Christian Brauner
  2020-05-20 22:12         ` Alexei Starovoitov
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-05-20 19:08 UTC (permalink / raw)
  To: Kees Cook; +Cc: ksummit

On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
>
> Things seccomp is NOT expected to introspect due to complexity would be
> stuff like the bpf() syscall.

Right.

I don't dispute at all that those kinds of pointer-to-pointer things
exist all over.

But:

> Perhaps the question is "how deeply does seccomp need to inspect?"
> and maybe it does not get to see anything beyond just the "top level"
> struct (i.e. struct clone_args) and all pointers within THAT become
> opaque? That certainly simplifies the design.

Exactly. I think that's the most common situation by far. Does anybody
really really need to care at a deep level, and why?

              Linus
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 19:08         ` Linus Torvalds
@ 2020-05-20 20:24           ` Christian Brauner
  2020-05-20 20:52             ` Kees Cook
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2020-05-20 20:24 UTC (permalink / raw)
  To: Linus Torvalds, Kees Cook; +Cc: ksummit

On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > Things seccomp is NOT expected to introspect due to complexity would be
> > stuff like the bpf() syscall.
> 
> Right.
> 
> I don't dispute at all that those kinds of pointer-to-pointer things
> exist all over.
> 
> But:
> 
> > Perhaps the question is "how deeply does seccomp need to inspect?"
> > and maybe it does not get to see anything beyond just the "top level"
> > struct (i.e. struct clone_args) and all pointers within THAT become
> > opaque? That certainly simplifies the design.
> 
> Exactly. I think that's the most common situation by far. Does anybody
> really really need to care at a deep level, and why?


We mostly don't and making all second-level pointers opaque is ok imho.
First, I don't think we need to really nest structs. (We have netlink
for that.)
Second, features for such syscall that require other pointers can and
usually will be placed under a flag in the first-level struct. If that's
filterable you have the option to turn that of based on the flag. As
long as the flag identifies one feature and not one feature that can
have other features it's no different from filtering a simple flag
anyway. Even for clone3() it only has one feature that has a pointer in
the struct and that's for checkpoint/restore and lets them select a
specific pid and it comes with a size argument that is capped by the
maximum nesting depth of pid namespaces in the kernel. So if you see
that the size argument is not 0 in the first level struct you can
disallow that too same as if it were placed under the flag. So no
second-level nesting required. Probably the first level pointer is
alreay making some people vomit but it's useful and for some syscalls
almost cannot be avoided.

But I think that we need some documented consensus on all that stuff
which I stressed in other mails before. I'll hand something in about
this, if that's ok than we can hash this out.

Christian

_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 20:24           ` Christian Brauner
@ 2020-05-20 20:52             ` Kees Cook
  2020-05-20 21:02               ` Christian Brauner
  2020-05-22  4:06               ` Aleksa Sarai
  0 siblings, 2 replies; 16+ messages in thread
From: Kees Cook @ 2020-05-20 20:52 UTC (permalink / raw)
  To: Aleksa Sarai, Christian Brauner; +Cc: ksummit

On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote:
> On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> > > Perhaps the question is "how deeply does seccomp need to inspect?"
> > > and maybe it does not get to see anything beyond just the "top level"
> > > struct (i.e. struct clone_args) and all pointers within THAT become
> > > opaque? That certainly simplifies the design.
> > 
> > Exactly. I think that's the most common situation by far. Does anybody
> > really really need to care at a deep level, and why?
> 
> We mostly don't and making all second-level pointers opaque is ok imho.

That'll make things MUCH easier. :)

> But I think that we need some documented consensus on all that stuff
> which I stressed in other mails before. I'll hand something in about
> this, if that's ok than we can hash this out.

Aleksa, I know you had an entire presentation[1] on the extensible
argument syscalls, but was there any text-based design doc that you made?

It would be really nice to update Documentation/process/adding-syscalls.rst
with the specifics[2], and to (now) include the "no nested flags"
requirement. What do you think?

-Kees

[1] https://github.com/cyphar/talks/tree/master/2020/01-linux-conf-au/syscall-extensions
    https://www.youtube.com/watch?v=ggD-eb3yPVs
[2] https://www.kernel.org/doc/html/latest/process/adding-syscalls.html?highlight=syscall#designing-the-api-planning-for-extension

-- 
Kees Cook
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 20:52             ` Kees Cook
@ 2020-05-20 21:02               ` Christian Brauner
  2020-05-22  4:06               ` Aleksa Sarai
  1 sibling, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2020-05-20 21:02 UTC (permalink / raw)
  To: Kees Cook; +Cc: Aleksa Sarai, ksummit

On Wed, May 20, 2020 at 01:52:30PM -0700, Kees Cook wrote:
> On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote:
> > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> > > > Perhaps the question is "how deeply does seccomp need to inspect?"
> > > > and maybe it does not get to see anything beyond just the "top level"
> > > > struct (i.e. struct clone_args) and all pointers within THAT become
> > > > opaque? That certainly simplifies the design.
> > > 
> > > Exactly. I think that's the most common situation by far. Does anybody
> > > really really need to care at a deep level, and why?
> > 
> > We mostly don't and making all second-level pointers opaque is ok imho.
> 
> That'll make things MUCH easier. :)
> 
> > But I think that we need some documented consensus on all that stuff
> > which I stressed in other mails before. I'll hand something in about
> > this, if that's ok than we can hash this out.
> 
> Aleksa, I know you had an entire presentation[1] on the extensible
> argument syscalls, but was there any text-based design doc that you made?
> 
> It would be really nice to update Documentation/process/adding-syscalls.rst
> with the specifics[2], and to (now) include the "no nested flags"
> requirement. What do you think?

I've sent out:

https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/

a while back which we drafted together specifically to documented
extensible syscalls. I wanted to resend it this week. I'll update it
with all register based flag arguments should be unsigned int which I
already droned about this week in another thread. The no nested flags
requirement can go in there as well.
Aleksa and I also had a joint session for Plumbers/Ksummit planned hence
the "handing something in".

Christian
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 19:04       ` Kees Cook
  2020-05-20 19:08         ` Linus Torvalds
@ 2020-05-20 22:12         ` Alexei Starovoitov
  2020-05-20 23:39           ` Kees Cook
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-05-20 22:12 UTC (permalink / raw)
  To: Kees Cook; +Cc: bpf, ksummit

On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote:
> On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote:
> > Don't make this some kind of abstract conceptual problem thing.
> > Because it's not.
> 
> I have no intention of making this abstract (the requests for expanding
> seccomp coverage have been for only a select class of syscalls, and
> specifically clone3 and openat2) nor more complicated than it needs to be
> (I regularly resist expanding the seccomp BPF dialect into eBPF).

Kees, since you've forked the thread I'm adding bpf mailing list back and
re-iterating my point:
** Nack to cBPF extensions **
How that is relevant?
You're proposing to add copy_from_user() to selected syscalls, like clone3,
and present large __u32 array to cBPF program.
In other words existing fixed sized 'struct seccomp_data' will become
either variable length or jumbo fixed size like one page.
In the fomer case it would mean that cBPF would need to be extended
with variable length logic. Which in turn means it will suffer from
spectre v1 issues.
We've spent a lot of time fixing spectre v1 issues with eBPF. Including
teaching the verifier to recognize speculative patterns inside the programs
so that malicious bpf progs trying to exploit spec v1 will be caught
at load time. There is no other tool (compiler or static analysis) that
can do similar analysis. I suggest that you look into what eBPF
is actually doing instead of trying to reinvent the wheel.
If you go with latter approach of presenting cBPF with giant
'struct seccomp_data + page' that extra page would need to be zeroed out
before invocation of bpf program which will make seccomp even less usable
that it is today. Currently it's slow and unusable in production datacenter.
People suggested for years to adopt eBPF in seccomp to accelerate it,
but, as you confessed, you resisted and sounds like now you want to
implement seccomp specific syscall bitmask?
Which means more kernel code, more bugs, more security issues.
imo that's another reinvented wheel when eBPF can do it already. I don't think
it's a good idea to add kernel code when eBPF-based solution exists and capable
of examining any level of nested args.

> Perhaps the question is "how deeply does seccomp need to inspect?"
> and maybe it does not get to see anything beyond just the "top level"
> struct (i.e. struct clone_args) and all pointers within THAT become
> opaque? That certainly simplifies the design.

clone3's 'struct clone_args' has set_tid pointer as a second level.
I don't think that sticking to first level of pointers for this particular
syscall will make seccomp filtering any more practical.
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 22:12         ` Alexei Starovoitov
@ 2020-05-20 23:39           ` Kees Cook
  2020-05-21  0:43             ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Kees Cook @ 2020-05-20 23:39 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, ksummit

On Wed, May 20, 2020 at 03:12:56PM -0700, Alexei Starovoitov wrote:
> On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote:
> > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote:
> > > Don't make this some kind of abstract conceptual problem thing.
> > > Because it's not.
> > 
> > I have no intention of making this abstract (the requests for expanding
> > seccomp coverage have been for only a select class of syscalls, and
> > specifically clone3 and openat2) nor more complicated than it needs to be
> > (I regularly resist expanding the seccomp BPF dialect into eBPF).
> 
> Kees, since you've forked the thread I'm adding bpf mailing list back and
> re-iterating my point:
> ** Nack to cBPF extensions **

Yes, I know. I agreed[1] with you on this point.

> How that is relevant?
> You're proposing to add copy_from_user() to selected syscalls, like clone3,
> and present large __u32 array to cBPF program.
> In other words existing fixed sized 'struct seccomp_data' will become
> either variable length or jumbo fixed size like one page.
> In the fomer case it would mean that cBPF would need to be extended
> with variable length logic. Which in turn means it will suffer from
> spectre v1 issues.

I don't expect to need to do anything with variable lengths in the
seccomp BPF dialect. As I said in the other thread, if we are faced with
design trade-offs that require extending the seccomp filter language, we
would switch to eBPF.

> If you go with latter approach of presenting cBPF with giant
> 'struct seccomp_data + page' that extra page would need to be zeroed out
> before invocation of bpf program which will make seccomp even less usable
> that it is today. Currently it's slow and unusable in production datacenter.

Making universal declarations based on your opinion does not help
convince people of your position. Saying it's "unusable in production
datacenter" is perhaps true for you, but hardly true for the many
datacenters that do use it.

Additionally, we're obviously not interested in making seccomp _slower_.
The entire point of an investigation of the design is to examine our
options and find the right solution.

> People suggested for years to adopt eBPF in seccomp to accelerate it,
> but, as you confessed, you resisted and sounds like now you want to
> implement seccomp specific syscall bitmask?

Yes -- because it's an order of magnitude faster than even a single
instruction BPF seccomp filter. The vast majority of seccomp filters need
nothing more than a single yes/no, and right now the bulk of processing
time is spent running the BPF filter. I would prefer to avoid BPF
entirely where possible for seccomp.

> Which means more kernel code, more bugs, more security issues.

Right. This is a solid design principle, and one I agree with: avoid
adding code, keep things simple, everything will have bugs. And, as it
stands, seccomp has had a significantly safer history than eBPF, largely
due to its goal of staying as utterly small and simple as possible. I
don't intend to discard that stance, and it's why I would rather continue
to shield seccomp from the regularly occurring eBPF flaws.

> imo that's another reinvented wheel when eBPF can do it already. I don't think
> it's a good idea to add kernel code when eBPF-based solution exists and capable
> of examining any level of nested args.

Thanks to the neighboring thread here, the requirements no longer[2]
include nested args. Also, you're mixing bitmasks (to accelerate the
overwhelmingly common case) with the deep argument inspection (which is
a rare but needed case).

> > Perhaps the question is "how deeply does seccomp need to inspect?"
> > and maybe it does not get to see anything beyond just the "top level"
> > struct (i.e. struct clone_args) and all pointers within THAT become
> > opaque? That certainly simplifies the design.
> 
> clone3's 'struct clone_args' has set_tid pointer as a second level.
> I don't think that sticking to first level of pointers for this particular
> syscall will make seccomp filtering any more practical.

Yup, we all agree. :)

-Kees

[1] https://lore.kernel.org/lkml/202005191434.57253AD@keescook/
[2] https://lore.kernel.org/ksummit-discuss/20200520221256.tzqkjpeswv3d6ne2@ast-mbp.dhcp.thefacebook.com/T/#m01a045c8715cfff8399ba86171039110befecbcf

-- 
Kees Cook
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 23:39           ` Kees Cook
@ 2020-05-21  0:43             ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2020-05-21  0:43 UTC (permalink / raw)
  To: Kees Cook; +Cc: bpf, ksummit

On Wed, May 20, 2020 at 04:39:20PM -0700, Kees Cook wrote:
> On Wed, May 20, 2020 at 03:12:56PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote:
> > > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote:
> > > > Don't make this some kind of abstract conceptual problem thing.
> > > > Because it's not.
> > > 
> > > I have no intention of making this abstract (the requests for expanding
> > > seccomp coverage have been for only a select class of syscalls, and
> > > specifically clone3 and openat2) nor more complicated than it needs to be
> > > (I regularly resist expanding the seccomp BPF dialect into eBPF).
> > 
> > Kees, since you've forked the thread I'm adding bpf mailing list back and
> > re-iterating my point:
> > ** Nack to cBPF extensions **
> 
> Yes, I know. I agreed[1] with you on this point.
> 
> > How that is relevant?
> > You're proposing to add copy_from_user() to selected syscalls, like clone3,
> > and present large __u32 array to cBPF program.
> > In other words existing fixed sized 'struct seccomp_data' will become
> > either variable length or jumbo fixed size like one page.
> > In the fomer case it would mean that cBPF would need to be extended
> > with variable length logic. Which in turn means it will suffer from
> > spectre v1 issues.
> 
> I don't expect to need to do anything with variable lengths in the
> seccomp BPF dialect. As I said in the other thread, if we are faced with
> design trade-offs that require extending the seccomp filter language, we
> would switch to eBPF.
> 
> > If you go with latter approach of presenting cBPF with giant
> > 'struct seccomp_data + page' that extra page would need to be zeroed out
> > before invocation of bpf program which will make seccomp even less usable
> > that it is today. Currently it's slow and unusable in production datacenter.
> 
> Making universal declarations based on your opinion does not help
> convince people of your position. Saying it's "unusable in production
> datacenter" is perhaps true for you, but hardly true for the many
> datacenters that do use it.

The datacenter that went with full bypass of kernel storage and networking
subsystems where application don't do many syscalls per second any more ?
Sure. In such cases extreme seccomp overhead is irrelevant.
Just like kpti and retpoline overhead.

> Additionally, we're obviously not interested in making seccomp _slower_.
> The entire point of an investigation of the design is to examine our
> options and find the right solution.
> 
> > People suggested for years to adopt eBPF in seccomp to accelerate it,
> > but, as you confessed, you resisted and sounds like now you want to
> > implement seccomp specific syscall bitmask?
> 
> Yes -- because it's an order of magnitude faster than even a single
> instruction BPF seccomp filter. The vast majority of seccomp filters need
> nothing more than a single yes/no, and right now the bulk of processing
> time is spent running the BPF filter. I would prefer to avoid BPF
> entirely where possible for seccomp.

Are you running in interpreted mode? Otherwise above statement is nonsense or
you haven't done eBPF benchmarking in long time. JITed eBPF has exact same
speed as kernel C code. Even extreme use cases of bpf programs with single
'return 0' instruction are being optimized into minimal number of native
instructions.
So with above you're saying that giant bitmask of syscalls in C code
is faster than 'int foo() { return 0; }' in C code ? Simply absurd.
You realize that eBPF is processing tens of millions of packets per second
and folks measure the overhead in nanoseconds.
Indirect call and retpoline overhead was removed from a lot bpf use cases
in networking specifically because every nanosecond counts.

> 
> > Which means more kernel code, more bugs, more security issues.
> 
> Right. This is a solid design principle, and one I agree with: avoid
> adding code, keep things simple, everything will have bugs. And, as it
> stands, seccomp has had a significantly safer history than eBPF, largely
> due to its goal of staying as utterly small and simple as possible. I
> don't intend to discard that stance, and it's why I would rather continue
> to shield seccomp from the regularly occurring eBPF flaws.

It's subjectively safer. I argue it's enjoying smaller bug rate because
syzbots are not looking into it and it's not being actively developed.
In the year 2020 there were three verifier bugs that could have been exploited
through unpriv. All three were found by new kBdysch fuzzer. In 2019 there was
nothing. Not because people didn't try, but because syzbot reached its limit.
The pace of bpf development is accelerating. There will be more bugs found and
introduced in the verifier. Yet that doesn't stop folks to use eBPF to secure
the datacenters.
The JITs are also being developed. There are bugs in them and they affect both
cBPF and eBPF. If you're worried about that it's probably time to get rid of
cBPF from seccomp. Invent your own syscall processing thingy, fix bugs and stop
adding new features. That's the only way to reduce the bug rate. We're not
going to slow down JITs development because of seccomp.
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-20 20:52             ` Kees Cook
  2020-05-20 21:02               ` Christian Brauner
@ 2020-05-22  4:06               ` Aleksa Sarai
  2020-05-22  7:35                 ` Christian Brauner
  1 sibling, 1 reply; 16+ messages in thread
From: Aleksa Sarai @ 2020-05-22  4:06 UTC (permalink / raw)
  To: Kees Cook; +Cc: Christian Brauner, ksummit

[-- Attachment #1.1: Type: text/plain, Size: 2790 bytes --]

On 2020-05-20, Kees Cook <keescook@chromium.org> wrote:
> On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote:
> > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> > > > Perhaps the question is "how deeply does seccomp need to inspect?"
> > > > and maybe it does not get to see anything beyond just the "top level"
> > > > struct (i.e. struct clone_args) and all pointers within THAT become
> > > > opaque? That certainly simplifies the design.
> > > 
> > > Exactly. I think that's the most common situation by far. Does anybody
> > > really really need to care at a deep level, and why?
> > 
> > We mostly don't and making all second-level pointers opaque is ok imho.
> 
> That'll make things MUCH easier. :)

To be clear, my insistence on the second-level pointers topic is coming
from the view that we should make sure whatever model we use for the
first iteration of deep argument inspection can be expanded to
second-level pointers if we need them. The jump-table proposal I had was
just an example of how we could plan out a design that could be
implemented piece-meal (heck, we don't even need jump-tables in the
first iteration -- so long as we have an idea for how they'd work).

I also hasten to point out that if we make all second-level pointers
opaque then you won't be able to filter clone3() based on ->set_tid.
Now, maybe that's something nobody cares about, but it should be taken
into consideration that one of the handful of "obvious" syscalls will
already not be completely-filterable with second-level pointers being
opaque.

But if that's fine (at least for a first iteration), then I'm also okay
with that.

> > But I think that we need some documented consensus on all that stuff
> > which I stressed in other mails before. I'll hand something in about
> > this, if that's ok than we can hash this out.
> 
> Aleksa, I know you had an entire presentation[1] on the extensible
> argument syscalls, but was there any text-based design doc that you made?
> 
> It would be really nice to update Documentation/process/adding-syscalls.rst
> with the specifics[2], and to (now) include the "no nested flags"
> requirement. What do you think?

Christian and I wrote a patch for adding-syscalls last year[1], but Jon
felt that it should require greater community consensus before it gets
put into adding-syscalls. But yes, I'm definitely in favour of having
this be a properly-documented aspect of new syscall design.

[1]: https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 186 bytes --]

_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-22  4:06               ` Aleksa Sarai
@ 2020-05-22  7:35                 ` Christian Brauner
  2020-05-22 11:27                   ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2020-05-22  7:35 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: ksummit

On Fri, May 22, 2020 at 02:06:06PM +1000, Aleksa Sarai wrote:
> On 2020-05-20, Kees Cook <keescook@chromium.org> wrote:
> > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote:
> > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> > > > > Perhaps the question is "how deeply does seccomp need to inspect?"
> > > > > and maybe it does not get to see anything beyond just the "top level"
> > > > > struct (i.e. struct clone_args) and all pointers within THAT become
> > > > > opaque? That certainly simplifies the design.
> > > > 
> > > > Exactly. I think that's the most common situation by far. Does anybody
> > > > really really need to care at a deep level, and why?
> > > 
> > > We mostly don't and making all second-level pointers opaque is ok imho.
> > 
> > That'll make things MUCH easier. :)
> 
> To be clear, my insistence on the second-level pointers topic is coming
> from the view that we should make sure whatever model we use for the
> first iteration of deep argument inspection can be expanded to
> second-level pointers if we need them. The jump-table proposal I had was
> just an example of how we could plan out a design that could be
> implemented piece-meal (heck, we don't even need jump-tables in the
> first iteration -- so long as we have an idea for how they'd work).
> 
> I also hasten to point out that if we make all second-level pointers
> opaque then you won't be able to filter clone3() based on ->set_tid.

That's not an interesting second-level case. Either turn it on or off;
base it on set_tid_size which tells you whether someone requested it or
not. There's absolutely no reason to filter around in set_tid size ([1]).
That was considered when adding this for checkpoint restore. You either
allow someone that is sufficiently capable in the owning user namespace
of each pid namespace it wants to select specific pids in or you simply
deny it. That's not a great strawman.

Interesting second level pointers are where you have second level
pointers that can point to differnet things or multiple features at once
based on an opaque switch. If you're looking for interesting second
level pointers look at bpf(). One example, is just the
BPF_OBJ_GET_INFO_BY_FD command wich passes a struct info which contains
an fd and depending on what type of fd that is, info can be either
struct bpf_prog_info, struct bpf_map_info, or struct bpf_btf_info some
of which can have other third level pointers in there.

[1]: There already wouldn't be any point to this if it were a first
     level pointer because you always need to determine the pid
     namespace hierarchy of the caller first to know whether or not you
     want to deny choosing a specific pid in a given namespace. That's
     nonsense.

> Now, maybe that's something nobody cares about, but it should be taken
> into consideration that one of the handful of "obvious" syscalls will
> already not be completely-filterable with second-level pointers being
> opaque.
> 
> But if that's fine (at least for a first iteration), then I'm also okay
> with that.
> 
> > > But I think that we need some documented consensus on all that stuff
> > > which I stressed in other mails before. I'll hand something in about
> > > this, if that's ok than we can hash this out.
> > 
> > Aleksa, I know you had an entire presentation[1] on the extensible
> > argument syscalls, but was there any text-based design doc that you made?
> > 
> > It would be really nice to update Documentation/process/adding-syscalls.rst
> > with the specifics[2], and to (now) include the "no nested flags"
> > requirement. What do you think?
> 
> Christian and I wrote a patch for adding-syscalls last year[1], but Jon
> felt that it should require greater community consensus before it gets
> put into adding-syscalls. But yes, I'm definitely in favour of having
> this be a properly-documented aspect of new syscall design.
> 
> [1]: https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>


_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development
  2020-05-22  7:35                 ` Christian Brauner
@ 2020-05-22 11:27                   ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2020-05-22 11:27 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: ksummit

On Fri, May 22, 2020 at 09:35:35AM +0200, Christian Brauner wrote:
> On Fri, May 22, 2020 at 02:06:06PM +1000, Aleksa Sarai wrote:
> > On 2020-05-20, Kees Cook <keescook@chromium.org> wrote:
> > > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote:
> > > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote:
> > > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote:
> > > > > > Perhaps the question is "how deeply does seccomp need to inspect?"
> > > > > > and maybe it does not get to see anything beyond just the "top level"
> > > > > > struct (i.e. struct clone_args) and all pointers within THAT become
> > > > > > opaque? That certainly simplifies the design.
> > > > > 
> > > > > Exactly. I think that's the most common situation by far. Does anybody
> > > > > really really need to care at a deep level, and why?
> > > > 
> > > > We mostly don't and making all second-level pointers opaque is ok imho.
> > > 
> > > That'll make things MUCH easier. :)
> > 
> > To be clear, my insistence on the second-level pointers topic is coming
> > from the view that we should make sure whatever model we use for the
> > first iteration of deep argument inspection can be expanded to
> > second-level pointers if we need them. The jump-table proposal I had was
> > just an example of how we could plan out a design that could be
> > implemented piece-meal (heck, we don't even need jump-tables in the
> > first iteration -- so long as we have an idea for how they'd work).
> > 
> > I also hasten to point out that if we make all second-level pointers
> > opaque then you won't be able to filter clone3() based on ->set_tid.
> 
> That's not an interesting second-level case. Either turn it on or off;
> base it on set_tid_size which tells you whether someone requested it or
> not. There's absolutely no reason to filter around in set_tid size ([1]).
> That was considered when adding this for checkpoint restore. You either
> allow someone that is sufficiently capable in the owning user namespace
> of each pid namespace it wants to select specific pids in or you simply
> deny it. That's not a great strawman.
> 
> Interesting second level pointers are where you have second level
> pointers that can point to differnet things or multiple features at once
> based on an opaque switch. If you're looking for interesting second
> level pointers look at bpf(). One example, is just the
> BPF_OBJ_GET_INFO_BY_FD command wich passes a struct info which contains
> an fd and depending on what type of fd that is, info can be either
> struct bpf_prog_info, struct bpf_map_info, or struct bpf_btf_info some
> of which can have other third level pointers in there.

Other examples include (possibly epoll_ctl's struct epoll_event),
struct iovec in general, {get,set}_robust_list(), kexec_load()'s struct
kexec_segment, and sendmsg()'s and recvmsg()'s sruct msghdr with a large
number of additional substructs passed through passing a struct iovec.
Most of these I reckon are uninteresting and will just in general be not
allowed if there's a problem.

> 
> [1]: There already wouldn't be any point to this if it were a first
>      level pointer because you always need to determine the pid
>      namespace hierarchy of the caller first to know whether or not you
>      want to deny choosing a specific pid in a given namespace. That's
>      nonsense.
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, back to index

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 16:17 [Ksummit-discuss] [TECH TOPIC] seccomp feature development Kees Cook
2020-05-20 16:31 ` Al Viro
2020-05-20 18:05   ` Kees Cook
2020-05-20 18:16     ` Al Viro
2020-05-20 18:27     ` Linus Torvalds
2020-05-20 19:04       ` Kees Cook
2020-05-20 19:08         ` Linus Torvalds
2020-05-20 20:24           ` Christian Brauner
2020-05-20 20:52             ` Kees Cook
2020-05-20 21:02               ` Christian Brauner
2020-05-22  4:06               ` Aleksa Sarai
2020-05-22  7:35                 ` Christian Brauner
2020-05-22 11:27                   ` Christian Brauner
2020-05-20 22:12         ` Alexei Starovoitov
2020-05-20 23:39           ` Kees Cook
2020-05-21  0:43             ` Alexei Starovoitov

Ksummit-Discuss Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/ksummit-discuss/0 ksummit-discuss/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ksummit-discuss ksummit-discuss/ https://lore.kernel.org/ksummit-discuss \
		ksummit-discuss@lists.linuxfoundation.org
	public-inbox-index ksummit-discuss

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.linuxfoundation.lists.ksummit-discuss


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git