linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* seccomp feature development
@ 2020-05-18 21:04 Kees Cook
  2020-05-18 22:39 ` Jann Horn
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Kees Cook @ 2020-05-18 21:04 UTC (permalink / raw)
  To: Christian Brauner, Tycho Andersen, Sargun Dhillon, Matt Denton
  Cc: Chris Palmer, Jeffrey Vander Stoep, containers, linux-api, linux-kernel

Hi!

This is my attempt at a brain-dump on my plans for nearish-term seccomp
features. Welcome to my TED talk... ;)

These are the things I've been thinking about:

- fd passing
- deep argument inspection
- changing structure sizes
- syscall bitmasks

So, diving right in:


## fd passing

Background: seccomp users want to be able to install an fd in a
monitored process during a user_notif to emulate "open" calls (or
similar), possibly across security boundaries, etc.

On the fd passing front, it seems that gaining pidfd_addfd() is the way
to go as it allows for generic use not tied to seccomp in particular.
I expect this feature will be developed orthogonally to seccomp (where
does this stand, BTW?). However, as Sargun has shown[1], seccomp could
be friendlier to help with using it. Things that need to be resolved:

- report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
  if we're going to step back and make some design choices here, is
  there a place for pidfds in seccomp user_notif, in order to avoid
  needing the user_notif cookie? I think probably not: it's a rather lot
  of overhead for notifications. It seems it's safe to perform an fd
  installation with these steps:
	- get pidnr from user_notif_recv
	- open pidfd from pidnr
	- re-verify user_notif cookie is still valid
	- send new fd via pidfd
	- reply with user_notif_send
	- close pidfd

- how to deal with changing sizes of the user_notif structures to
  include a pidnr. (Which will be its own topic below.)


## deep argument inspection

Background: seccomp users would like to write filters that traverse
the user pointers passed into many syscalls, but seccomp can't do this
dereference for a variety of reasons (mostly involving race conditions and
rearchitecting the entire kernel syscall and copy_from_user() code flows).

During the last plumbers and in conversations since, the grudging
consensus was reached that having seccomp do this for ALL syscalls was
likely going to be extremely disruptive for very little gain (i.e.
many things, like pathnames, have differing lifetimes, aliases, unstable
kernel object references, etc[6]), but that there were a small subset of
syscalls for which this WOULD be beneficial, and those are the newly
created "Extensible Argument" syscalls (is there a better name for this
design? I'm calling it "EA" for the rest of the email), like clone3(),
openat2(), etc, which pass a pointer and a size:

long clone3(struct clone_args *cl_args, size_t size);

I think it should be possible to extend seccomp to examine this structure
by appending it to seccomp_data, and allowing filters to examine the
contents. This means that no BPF language extensions are required for
seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
seccomp's design principles work well with maps, kernel helpers, etc,
and I think the earlier the examination of using eBPF for user_notif
bares this out).

In order for this to work, there are a number of prerequisites:

- argument caching, in two halves: syscall side and seccomp side:
  - the EA syscalls needs to include awareness of potential seccomp
    hooking. i.e. seccomp may have done the copy_from_user() already and
    kept a cached copy.
  - seccomp needs to potentially DO the copy_from_user() itself when it
    hits these syscalls for a given filter, and put it somewhere for
    later use by the syscall.
- the sizes of these EA structs are, by design, growable over time.
  seccomp and its users need to be handle this in a forward and backward
  compatible way, similar to the design of the EA syscall interface
  itself.

The argument caching bit is, I think, rather mechanical in nature since
it's all "just" internal to the kernel: seccomp can likely adjust how it
allocates seccomp_data (maybe going so far as to have it split across two
pages with the syscall argument struct always starting on the 2nd page
boundary), and copying the EA struct into that page, which will be both
used by the filter and by the syscall. I imagine state tracking ("is
there a cached EA?", "what is the address of seccomp_data?", "what is
the address of the EA?") can be associated with the thread struct.

The growing size of the EA struct will need some API design. For filters
to operate on the contiguous seccomp_data+EA struct, the filter will
need to know how large seccomp_data is (more on this later), and how
large the EA struct is. When the filter is written in userspace, it can
do the math, point into the expected offsets, and get what it needs. For
this to work correctly in the kernel, though, the seccomp BPF verifier
needs to know the size of the EA struct as well, so it can correctly
perform the offset checking (as it currently does for just the
seccomp_data struct size).

Since there is not really any caller-based "seccomp state" associated
across seccomp(2) calls, I don't think we can add a new command to tell
the kernel "I'm expecting the EA struct size to be $foo bytes", since
the kernel doesn't track who "I" is besides just being "current", which
doesn't take into account the thread lifetime -- if a process launcher
knows about one size and the child knows about another, things will get
confused. The sizes really are just associated with individual filters,
based on the syscalls they're examining. So, I have thoughts on possible
solutions:

- create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
  EA style so we can pass in more than a filter and include also an
  array of syscall to size mappings. (I don't like this...)
- create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
  the meaning of the uarg from "filter" to a EA-style structure with
  sizes and pointers to the filter and an array of syscall to size
  mappings. (I like this slightly better, but I still don't like it.)
- leverage the EA design and just accept anything <= PAGE_SIZE, record
  the "max offset" value seen during filter verification, and zero-fill
  the EA struct with zeros to that size when constructing the
  seccomp_data + EA struct that the filter will examine. Then the seccomp
  filter doesn't care what any of the sizes are, and userspace doesn't
  care what any of the sizes are. (I like this as it makes the problems
  to solve contained entirely by the seccomp infrastructure and does not
  touch user API, but I worry I'm missing some gotcha I haven't
  considered.)

And then, my age-old concern, that maybe doesn't need a solution... I
remain plagued by the lack of pathname inspection. But I think the
ToCToU nature of it means we just cannot do it from seccomp. It does
make filtering openat2()'s EA struct a bit funny... a filter has no idea
what path it applies to... but that doesn't matter because the object
the path points to might change[6] during the syscall. Argh.


## changing structure sizes

Background: there have been regular needs to add things to various
seccomp structures. Each come with their own unique pains, and solving
this as completely as possible in a future-proof way would be very nice.

As noted in "fd passing" above, there is a desire to add some useful
things to the user_notif struct (e.g. thread group pid). Similarly,
there have been requests in the past (though I can't find any good
references right now, just my indirect comments[3]) to grow seccomp_data.
Combined with the EA struct work above, there's a clear need for seccomp
to reexamine how it deals with its API structures (and how this
interacts with filters).

First, let's consider seccomp_data. If we grow it, the EA struct offset
will move, based on the deep arg inspection design above. Alternatively,
we could instead put seccomp_data offset 0, and EA struct at offset
PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
the filter access whatever it thinks is there, with it being zero-filled
by the kernel. For any values where 0 is valid, there will just need to
be a "is that field valid?" bit before it:

	unsigned long feature_bits;
	unsigned long interesting_thing_1;
	unsigned long interesting_thing_2;
	unsigned long interesting_thing_3;
	...

and the filter would check feature_bits...

(However, this needs to be carefully considered given that seccomp_data
is embedded in user_notif... should the EA struct from userspace also be
copied into user_notif? More thoughts on this below...)

For user_notif, I think we need something in and around these options:

- make a new API that explicitly follows EA struct design
  (and while read()/write() might be easier[4], I tend to agree with
  Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
  for data". Though I wonder if read() could be used for the notifications,
  which ARE data, and use ioctl() for the responses?)
- make a new API that is perf_event_open()-style where fields are
  explicitly requested, as Sargun suggested[5]. (This looks like it
  might be complex to construct, but would get us by far the most
  extensible API.)
- jam whatever we pick into the existing API (we'll be forced to do
  SOMETHING to make the old API still work, so, I dunno what that will
  look like until we finish the rest of the design).

If we did a requested-fields approach, what would the user_notif event
block of bytes look like? Would it be entirely dynamic based on the
initial ioctl()? Another design consideration here is that we don't want
the kernel doing tons of work (especially copying) and tossing tons
of stuff into a huge structure that the user doesn't care about. In
addition to explicit fields, maybe the EA struct could be included,
perhaps with specified offset/size, so only the portion the user_notif
user wanted to inspect was copied?

The complexity of the per-field API is higher, but I think it might be
the most robust and have the greatest chance at being performant.
For example, "send me user_notif but I only care about the pid" would
mean no syscall arguments are copied, etc.


## syscall bitmasks

Background: the number one bit of feedback on seccomp has been performance
concerns, especially for fast syscalls like futex(). When looking at
where time is spent, it is very clearly spent running the filters (which
I found surprising, given that adding TIF_SECCOMP tended to trip the
"slow path" syscall path (though most architectures these days just
don't have a fast path any more thanks to Meltdown). It would be nice
to make filtering faster without running BPF at all. :)

Nearly every thread on adding eBPF, for example, has been about trying to
speed up the if/then nature of BPF for finding a syscall that the filter
wants to always accept (or reject). The bulk of most seccomp filters
are pretty simple, usually they're either "reject everything except
$set-of-syscalls", or "accept everything except $set-of-syscalls". The
stuff in between tends to be a mix, like "accept $some, process $these
with argument checks, and reject $remaining".

In all three cases, the entire seccomp() path could be sped up by having
a syscall bitmask that could be applied before the filters are ever run,
with 3 (actually 2) syscall bitmasks: accept, reject, or process. If
a syscall was in the "accept" bitmask, we immediately exit seccomp and
continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected
and immediately exit seccomp. And finally, we run the filters. In all
ways, doing bitmask math is going to be faster than running the BPF. ;)

So how would the API for this work? I have two thoughts, and I don't
think they're exclusive:

- new API for "add this syscall to the reject bitmask". We can't really
  do an "accept" bitmask addition without processing the attached
  filters...
- process attached filters! Each time a filter is added, have the
  BPF verifier do an analysis to determine if there are any static
  results, and set bits in the various bitmasks to represent it.
  i.e. when seccomp is first enabled on a thread, the "accept"
  bitmask is populated with all syscalls, and for each filter, do
  [math,simulation,magic] and knock each syscall out of "accept" if
  it isn't always accepted, and further see if there are any syscalls
  that are always rejected, and mark those in the "reject" bitmask.


Okay, that's all I had ... what do people think?

-Kees


[1] https://lore.kernel.org/lkml/20200515234005.32370-1-sargun@sargun.me/
    https://lore.kernel.org/lkml/20200124091743.3357-1-sargun@sargun.me/
[2] http://man7.org/linux/man-pages/man2/openat2.2.html#NOTES
[3] https://lore.kernel.org/lkml/202003172058.3CB0D95@keescook/
[4] https://lore.kernel.org/lkml/20200518124500.5cb7rtjitbiiw3mq@wittgenstein/
[5] https://lore.kernel.org/lkml/CAMp4zn-Ak0062t9HfMMZvKNv1+EAujgEeg5c4-gtjD-pAGAtTw@mail.gmail.com/
    http://man7.org/linux/man-pages/man2/perf_event_open.2.html
[6] see "move argument parsing?" https://outflux.net/slides/2019/lpc/deep-arg-inspection.pdf
    https://www.youtube.com/watch?v=PnOSPsRzVYM


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
@ 2020-05-18 22:39 ` Jann Horn
  2020-05-19  2:48   ` Aleksa Sarai
                     ` (2 more replies)
  2020-05-18 22:55 ` Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 22+ messages in thread
From: Jann Horn @ 2020-05-18 22:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Linux Containers, Linux API,
	kernel list

On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> ## deep argument inspection
>
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).

Also, other than for syscall entry, it might be worth thinking about
whether we want to have a special hook into seccomp for io_uring.
io_uring is growing support for more and more syscalls, including
things like openat2, connect, sendmsg, splice and so on, and that list
is probably just going to grow in the future. If people start wanting
to use io_uring in software with seccomp filters, it might be
necessary to come up with some mechanism to prevent io_uring from
permitting access to almost everything else...

Probably not a big priority for now, but something to keep in mind for
the future.

[...]
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall.

We could also do the same kind of thing the eBPF verifier does in
convert_ctx_accesses(), and rewrite the context accesses to actually
go through two different pointers depending on the (constant) offset
into seccomp_data.

> I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.

You probably mean the task struct?

> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
>
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
>
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct with zeros to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)

We don't need to actually zero-fill memory for this beyond what the
kernel supports - AFAIK the existing APIs already say that passing a
short length is equivalent to passing zeroes, so we can just replace
all out-of-bounds loads with zeroing registers in the filter.
The tricky case is what should happen if the userspace program passes
in fields that the filter doesn't know about. The filter can see the
length field passed in by userspace, and then just reject anything
where the length field is bigger than the structure size the filter
knows about. But maybe it'd be slightly nicer if there was an
operation for "tell me whether everything starting at offset X is
zeroes", so that if someone compiles with newer kernel headers where
the struct is bigger, and leaves the new fields zeroed, the syscalls
still go through an outdated filter properly.

> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.

I don't think seccomp needs to care about paths. Instead, you can use
one of these three approaches:

1) You can make openat2() the only syscall that is allowed to take
non-empty path arguments, and restrict it to
RESOLVE_BENEATH|RESOLVE_IN_ROOT. (For APIs that use AT_EMPTY_PATH, we
can probably come up with some way to say "this part must be an empty
string" - e.g. by defining a new bogus placeholder pointer that you
can use as "empty path", or something like that.) This is basically
like the old capsicum O_BENEATH stuff, except with seccomp doing the
enforcement that you're not using absolute paths or things like that.
2) You can create a new mount namespace, then use open_tree() with
OPEN_TREE_CLONE to create file descriptors to ephemeral bind mounts,
then sandbox yourself with pivot_root().
3) You can use Mickael's landlock, once that's landed.

In particular the first and second option are things you can already
do today - although the first one requires that you have a seccomp
filter that prevents the program from overwriting/unmapping/... the
chunk of memory that contains the openat2 argument structure, so that
you can then whitelist the argument structure's address in the filter.

> ## changing structure sizes
>
> Background: there have been regular needs to add things to various
> seccomp structures. Each come with their own unique pains, and solving
> this as completely as possible in a future-proof way would be very nice.
>
> As noted in "fd passing" above, there is a desire to add some useful
> things to the user_notif struct (e.g. thread group pid). Similarly,
> there have been requests in the past (though I can't find any good
> references right now, just my indirect comments[3]) to grow seccomp_data.

This thing (which hasn't landed so far, but would be a really awesome
feature) needed to add stuff to seccomp_data:
<https://lore.kernel.org/linux-api/20181029112343.27454-1-msammler@mpi-sws.org/>

> Combined with the EA struct work above, there's a clear need for seccomp
> to reexamine how it deals with its API structures (and how this
> interacts with filters).
>
> First, let's consider seccomp_data. If we grow it, the EA struct offset
> will move, based on the deep arg inspection design above. Alternatively,
> we could instead put seccomp_data offset 0, and EA struct at offset
> PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> the filter access whatever it thinks is there, with it being zero-filled
> by the kernel. For any values where 0 is valid, there will just need to
> be a "is that field valid?" bit before it:
>
>         unsigned long feature_bits;
>         unsigned long interesting_thing_1;
>         unsigned long interesting_thing_2;
>         unsigned long interesting_thing_3;
>         ...
>
> and the filter would check feature_bits...

(Apart from the user_notif stuff, those feature bits would not
actually have to exist in memory; they could be inlined while loading
the program. Actually, not even the registers would have to exist in a
seccomp_data struct in memory, we could just replace the loads with
reads from the pt_regs, too.)

> (However, this needs to be carefully considered given that seccomp_data
> is embedded in user_notif... should the EA struct from userspace also be
> copied into user_notif? More thoughts on this below...)
>
> For user_notif, I think we need something in and around these options:
>
> - make a new API that explicitly follows EA struct design
>   (and while read()/write() might be easier[4], I tend to agree with
>   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
>   for data". Though I wonder if read() could be used for the notifications,
>   which ARE data, and use ioctl() for the responses?)

Just as a note: If we use read() there, we'll never be able to
transfer things like FDs through that API.

> ## syscall bitmasks

YES PLEASE

> Background: the number one bit of feedback on seccomp has been performance
> concerns, especially for fast syscalls like futex(). When looking at
> where time is spent, it is very clearly spent running the filters (which
> I found surprising, given that adding TIF_SECCOMP tended to trip the
> "slow path" syscall path (though most architectures these days just
> don't have a fast path any more thanks to Meltdown). It would be nice
> to make filtering faster without running BPF at all. :)
>
> Nearly every thread on adding eBPF, for example, has been about trying to
> speed up the if/then nature of BPF for finding a syscall that the filter
> wants to always accept (or reject). The bulk of most seccomp filters
> are pretty simple, usually they're either "reject everything except
> $set-of-syscalls", or "accept everything except $set-of-syscalls". The
> stuff in between tends to be a mix, like "accept $some, process $these
> with argument checks, and reject $remaining".
>
> In all three cases, the entire seccomp() path could be sped up by having
> a syscall bitmask that could be applied before the filters are ever run,
> with 3 (actually 2) syscall bitmasks: accept, reject, or process. If
> a syscall was in the "accept" bitmask, we immediately exit seccomp and
> continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected
> and immediately exit seccomp. And finally, we run the filters. In all
> ways, doing bitmask math is going to be faster than running the BPF. ;)
>
> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
>
> - new API for "add this syscall to the reject bitmask". We can't really
>   do an "accept" bitmask addition without processing the attached
>   filters...
> - process attached filters! Each time a filter is added, have the
>   BPF verifier do an analysis to determine if there are any static
>   results, and set bits in the various bitmasks to represent it.
>   i.e. when seccomp is first enabled on a thread, the "accept"
>   bitmask is populated with all syscalls, and for each filter, do
>   [math,simulation,magic] and knock each syscall out of "accept" if
>   it isn't always accepted, and further see if there are any syscalls
>   that are always rejected, and mark those in the "reject" bitmask.

Other options:
 - add a "load from read-only memory" opcode and permit specifying the
data that should be in that memory when loading the filter
 - make the seccomp API take an array of (syscall-number,
instruction-offset) tuples, and start evaluation of the filter at an
offset if it's one of those syscalls

One more thing that would be really nice: Is there a way we can have
64-bit registers in our seccomp filters? At the moment, every
comparison has to turn into three ALU ops, which is pretty messy;
libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and
it contributes to the horrific code Chrome's BPF generator creates.
Here's some pseudocode from my hacky BPF disassembler, which shows
pretty much just the filter code for filtering getpriority() and
setpriority() in a Chrome renderer, with tons of useless dead code:

0139         if args[0].high == 0x00000000: [true +3, false +0]
013e           if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
0140           if args[1].high == 0x00000000: [true +3, false +0]
0145             if args[1].low == 0x00000000: [true +149, false +0]
-> ret ALLOW (syscalls: getpriority, setpriority)
0147             if args[1].high == 0x00000000: [true +3, false +0]
014c               if args[1].low == 0x00000001: [true +142, false
+141] -> ret ALLOW (syscalls: getpriority, setpriority)
01da               ret ERRNO
0148             if args[1].high != 0xffffffff: [true +142, false +0]
-> ret TRAP
014a             if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c             if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da             ret ERRNO
0141           if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
0143           if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
false +0] -> ret TRAP
0145           if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147           if args[1].high == 0x00000000: [true +3, false +0]
014c             if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da             ret ERRNO
0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c           if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da           ret ERRNO
013a         if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP
013c         if args[0].low NO-COMMON-BITS 0x80000000: [true +154,
false +0] -> ret TRAP
013e         if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
0140         if args[1].high == 0x00000000: [true +3, false +0]
0145           if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147           if args[1].high == 0x00000000: [true +3, false +0]
014c             if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da             ret ERRNO
0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c           if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da           ret ERRNO
0141         if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
0143         if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
false +0] -> ret TRAP
0145         if args[1].low == 0x00000000: [true +149, false +0] ->
ret ALLOW (syscalls: getpriority, setpriority)
0147         if args[1].high == 0x00000000: [true +3, false +0]
014c           if args[1].low == 0x00000001: [true +142, false +141]
-> ret ALLOW (syscalls: getpriority, setpriority)
01da           ret ERRNO
0148         if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
014a         if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
false +0] -> ret TRAP
014c         if args[1].low == 0x00000001: [true +142, false +141] ->
ret ALLOW (syscalls: getpriority, setpriority)
01da         ret ERRNO

which is generated by this little snippet of C++ code:

ResultExpr RestrictGetSetpriority(pid_t target_pid) {
  const Arg<int> which(0);
  const Arg<int> who(1);
  return If(which == PRIO_PROCESS,
            Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM)))
      .Else(CrashSIGSYS());
}

On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc,
we're actually already using the eBPF backend infrastructure... so
maybe it would be an option to keep enforcing basically the same rules
that we currently have for cBPF, but use the eBPF instruction format?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
  2020-05-18 22:39 ` Jann Horn
@ 2020-05-18 22:55 ` Andy Lutomirski
  2020-05-19  7:09 ` Aleksa Sarai
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2020-05-18 22:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Linux Containers, Linux API,
	LKML

On Mon, May 18, 2020 at 2:05 PM Kees Cook <keescook@chromium.org> wrote:
>
> Hi!
>

This is minor, but, if we grow seccomp_data, I would like to add the
other 32 bits of the syscall nr to it.  Syscall numbers are unsigned
long, but they get munged into u32 for seccomp_data.

Sure, no one uses those high bits yet, but if we're extending things
anyway, let's support them.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 22:39 ` Jann Horn
@ 2020-05-19  2:48   ` Aleksa Sarai
  2020-05-19 10:35     ` Christian Brauner
  2020-05-19 16:18     ` Alexei Starovoitov
  2020-05-19  7:24   ` Sargun Dhillon
  2020-06-03 19:09   ` Kees Cook
  2 siblings, 2 replies; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-19  2:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Christian Brauner, Tycho Andersen, Sargun Dhillon,
	Matt Denton, Chris Palmer, Jeffrey Vander Stoep,
	Linux Containers, Linux API, kernel list

[-- Attachment #1: Type: text/plain, Size: 11487 bytes --]

On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > ## deep argument inspection
> >
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> Also, other than for syscall entry, it might be worth thinking about
> whether we want to have a special hook into seccomp for io_uring.
> io_uring is growing support for more and more syscalls, including
> things like openat2, connect, sendmsg, splice and so on, and that list
> is probably just going to grow in the future. If people start wanting
> to use io_uring in software with seccomp filters, it might be
> necessary to come up with some mechanism to prevent io_uring from
> permitting access to almost everything else...
> 
> Probably not a big priority for now, but something to keep in mind for
> the future.

Indeed. Quite a few people have raised concerns about io_uring and its
debug-ability, but I agree that another less-commonly-mentioned concern
should be how you restrict io_uring(2) from doing operations you've
disallowed through seccomp. Though obviously user_notif shouldn't be
allowed. :D

> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall.
> 
> We could also do the same kind of thing the eBPF verifier does in
> convert_ctx_accesses(), and rewrite the context accesses to actually
> go through two different pointers depending on the (constant) offset
> into seccomp_data.

My main worry with this is that we'll need to figure out what kind of
offset mathematics are necessary to deal with pointers inside the
extensible struct. As a very ugly proposal, you could make it so that
you multiply the offset by PAGE_SIZE each time you want to dereference
the pointer at that offset (unless we want to add new opcodes to cBPF to
allow us to represent this).

This might even be needed for seccomp user_notif -- given one of the
recent proposals was basically to just add two (extensible) struct
pointers inside the main user_notif struct.

> > I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> You probably mean the task struct?
> 
> > The growing size of the EA struct will need some API design. For filters
> > to operate on the contiguous seccomp_data+EA struct, the filter will
> > need to know how large seccomp_data is (more on this later), and how
> > large the EA struct is. When the filter is written in userspace, it can
> > do the math, point into the expected offsets, and get what it needs. For
> > this to work correctly in the kernel, though, the seccomp BPF verifier
> > needs to know the size of the EA struct as well, so it can correctly
> > perform the offset checking (as it currently does for just the
> > seccomp_data struct size).
> >
> > Since there is not really any caller-based "seccomp state" associated
> > across seccomp(2) calls, I don't think we can add a new command to tell
> > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > the kernel doesn't track who "I" is besides just being "current", which
> > doesn't take into account the thread lifetime -- if a process launcher
> > knows about one size and the child knows about another, things will get
> > confused. The sizes really are just associated with individual filters,
> > based on the syscalls they're examining. So, I have thoughts on possible
> > solutions:
> >
> > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> >   EA style so we can pass in more than a filter and include also an
> >   array of syscall to size mappings. (I don't like this...)
> > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> >   the meaning of the uarg from "filter" to a EA-style structure with
> >   sizes and pointers to the filter and an array of syscall to size
> >   mappings. (I like this slightly better, but I still don't like it.)
> > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> >   the "max offset" value seen during filter verification, and zero-fill
> >   the EA struct with zeros to that size when constructing the
> >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> >   filter doesn't care what any of the sizes are, and userspace doesn't
> >   care what any of the sizes are. (I like this as it makes the problems
> >   to solve contained entirely by the seccomp infrastructure and does not
> >   touch user API, but I worry I'm missing some gotcha I haven't
> >   considered.)
> 
> We don't need to actually zero-fill memory for this beyond what the
> kernel supports - AFAIK the existing APIs already say that passing a
> short length is equivalent to passing zeroes, so we can just replace
> all out-of-bounds loads with zeroing registers in the filter.
> The tricky case is what should happen if the userspace program passes
> in fields that the filter doesn't know about. The filter can see the
> length field passed in by userspace, and then just reject anything
> where the length field is bigger than the structure size the filter
> knows about. But maybe it'd be slightly nicer if there was an
> operation for "tell me whether everything starting at offset X is
> zeroes", so that if someone compiles with newer kernel headers where
> the struct is bigger, and leaves the new fields zeroed, the syscalls
> still go through an outdated filter properly.

I think the best way of handling this (without breaking programs
senselessly) is to have filters essentially emulate
copy_struct_from_user() semantics -- which is along the lines of what
you've suggested.

That means in addition to treating all offsets past the end of usize as
being zeroed, we also get filters to only reject programs with usize >
fsize (the size the filter knows) if the trailing (fsize - usize) bytes
are non-zero. Having an opcode for that (a-la bpf_check_uarg_tail_zero)
would be a good idea.

> > And then, my age-old concern, that maybe doesn't need a solution... I
> > remain plagued by the lack of pathname inspection. But I think the
> > ToCToU nature of it means we just cannot do it from seccomp. It does
> > make filtering openat2()'s EA struct a bit funny... a filter has no idea
> > what path it applies to... but that doesn't matter because the object
> > the path points to might change[6] during the syscall. Argh.
> 
> I don't think seccomp needs to care about paths. Instead, you can use
> one of these three approaches:
> 
> 1) You can make openat2() the only syscall that is allowed to take
> non-empty path arguments, and restrict it to
> RESOLVE_BENEATH|RESOLVE_IN_ROOT. (For APIs that use AT_EMPTY_PATH, we
> can probably come up with some way to say "this part must be an empty
> string" - e.g. by defining a new bogus placeholder pointer that you
> can use as "empty path", or something like that.) This is basically
> like the old capsicum O_BENEATH stuff, except with seccomp doing the
> enforcement that you're not using absolute paths or things like that.

The only concern I'd have with this is the dfd argument (would you only
permit AT_FDCWD?). If you want to let programs run inside a
pre-configured jail directory then you'd need to white-list only *that*
fd and then block all other syscalls which would let you overwrite that
fd with something else...

> 2) You can create a new mount namespace, then use open_tree() with
> OPEN_TREE_CLONE to create file descriptors to ephemeral bind mounts,
> then sandbox yourself with pivot_root().
> 3) You can use Mickael's landlock, once that's landed.

Or (4), just use LSMs to achieve this goal (which is basically 3, but
applies to all LSMs).

> > ## changing structure sizes
> >
> > Background: there have been regular needs to add things to various
> > seccomp structures. Each come with their own unique pains, and solving
> > this as completely as possible in a future-proof way would be very nice.
> >
> > As noted in "fd passing" above, there is a desire to add some useful
> > things to the user_notif struct (e.g. thread group pid). Similarly,
> > there have been requests in the past (though I can't find any good
> > references right now, just my indirect comments[3]) to grow seccomp_data.
> 
> This thing (which hasn't landed so far, but would be a really awesome
> feature) needed to add stuff to seccomp_data:
> <https://lore.kernel.org/linux-api/20181029112343.27454-1-msammler@mpi-sws.org/>
> 
> > Combined with the EA struct work above, there's a clear need for seccomp
> > to reexamine how it deals with its API structures (and how this
> > interacts with filters).
> >
> > First, let's consider seccomp_data. If we grow it, the EA struct offset
> > will move, based on the deep arg inspection design above. Alternatively,
> > we could instead put seccomp_data offset 0, and EA struct at offset
> > PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> > the filter access whatever it thinks is there, with it being zero-filled
> > by the kernel. For any values where 0 is valid, there will just need to
> > be a "is that field valid?" bit before it:
> >
> >         unsigned long feature_bits;
> >         unsigned long interesting_thing_1;
> >         unsigned long interesting_thing_2;
> >         unsigned long interesting_thing_3;
> >         ...
> >
> > and the filter would check feature_bits...
> 
> (Apart from the user_notif stuff, those feature bits would not
> actually have to exist in memory; they could be inlined while loading
> the program. Actually, not even the registers would have to exist in a
> seccomp_data struct in memory, we could just replace the loads with
> reads from the pt_regs, too.)
> 
> > (However, this needs to be carefully considered given that seccomp_data
> > is embedded in user_notif... should the EA struct from userspace also be
> > copied into user_notif? More thoughts on this below...)
> >
> > For user_notif, I think we need something in and around these options:
> >
> > - make a new API that explicitly follows EA struct design
> >   (and while read()/write() might be easier[4], I tend to agree with
> >   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> >   for data". Though I wonder if read() could be used for the notifications,
> >   which ARE data, and use ioctl() for the responses?)
> 
> Just as a note: If we use read() there, we'll never be able to
> transfer things like FDs through that API.

And we run into the age-old "read() for management can be a bit hairy"
problem.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
  2020-05-18 22:39 ` Jann Horn
  2020-05-18 22:55 ` Andy Lutomirski
@ 2020-05-19  7:09 ` Aleksa Sarai
  2020-05-19 11:01   ` Christian Brauner
  2020-05-19 10:26 ` Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-19  7:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Daniel Vetter, containers,
	linux-api, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 17057 bytes --]

On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> ## fd passing
> 
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
> 
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
> 
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
> 	- get pidnr from user_notif_recv
> 	- open pidfd from pidnr
> 	- re-verify user_notif cookie is still valid
> 	- send new fd via pidfd
> 	- reply with user_notif_send
> 	- close pidfd

I agree that while it would be conceptually nicer to not need the
user_notif cookie, creating a new pidfd for every notification isn't a
good idea (not to mention that the program may not care about the pid or
pidfd when determining its response -- making the pidfd more of a pain
than anything else).

You'd also need to have special-casing based on the userspace length of
user_notif because old programs wouldn't know they'd have to close the
pidfd returned (which then brings us back to the fact that user_notif
doesn't specify what the usize of the struct is a-la
copy_struct_from_user).

> ## deep argument inspection
> 
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there were a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),
> openat2(), etc, which pass a pointer and a size:

I called them "extensible struct" syscalls, but "extensible arguments"
is probably a better name. Though if you want to call them EA, make sure
to include the ™. ;)

> long clone3(struct clone_args *cl_args, size_t size);
> 
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for
> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier the examination of using eBPF for user_notif
> bares this out).

While I agree that we don't need to jump to eBPF for this, I do think
we'll need an opcode for is_zeroed() (or bpf_check_uarg_tail_zero()) --
see below.

> In order for this to work, there are a number of prerequisites:
> 
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls needs to include awareness of potential seccomp
>     hooking. i.e. seccomp may have done the copy_from_user() already and
>     kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
>     hits these syscalls for a given filter, and put it somewhere for
>     later use by the syscall.
> 
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.

I agree that while this is an important thing to handle properly, it's
internal and probably doesn't deserve the lions share of design
discussion.

> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to be handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.
> 
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
> 
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
> 
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct with zeros to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)

Okay, so here is my view on this. I think that the third option is
closest to what I'd like to see. Based on Jann's email, I think we're on
the same page but I'd just like to elaborate it a bit further:

First of all -- ideally, the backward and forward compatibility that EA
syscalls give us should be reflected with seccomp filters being
similarly compatible. Otherwise we're going to run into issues where all
of the hard work with ensuring EA syscalls behave when extended will be
less valuable if seccomp cannot handle it sufficiently. This means that
I would hope that every combination of {old,new} filter/kernel/program
would work on a best-effort (but fail-safe) basis.

In my view, the simplest way (from the kernel side) would be to simply
do what you outlined in (3) -- make all accesses past usize (and even
ksize) be zeroed.

However in order to make an old filter fail-safe on a new kernel with a
new program, we'd need a new opcode which basically does
bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
would punt the fail-safe problem to userspace (libseccomp would need to
generate a check that any unknown-to-the-filter fields are zero). I
don't think this is a decision we can make in-kernel because it might be
that the filter doesn't care about the last field in a struct (and thus
doesn't access it) but we don't know the difference between a field the
filter doesn't care about and a field it doesn't know about.

Another issue which you haven't mentioned here is how do we deal with
pointers inside an EA struct. clone3() already has this with set_tid,
and the thread on user_notif seems to be converging towards having the
user_notif2 struct just contain two pointers to EA structs. We could
punt on this for now, so long as we leave room for us to deal with this
problem in the future. However, I think it would be misguided to enable
deep argument inspection on syscalls which have such entries (such as
clone3) until we've figured out how we want to handle nested pointers --
at the very least because the bpf_check_uarg_tail_zero() opcode
semantics need to consider this problem.

A silly proposal I had was that because the EA would never be bigger
than PAGE_SIZE, we could take the offset into the EA and multiply it by
PAGE_SIZE each time we wanted to do a "dereference" of a pointer in EA.
But this does have a few downsides (seccomp_data can no longer be just a
block of memory, we'd need to translate the offsets, and we'd need to
figure out what the new semantics of the bpf_check_uarg_tail_zero()
should be).

> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.

I still strongly believe that seccomp is the wrong layer to be doing
this. Yes, we could in theory do some checks like RESOLVE_IN_ROOT (to
detect if a path has changed name on the system), but that sounds like a
complete mess to put inside seccomp.

Yes, this does mean that filtering openat2() is a bit odd -- but so was
filtering openat() by the same token. Really, this is something which
the LSMs were specifically included to do and I think we should let them
handle this problem. Ultimately if/when Landlock gets merged then folks
will be able to have an eBPF-programmable LSM and no longer rely on
seccomp for all their programmable-policy needs.

And as Jann said, you could (with enough effort) come up with filter
rules that do effectively restrict the set of paths a program can access
without needing to look at the path argument. Though I would posit that
using deep argument inspection for that purpose is a little misguided,
as there are better tools for the job (such as LSMs).

> ## changing structure sizes
> 
> Background: there have been regular needs to add things to various
> seccomp structures. Each come with their own unique pains, and solving
> this as completely as possible in a future-proof way would be very nice.
> 
> As noted in "fd passing" above, there is a desire to add some useful
> things to the user_notif struct (e.g. thread group pid). Similarly,
> there have been requests in the past (though I can't find any good
> references right now, just my indirect comments[3]) to grow seccomp_data.
> Combined with the EA struct work above, there's a clear need for seccomp
> to reexamine how it deals with its API structures (and how this
> interacts with filters).
> 
> First, let's consider seccomp_data. If we grow it, the EA struct offset
> will move, based on the deep arg inspection design above. Alternatively,
> we could instead put seccomp_data offset 0, and EA struct at offset
> PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> the filter access whatever it thinks is there, with it being zero-filled
> by the kernel. For any values where 0 is valid, there will just need to
> be a "is that field valid?" bit before it:
> 
> 	unsigned long feature_bits;
> 	unsigned long interesting_thing_1;
> 	unsigned long interesting_thing_2;
> 	unsigned long interesting_thing_3;
> 	...
> 
> and the filter would check feature_bits...
> 
> (However, this needs to be carefully considered given that seccomp_data
> is embedded in user_notif... should the EA struct from userspace also be
> copied into user_notif? More thoughts on this below...)
> 
> For user_notif, I think we need something in and around these options:
> 
> - make a new API that explicitly follows EA struct design
>   (and while read()/write() might be easier[4], I tend to agree with
>   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
>   for data". Though I wonder if read() could be used for the notifications,
>   which ARE data, and use ioctl() for the responses?)

Note that having EA-style ioctl()s is not unheard of. Daniel Vetter
(added to Cc) mentioned after my EA talk at the last LCA that apparently
graphics has been doing this for years -- maybe we can look at how
they've handled this problem?

> - make a new API that is perf_event_open()-style where fields are
>   explicitly requested, as Sargun suggested[5]. (This looks like it
>   might be complex to construct, but would get us by far the most
>   extensible API.)

Unless I'm mistaken, I think statx(2) might be a better comparison. I
do think this is a reasonable idea, though if we make the struct an EA
(which I recommend) then we'll need to agree on copy_struct_to_user()
semantics.

> - jam whatever we pick into the existing API (we'll be forced to do
>   SOMETHING to make the old API still work, so, I dunno what that will
>   look like until we finish the rest of the design).

If we do the statx-like API then we can just make the old API have an
implied set of requested-fields that match whatever fields were present
before we added the requested-fields approach.

> If we did a requested-fields approach, what would the user_notif event
> block of bytes look like? Would it be entirely dynamic based on the
> initial ioctl()? Another design consideration here is that we don't want
> the kernel doing tons of work (especially copying) and tossing tons
> of stuff into a huge structure that the user doesn't care about. In
> addition to explicit fields, maybe the EA struct could be included,
> perhaps with specified offset/size, so only the portion the user_notif
> user wanted to inspect was copied?

I would be cautious about adding the EA struct -- there is still the
problem of nested pointers (and it might be even crazier to apply
whatever hacks we have for deep argument inspection into the user_notif
API).

> ## syscall bitmasks

I like this idea. :D

> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
> 
> - new API for "add this syscall to the reject bitmask". We can't really
>   do an "accept" bitmask addition without processing the attached
>   filters...

You could use the accept mask -- take the logical and of all the
filters' masks and that set is the ones you can skip and auto-accept.

> - process attached filters! Each time a filter is added, have the
>   BPF verifier do an analysis to determine if there are any static
>   results, and set bits in the various bitmasks to represent it.
>   i.e. when seccomp is first enabled on a thread, the "accept"
>   bitmask is populated with all syscalls, and for each filter, do
>   [math,simulation,magic] and knock each syscall out of "accept" if
>   it isn't always accepted, and further see if there are any syscalls
>   that are always rejected, and mark those in the "reject" bitmask.

I reckon that both approaches (in parallel) are worthwhile, as the
latter proposal will allow for existing filters to become faster while
the former allows new filters to become much smaller.

The only possible issue I see with a deny bitmask is that we'd have to
make sure that filters generated by libseccomp et al still handle
unknown syscalls correctly (so we don't end up with this feature causing
new filters to become black-lists).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 22:39 ` Jann Horn
  2020-05-19  2:48   ` Aleksa Sarai
@ 2020-05-19  7:24   ` Sargun Dhillon
  2020-05-19  8:30     ` Christian Brauner
  2020-06-03 19:09   ` Kees Cook
  2 siblings, 1 reply; 22+ messages in thread
From: Sargun Dhillon @ 2020-05-19  7:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Christian Brauner, Tycho Andersen, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Linux Containers, Linux API,
	kernel list

On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote:
> > For user_notif, I think we need something in and around these options:
> >
> > - make a new API that explicitly follows EA struct design
> >   (and while read()/write() might be easier[4], I tend to agree with
> >   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> >   for data". Though I wonder if read() could be used for the notifications,
> >   which ARE data, and use ioctl() for the responses?)
> 
> Just as a note: If we use read() there, we'll never be able to
> transfer things like FDs through that API.
> 
Although there is no good reason for read being able to receive FDs, there is
precedence for recvmsg being able to do this. Either way, I do not think
it's a good idea to recv file descriptors, and instead file descriptors
should be fetched via the pidfd_getfd syscall.

Injection is more complicated, and for now, I believe that "writes" should
be done via ioctl, or in the future, something like sendmsg might work.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19  7:24   ` Sargun Dhillon
@ 2020-05-19  8:30     ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2020-05-19  8:30 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Jann Horn, Kees Cook, Tycho Andersen, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, Linux Containers, Linux API, kernel list

On Tue, May 19, 2020 at 07:24:52AM +0000, Sargun Dhillon wrote:
> On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote:
> > > For user_notif, I think we need something in and around these options:
> > >
> > > - make a new API that explicitly follows EA struct design
> > >   (and while read()/write() might be easier[4], I tend to agree with
> > >   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> > >   for data". Though I wonder if read() could be used for the notifications,
> > >   which ARE data, and use ioctl() for the responses?)
> > 
> > Just as a note: If we use read() there, we'll never be able to
> > transfer things like FDs through that API.

(Hm, how did I not get that message? Weird. :))

I hope we won't be able to receive fds through read(). This quickly
becomes quite problematic, I think.

> > 
> Although there is no good reason for read being able to receive FDs, there is
> precedence for recvmsg being able to do this. Either way, I do not think

Right, and recvmsg() is quite dangerous because of that because you need
to be extremely careful when e.g. the message is truncated and you want
to error out and you need to carefully close all fds and other shenanigans.

Also, recvmsg() imho, is a bit different from read simply because
it's sort-of a "typed" read; it's plumbed on top of a message protocol and
that protocal includes the ability to read fds. read() on the other hand
is completely agnostic and doesn't care about the data at all. But
that's just how I always conceptualized it...

> it's a good idea to recv file descriptors, and instead file descriptors
> should be fetched via the pidfd_getfd syscall.

+1

> 
> Injection is more complicated, and for now, I believe that "writes" should
> be done via ioctl, or in the future, something like sendmsg might work.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
                   ` (2 preceding siblings ...)
  2020-05-19  7:09 ` Aleksa Sarai
@ 2020-05-19 10:26 ` Christian Brauner
  2020-05-20  8:23   ` Sargun Dhillon
  2020-05-19 14:07 ` Tycho Andersen
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2020-05-19 10:26 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, Sargun Dhillon, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, containers, linux-api, linux-kernel

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
> So, diving right in:
> 
> 
> ## fd passing
> 
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
> 
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
> 
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
> 	- get pidnr from user_notif_recv
> 	- open pidfd from pidnr
> 	- re-verify user_notif cookie is still valid
> 	- send new fd via pidfd
> 	- reply with user_notif_send
> 	- close pidfd

Yes, that sounds good and should be sufficient for all users.

I don't want to have a range of interfaces being able to create pidfds
as this might make it more difficult when we want to implement new
features on top of pidfds (privilege delegation etc., pidfds for
specific threads).

> 
> - how to deal with changing sizes of the user_notif structures to
>   include a pidnr. (Which will be its own topic below.)
> 
> 
> ## deep argument inspection
> 
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there were a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),

esyscalls? :)

> openat2(), etc, which pass a pointer and a size:
> 
> long clone3(struct clone_args *cl_args, size_t size);
> 
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for

Whatever design go with ultimately should probably also be able to
handle pointers within the struct as that's just something that's going
to happen no matter how much we fight it, i.e.:

foo(struct a, size_t size_a)

struct a {
	int first;
	int second;
	struct *b;
	size_t size_b;
};

will likely be inevitable.

> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier the examination of using eBPF for user_notif
> bares this out).

Agreed. I would like this to minimally disruptive for userspace so
various projects can simply adapt openat2() and clone3() without having
to either opt for a new LSM-type thing or ebpf. Ideally it'll feel very
natural to extend their seccomp filters to handle these esyscalls.

> 
> In order for this to work, there are a number of prerequisites:
> 
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls needs to include awareness of potential seccomp
>     hooking. i.e. seccomp may have done the copy_from_user() already and
>     kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
>     hits these syscalls for a given filter, and put it somewhere for
>     later use by the syscall.
> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to be handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.

We could provide a wrapper similar to SYSCALL_DEFINE*(). So we'd have -
idk - ESYSCALL_DEFINE*(foo, T1, T1_NAME, T2, T2_NAME, <foo_copy_from_user>)
Each such syscall that wants to do
ESYSCALL_DEFINE*() would need to define copy-from-user function for it's
struct argument that seccomp can call to get a full cached copy of the
arguments. So the ESYSCALL_DEFINE*() would register the copy function
with seccomp so seccomp knows what it needs to call. This way seccomp
doesn't need to worry about what the function does or always be careful
to catch up with extensions to these syscalls.

> 
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.

thread_struct or task_struct? The latter feels like it would make more
sense.

> 
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
> 
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
> 
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct with zeros to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)

This sounds like the sanest option and again, the least invasive one to
userspace.
I think this just needs to take care to adhere to the current convention
that we established that passing a larger struct than the kernel/seccomp
knows about is fine, as long as the unknown bits are zeroed. For
example, I currently use clone3() like this in LXC on any kernel:

struct lxc_clone_args {
	__aligned_u64 flags;
	__aligned_u64 pidfd;
	__aligned_u64 child_tid;
	__aligned_u64 parent_tid;
	__aligned_u64 exit_signal;
	__aligned_u64 stack;
	__aligned_u64 stack_size;
	__aligned_u64 tls;
	__aligned_u64 set_tid;
	__aligned_u64 set_tid_size;
	__aligned_u64 cgroup;
};

__returns_twice static inline pid_t lxc_clone3(struct lxc_clone_args *args, size_t size)
{
	/* COW stack, don't even think about passing CLONE_VM. */
	if (flags & (CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID |
		     CLONE_CHILD_CLEARTID | CLONE_SETTLS))
		return -EINVAL;

	return syscall(__NR_clone3, args, size);
}

and there's not even a kernel out with the CLONE_INTO_CGROUP and thus
the .cgroup member in there and things just work fine (TM). I think
seccomp needs to make sure to not reject syscalls with larger but
zeroed structs.

> 
> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.

Also, people who need to filter open*() now will already need to make
use of LSMs and I think for this specific case it's fine to burden them
with this. So I suggest we simply shrug our shoulders and move on.
The other way of putting this is, openat2() would be a full replacement
for open*() as soon as it's flags argument in the struct can be
filtered, I think and so it's not that we're loosing any feature that
currently is supported. The path-based filtering would be the cherry on
top and as you say it's doubtful whether there's much use to it.

> 
> 
> ## changing structure sizes
> 
> Background: there have been regular needs to add things to various
> seccomp structures. Each come with their own unique pains, and solving
> this as completely as possible in a future-proof way would be very nice.
> 
> As noted in "fd passing" above, there is a desire to add some useful
> things to the user_notif struct (e.g. thread group pid). Similarly,
> there have been requests in the past (though I can't find any good
> references right now, just my indirect comments[3]) to grow seccomp_data.
> Combined with the EA struct work above, there's a clear need for seccomp
> to reexamine how it deals with its API structures (and how this
> interacts with filters).
> 
> First, let's consider seccomp_data. If we grow it, the EA struct offset
> will move, based on the deep arg inspection design above. Alternatively,
> we could instead put seccomp_data offset 0, and EA struct at offset
> PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> the filter access whatever it thinks is there, with it being zero-filled
> by the kernel. For any values where 0 is valid, there will just need to
> be a "is that field valid?" bit before it:

That's exactly what clone3() - and openat2() too - I think is doing.
(Case in point, CLONE_INTO_CGROUP with the .cgroup member referencing a
 file descriptor for which 0 is obviously valid.)

> 
> 	unsigned long feature_bits;
> 	unsigned long interesting_thing_1;
> 	unsigned long interesting_thing_2;
> 	unsigned long interesting_thing_3;
> 	...
> 
> and the filter would check feature_bits...
> 
> (However, this needs to be carefully considered given that seccomp_data
> is embedded in user_notif... should the EA struct from userspace also be
> copied into user_notif? More thoughts on this below...)
> 
> For user_notif, I think we need something in and around these options:
> 
> - make a new API that explicitly follows EA struct design
>   (and while read()/write() might be easier[4], I tend to agree with
>   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
>   for data". Though I wonder if read() could be used for the notifications,
>   which ARE data, and use ioctl() for the responses?)

I'm still pretty convinced that my original suggestion here to switch
user_notif over to esyscall design is the way to go and probably the
least complex.

> - make a new API that is perf_event_open()-style where fields are
>   explicitly requested, as Sargun suggested[5]. (This looks like it
>   might be complex to construct, but would get us by far the most
>   extensible API.)
> - jam whatever we pick into the existing API (we'll be forced to do
>   SOMETHING to make the old API still work, so, I dunno what that will
>   look like until we finish the rest of the design).
> 
> If we did a requested-fields approach, what would the user_notif event
> block of bytes look like? Would it be entirely dynamic based on the
> initial ioctl()? Another design consideration here is that we don't want
> the kernel doing tons of work (especially copying) and tossing tons
> of stuff into a huge structure that the user doesn't care about. In
> addition to explicit fields, maybe the EA struct could be included,
> perhaps with specified offset/size, so only the portion the user_notif
> user wanted to inspect was copied?
> 
> The complexity of the per-field API is higher, but I think it might be
> the most robust and have the greatest chance at being performant.
> For example, "send me user_notif but I only care about the pid" would
> mean no syscall arguments are copied, etc.

Before we go with such a flexible design we should ask who's going to
use it that way? At least for our use-cass we always want everything.
Maybe I'm missing other uses though. But most use-cases seem kinda
pointless, e.g. just getting the pid seems not interesting. I could see
a use-case for pid + syscall number without arguments and the notifier
always replying with _CONTINUE but even that seems a little odd to me.

> 
> 
> ## syscall bitmasks
> 
> Background: the number one bit of feedback on seccomp has been performance
> concerns, especially for fast syscalls like futex(). When looking at
> where time is spent, it is very clearly spent running the filters (which
> I found surprising, given that adding TIF_SECCOMP tended to trip the
> "slow path" syscall path (though most architectures these days just
> don't have a fast path any more thanks to Meltdown). It would be nice

Well, ia64 has fsys-syscalls :)
/me ducks

> to make filtering faster without running BPF at all. :)
> 
> Nearly every thread on adding eBPF, for example, has been about trying to
> speed up the if/then nature of BPF for finding a syscall that the filter
> wants to always accept (or reject). The bulk of most seccomp filters
> are pretty simple, usually they're either "reject everything except
> $set-of-syscalls", or "accept everything except $set-of-syscalls". The
> stuff in between tends to be a mix, like "accept $some, process $these
> with argument checks, and reject $remaining".
> 
> In all three cases, the entire seccomp() path could be sped up by having
> a syscall bitmask that could be applied before the filters are ever run,
> with 3 (actually 2) syscall bitmasks: accept, reject, or process. If
> a syscall was in the "accept" bitmask, we immediately exit seccomp and
> continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected
> and immediately exit seccomp. And finally, we run the filters. In all
> ways, doing bitmask math is going to be faster than running the BPF. ;)

That sounds great!

> 
> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
> 
> - new API for "add this syscall to the reject bitmask". We can't really
>   do an "accept" bitmask addition without processing the attached
>   filters...
> - process attached filters! Each time a filter is added, have the
>   BPF verifier do an analysis to determine if there are any static
>   results, and set bits in the various bitmasks to represent it.
>   i.e. when seccomp is first enabled on a thread, the "accept"
>   bitmask is populated with all syscalls, and for each filter, do
>   [math,simulation,magic] and knock each syscall out of "accept" if
>   it isn't always accepted, and further see if there are any syscalls
>   that are always rejected, and mark those in the "reject" bitmask.
> 
> 
> Okay, that's all I had ... what do people think?

Great, in fact, I'd like to strongly encourage you to hand in either a
session for the container's microconference for Virtual-LPC 2020 for
this or a refereed track talk (that's less of a discussion though) for
the combined LPC/Kernel Summit track:

https://www.linuxplumbersconf.org/event/7/abstracts/

:)

(I think we're also going to want to have a talk on extensible syscall
 design. I still want to "standardize" some aspects such as type of flag
 arguments, extensible structures and so on.)

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19  2:48   ` Aleksa Sarai
@ 2020-05-19 10:35     ` Christian Brauner
  2020-05-19 16:18     ` Alexei Starovoitov
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2020-05-19 10:35 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jann Horn, Kees Cook, Tycho Andersen, Sargun Dhillon,
	Matt Denton, Chris Palmer, Jeffrey Vander Stoep,
	Linux Containers, Linux API, kernel list

On Tue, May 19, 2020 at 12:48:46PM +1000, Aleksa Sarai wrote:
> On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > ## deep argument inspection
> > >
> > > Background: seccomp users would like to write filters that traverse
> > > the user pointers passed into many syscalls, but seccomp can't do this
> > > dereference for a variety of reasons (mostly involving race conditions and
> > > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> > 
> > Also, other than for syscall entry, it might be worth thinking about
> > whether we want to have a special hook into seccomp for io_uring.
> > io_uring is growing support for more and more syscalls, including
> > things like openat2, connect, sendmsg, splice and so on, and that list
> > is probably just going to grow in the future. If people start wanting
> > to use io_uring in software with seccomp filters, it might be
> > necessary to come up with some mechanism to prevent io_uring from
> > permitting access to almost everything else...
> > 
> > Probably not a big priority for now, but something to keep in mind for
> > the future.
> 
> Indeed. Quite a few people have raised concerns about io_uring and its
> debug-ability, but I agree that another less-commonly-mentioned concern
> should be how you restrict io_uring(2) from doing operations you've
> disallowed through seccomp. Though obviously user_notif shouldn't be
> allowed. :D

As soon as you switch kernels to an io_uring supported kernel while
maintaing a blacklist without updating all your seccomp filters you're
currently hosed (Yes, blacklists aren't great but they have their
uses.).

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19  7:09 ` Aleksa Sarai
@ 2020-05-19 11:01   ` Christian Brauner
  2020-05-19 13:42     ` Aleksa Sarai
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2020-05-19 11:01 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Kees Cook, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Daniel Vetter, containers,
	linux-api, linux-kernel

On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> > ## fd passing
> > 
> > Background: seccomp users want to be able to install an fd in a
> > monitored process during a user_notif to emulate "open" calls (or
> > similar), possibly across security boundaries, etc.
> > 
> > On the fd passing front, it seems that gaining pidfd_addfd() is the way
> > to go as it allows for generic use not tied to seccomp in particular.
> > I expect this feature will be developed orthogonally to seccomp (where
> > does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> > be friendlier to help with using it. Things that need to be resolved:
> > 
> > - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
> >   if we're going to step back and make some design choices here, is
> >   there a place for pidfds in seccomp user_notif, in order to avoid
> >   needing the user_notif cookie? I think probably not: it's a rather lot
> >   of overhead for notifications. It seems it's safe to perform an fd
> >   installation with these steps:
> > 	- get pidnr from user_notif_recv
> > 	- open pidfd from pidnr
> > 	- re-verify user_notif cookie is still valid
> > 	- send new fd via pidfd
> > 	- reply with user_notif_send
> > 	- close pidfd
> 
> I agree that while it would be conceptually nicer to not need the
> user_notif cookie, creating a new pidfd for every notification isn't a
> good idea (not to mention that the program may not care about the pid or
> pidfd when determining its response -- making the pidfd more of a pain
> than anything else).
> 
> You'd also need to have special-casing based on the userspace length of
> user_notif because old programs wouldn't know they'd have to close the
> pidfd returned (which then brings us back to the fact that user_notif
> doesn't specify what the usize of the struct is a-la
> copy_struct_from_user).
> 
> > ## deep argument inspection
> > 
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> > 
> > During the last plumbers and in conversations since, the grudging
> > consensus was reached that having seccomp do this for ALL syscalls was
> > likely going to be extremely disruptive for very little gain (i.e.
> > many things, like pathnames, have differing lifetimes, aliases, unstable
> > kernel object references, etc[6]), but that there were a small subset of
> > syscalls for which this WOULD be beneficial, and those are the newly
> > created "Extensible Argument" syscalls (is there a better name for this
> > design? I'm calling it "EA" for the rest of the email), like clone3(),
> > openat2(), etc, which pass a pointer and a size:
> 
> I called them "extensible struct" syscalls, but "extensible arguments"
> is probably a better name. Though if you want to call them EA, make sure
> to include the ™. ;)
> 
> > long clone3(struct clone_args *cl_args, size_t size);
> > 
> > I think it should be possible to extend seccomp to examine this structure
> > by appending it to seccomp_data, and allowing filters to examine the
> > contents. This means that no BPF language extensions are required for
> > seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> > seccomp's design principles work well with maps, kernel helpers, etc,
> > and I think the earlier the examination of using eBPF for user_notif
> > bares this out).
> 
> While I agree that we don't need to jump to eBPF for this, I do think
> we'll need an opcode for is_zeroed() (or bpf_check_uarg_tail_zero()) --
> see below.
> 
> > In order for this to work, there are a number of prerequisites:
> > 
> > - argument caching, in two halves: syscall side and seccomp side:
> >   - the EA syscalls needs to include awareness of potential seccomp
> >     hooking. i.e. seccomp may have done the copy_from_user() already and
> >     kept a cached copy.
> >   - seccomp needs to potentially DO the copy_from_user() itself when it
> >     hits these syscalls for a given filter, and put it somewhere for
> >     later use by the syscall.
> > 
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall. I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> I agree that while this is an important thing to handle properly, it's
> internal and probably doesn't deserve the lions share of design
> discussion.
> 
> > - the sizes of these EA structs are, by design, growable over time.
> >   seccomp and its users need to be handle this in a forward and backward
> >   compatible way, similar to the design of the EA syscall interface
> >   itself.
> > 
> > The growing size of the EA struct will need some API design. For filters
> > to operate on the contiguous seccomp_data+EA struct, the filter will
> > need to know how large seccomp_data is (more on this later), and how
> > large the EA struct is. When the filter is written in userspace, it can
> > do the math, point into the expected offsets, and get what it needs. For
> > this to work correctly in the kernel, though, the seccomp BPF verifier
> > needs to know the size of the EA struct as well, so it can correctly
> > perform the offset checking (as it currently does for just the
> > seccomp_data struct size).
> > 
> > Since there is not really any caller-based "seccomp state" associated
> > across seccomp(2) calls, I don't think we can add a new command to tell
> > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > the kernel doesn't track who "I" is besides just being "current", which
> > doesn't take into account the thread lifetime -- if a process launcher
> > knows about one size and the child knows about another, things will get
> > confused. The sizes really are just associated with individual filters,
> > based on the syscalls they're examining. So, I have thoughts on possible
> > solutions:
> > 
> > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> >   EA style so we can pass in more than a filter and include also an
> >   array of syscall to size mappings. (I don't like this...)
> > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> >   the meaning of the uarg from "filter" to a EA-style structure with
> >   sizes and pointers to the filter and an array of syscall to size
> >   mappings. (I like this slightly better, but I still don't like it.)
> > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> >   the "max offset" value seen during filter verification, and zero-fill
> >   the EA struct with zeros to that size when constructing the
> >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> >   filter doesn't care what any of the sizes are, and userspace doesn't
> >   care what any of the sizes are. (I like this as it makes the problems
> >   to solve contained entirely by the seccomp infrastructure and does not
> >   touch user API, but I worry I'm missing some gotcha I haven't
> >   considered.)
> 
> Okay, so here is my view on this. I think that the third option is
> closest to what I'd like to see. Based on Jann's email, I think we're on
> the same page but I'd just like to elaborate it a bit further:
> 
> First of all -- ideally, the backward and forward compatibility that EA
> syscalls give us should be reflected with seccomp filters being
> similarly compatible. Otherwise we're going to run into issues where all
> of the hard work with ensuring EA syscalls behave when extended will be
> less valuable if seccomp cannot handle it sufficiently. This means that
> I would hope that every combination of {old,new} filter/kernel/program
> would work on a best-effort (but fail-safe) basis.
> 
> In my view, the simplest way (from the kernel side) would be to simply
> do what you outlined in (3) -- make all accesses past usize (and even
> ksize) be zeroed.
> 
> However in order to make an old filter fail-safe on a new kernel with a
> new program, we'd need a new opcode which basically does
> bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> would punt the fail-safe problem to userspace (libseccomp would need to
> generate a check that any unknown-to-the-filter fields are zero). I
> don't think this is a decision we can make in-kernel because it might be
> that the filter doesn't care about the last field in a struct (and thus
> doesn't access it) but we don't know the difference between a field the
> filter doesn't care about and a field it doesn't know about.
> 
> Another issue which you haven't mentioned here is how do we deal with
> pointers inside an EA struct. clone3() already has this with set_tid,

It's also passed with a set_tid_size argument so we'd know how large
set_tid is.

> and the thread on user_notif seems to be converging towards having the
> user_notif2 struct just contain two pointers to EA structs. We could
> punt on this for now, so long as we leave room for us to deal with this
> problem in the future. However, I think it would be misguided to enable
> deep argument inspection on syscalls which have such entries (such as
> clone3) until we've figured out how we want to handle nested pointers --

We can't really punt on this and should figure this out now. I've
pointed to this problem in my other mail as well.
Though I currently fail to see why this should be a huge problem to get
that working as long as each pointer comes with a known size.

> at the very least because the bpf_check_uarg_tail_zero() opcode
> semantics need to consider this problem.
> 
> A silly proposal I had was that because the EA would never be bigger
> than PAGE_SIZE, we could take the offset into the EA and multiply it by
> PAGE_SIZE each time we wanted to do a "dereference" of a pointer in EA.

If you have a size to it it should be ok? I'm probably missing
something.

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19 11:01   ` Christian Brauner
@ 2020-05-19 13:42     ` Aleksa Sarai
  2020-05-19 13:53       ` Aleksa Sarai
  0 siblings, 1 reply; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-19 13:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Daniel Vetter, containers,
	linux-api, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7996 bytes --]

On 2020-05-19, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> > > - the sizes of these EA structs are, by design, growable over time.
> > >   seccomp and its users need to be handle this in a forward and backward
> > >   compatible way, similar to the design of the EA syscall interface
> > >   itself.
> > > 
> > > The growing size of the EA struct will need some API design. For filters
> > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > need to know how large seccomp_data is (more on this later), and how
> > > large the EA struct is. When the filter is written in userspace, it can
> > > do the math, point into the expected offsets, and get what it needs. For
> > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > needs to know the size of the EA struct as well, so it can correctly
> > > perform the offset checking (as it currently does for just the
> > > seccomp_data struct size).
> > > 
> > > Since there is not really any caller-based "seccomp state" associated
> > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > the kernel doesn't track who "I" is besides just being "current", which
> > > doesn't take into account the thread lifetime -- if a process launcher
> > > knows about one size and the child knows about another, things will get
> > > confused. The sizes really are just associated with individual filters,
> > > based on the syscalls they're examining. So, I have thoughts on possible
> > > solutions:
> > > 
> > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > >   EA style so we can pass in more than a filter and include also an
> > >   array of syscall to size mappings. (I don't like this...)
> > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > >   the meaning of the uarg from "filter" to a EA-style structure with
> > >   sizes and pointers to the filter and an array of syscall to size
> > >   mappings. (I like this slightly better, but I still don't like it.)
> > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > >   the "max offset" value seen during filter verification, and zero-fill
> > >   the EA struct with zeros to that size when constructing the
> > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > >   care what any of the sizes are. (I like this as it makes the problems
> > >   to solve contained entirely by the seccomp infrastructure and does not
> > >   touch user API, but I worry I'm missing some gotcha I haven't
> > >   considered.)
> > 
> > Okay, so here is my view on this. I think that the third option is
> > closest to what I'd like to see. Based on Jann's email, I think we're on
> > the same page but I'd just like to elaborate it a bit further:
> > 
> > First of all -- ideally, the backward and forward compatibility that EA
> > syscalls give us should be reflected with seccomp filters being
> > similarly compatible. Otherwise we're going to run into issues where all
> > of the hard work with ensuring EA syscalls behave when extended will be
> > less valuable if seccomp cannot handle it sufficiently. This means that
> > I would hope that every combination of {old,new} filter/kernel/program
> > would work on a best-effort (but fail-safe) basis.
> > 
> > In my view, the simplest way (from the kernel side) would be to simply
> > do what you outlined in (3) -- make all accesses past usize (and even
> > ksize) be zeroed.
> > 
> > However in order to make an old filter fail-safe on a new kernel with a
> > new program, we'd need a new opcode which basically does
> > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > would punt the fail-safe problem to userspace (libseccomp would need to
> > generate a check that any unknown-to-the-filter fields are zero). I
> > don't think this is a decision we can make in-kernel because it might be
> > that the filter doesn't care about the last field in a struct (and thus
> > doesn't access it) but we don't know the difference between a field the
> > filter doesn't care about and a field it doesn't know about.
> > 
> > Another issue which you haven't mentioned here is how do we deal with
> > pointers inside an EA struct. clone3() already has this with set_tid,
> 
> It's also passed with a set_tid_size argument so we'd know how large
> set_tid is.

Right, of course -- that's a given. See below for the point I was
making.

> > and the thread on user_notif seems to be converging towards having the
> > user_notif2 struct just contain two pointers to EA structs. We could
> > punt on this for now, so long as we leave room for us to deal with this
> > problem in the future. However, I think it would be misguided to enable
> > deep argument inspection on syscalls which have such entries (such as
> > clone3) until we've figured out how we want to handle nested pointers --
> 
> We can't really punt on this and should figure this out now. I've
> pointed to this problem in my other mail as well.
> Though I currently fail to see why this should be a huge problem to get
> that working as long as each pointer comes with a known size.

It's not a huge problem necessarily, but we would need to figure out how
we would represent the "nested pointers" in the flattened version which
the seccomp filter will actually offset into. Specifically, my point is
that if we imagine that the proposed new seccomp_data API is:

  [<seccomp_data>][<EA struct contents>]

then any pointers in the EA struct will just be numerical values when
copied. How would write a filter on clone3() that requires the first
entry of set_tid to be 34? You couldn't with this simplified setup. And
unless I'm mistaken, BPF doesn't have pointer dereferences.

So we need to embed the "nested pointers" somehow in the flattened
version of the EA struct. I don't think we can just append them to the
main EA struct -- while you might have the length, that'd mean that
seccomp would generate an array of zeroes whose length is a
user-specified value? And then what about extensible structs that have
their size embedded inside them (as you or someone else suggested for
seccomp user_notif)? How would the eBPF program know the length of the
struct?

Maybe we could have some kind of jump table which the filter could use
(and just have there be a jump table for each embedded struct -- so
multiple embeddings have their own jump tables). So if we imagine some
user-extensible struct like

	struct open_how {
		u64 flags, mode, resolve;
		struct foo *foo;
		struct bar *bar;
		struct baz *baz;
	};

	struct foo {
		u64 foo;
	};

	struct bar {
		u64 bar;
		struct barbaz *barbaz;
	};

	struct barbaz {
		u64 barbaz;
	};

	struct baz {
		u64 baz;
	};

Then we would lay out the seccomp_data as:

	[<seccomp data>][<jump table><open_how>]
		[<jump table><open_how->foo>]
		[<jump table><open_how->bar>]
			[<jump table><open_how->bar->barbaz>]
		[<jump table><open_how->baz>]

The jump table could be as simple as a list of tuples (relative offset
of field in this struct from end of jump table, relative offset of the
copied structure at the end of the jump table). However, it's not
possible to do for-loops in cBPF -- so maybe we'd need to represent the
jump table differently (perhaps having it just be the same size as the
struct).

The above is just a rough sketch, maybe there's a much simpler solution
I'm not seeing.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19 13:42     ` Aleksa Sarai
@ 2020-05-19 13:53       ` Aleksa Sarai
  0 siblings, 0 replies; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-19 13:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Kees Cook, Chris Palmer, Jeffrey Vander Stoep,
	containers, linux-kernel, Matt Denton, Daniel Vetter, linux-api

[-- Attachment #1: Type: text/plain, Size: 8497 bytes --]

On 2020-05-19, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-05-19, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > > On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> > > > - the sizes of these EA structs are, by design, growable over time.
> > > >   seccomp and its users need to be handle this in a forward and backward
> > > >   compatible way, similar to the design of the EA syscall interface
> > > >   itself.
> > > > 
> > > > The growing size of the EA struct will need some API design. For filters
> > > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > > need to know how large seccomp_data is (more on this later), and how
> > > > large the EA struct is. When the filter is written in userspace, it can
> > > > do the math, point into the expected offsets, and get what it needs. For
> > > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > > needs to know the size of the EA struct as well, so it can correctly
> > > > perform the offset checking (as it currently does for just the
> > > > seccomp_data struct size).
> > > > 
> > > > Since there is not really any caller-based "seccomp state" associated
> > > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > > the kernel doesn't track who "I" is besides just being "current", which
> > > > doesn't take into account the thread lifetime -- if a process launcher
> > > > knows about one size and the child knows about another, things will get
> > > > confused. The sizes really are just associated with individual filters,
> > > > based on the syscalls they're examining. So, I have thoughts on possible
> > > > solutions:
> > > > 
> > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > > >   EA style so we can pass in more than a filter and include also an
> > > >   array of syscall to size mappings. (I don't like this...)
> > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > > >   the meaning of the uarg from "filter" to a EA-style structure with
> > > >   sizes and pointers to the filter and an array of syscall to size
> > > >   mappings. (I like this slightly better, but I still don't like it.)
> > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > > >   the "max offset" value seen during filter verification, and zero-fill
> > > >   the EA struct with zeros to that size when constructing the
> > > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > > >   care what any of the sizes are. (I like this as it makes the problems
> > > >   to solve contained entirely by the seccomp infrastructure and does not
> > > >   touch user API, but I worry I'm missing some gotcha I haven't
> > > >   considered.)
> > > 
> > > Okay, so here is my view on this. I think that the third option is
> > > closest to what I'd like to see. Based on Jann's email, I think we're on
> > > the same page but I'd just like to elaborate it a bit further:
> > > 
> > > First of all -- ideally, the backward and forward compatibility that EA
> > > syscalls give us should be reflected with seccomp filters being
> > > similarly compatible. Otherwise we're going to run into issues where all
> > > of the hard work with ensuring EA syscalls behave when extended will be
> > > less valuable if seccomp cannot handle it sufficiently. This means that
> > > I would hope that every combination of {old,new} filter/kernel/program
> > > would work on a best-effort (but fail-safe) basis.
> > > 
> > > In my view, the simplest way (from the kernel side) would be to simply
> > > do what you outlined in (3) -- make all accesses past usize (and even
> > > ksize) be zeroed.
> > > 
> > > However in order to make an old filter fail-safe on a new kernel with a
> > > new program, we'd need a new opcode which basically does
> > > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > > would punt the fail-safe problem to userspace (libseccomp would need to
> > > generate a check that any unknown-to-the-filter fields are zero). I
> > > don't think this is a decision we can make in-kernel because it might be
> > > that the filter doesn't care about the last field in a struct (and thus
> > > doesn't access it) but we don't know the difference between a field the
> > > filter doesn't care about and a field it doesn't know about.
> > > 
> > > Another issue which you haven't mentioned here is how do we deal with
> > > pointers inside an EA struct. clone3() already has this with set_tid,
> > 
> > It's also passed with a set_tid_size argument so we'd know how large
> > set_tid is.
> 
> Right, of course -- that's a given. See below for the point I was
> making.
> 
> > > and the thread on user_notif seems to be converging towards having the
> > > user_notif2 struct just contain two pointers to EA structs. We could
> > > punt on this for now, so long as we leave room for us to deal with this
> > > problem in the future. However, I think it would be misguided to enable
> > > deep argument inspection on syscalls which have such entries (such as
> > > clone3) until we've figured out how we want to handle nested pointers --
> > 
> > We can't really punt on this and should figure this out now. I've
> > pointed to this problem in my other mail as well.
> > Though I currently fail to see why this should be a huge problem to get
> > that working as long as each pointer comes with a known size.
> 
> It's not a huge problem necessarily, but we would need to figure out how
> we would represent the "nested pointers" in the flattened version which
> the seccomp filter will actually offset into. Specifically, my point is
> that if we imagine that the proposed new seccomp_data API is:
> 
>   [<seccomp_data>][<EA struct contents>]
> 
> then any pointers in the EA struct will just be numerical values when
> copied. How would write a filter on clone3() that requires the first
> entry of set_tid to be 34? You couldn't with this simplified setup. And
> unless I'm mistaken, BPF doesn't have pointer dereferences.
> 
> So we need to embed the "nested pointers" somehow in the flattened
> version of the EA struct. I don't think we can just append them to the
> main EA struct -- while you might have the length, that'd mean that
> seccomp would generate an array of zeroes whose length is a
> user-specified value? And then what about extensible structs that have
> their size embedded inside them (as you or someone else suggested for
> seccomp user_notif)? How would the eBPF program know the length of the
> struct?
> 
> Maybe we could have some kind of jump table which the filter could use
> (and just have there be a jump table for each embedded struct -- so
> multiple embeddings have their own jump tables). So if we imagine some
> user-extensible struct like
> 
> 	struct open_how {
> 		u64 flags, mode, resolve;
> 		struct foo *foo;
> 		struct bar *bar;
> 		struct baz *baz;
> 	};
> 
> 	struct foo {
> 		u64 foo;
> 	};
> 
> 	struct bar {
> 		u64 bar;
> 		struct barbaz *barbaz;
> 	};
> 
> 	struct barbaz {
> 		u64 barbaz;
> 	};
> 
> 	struct baz {
> 		u64 baz;
> 	};

Sorry, I omitted all of the _size fields for extensibility. Just imagine
I added them (or that the u64s represent the size).

> Then we would lay out the seccomp_data as:
> 
> 	[<seccomp data>][<jump table><open_how>]
> 		[<jump table><open_how->foo>]
> 		[<jump table><open_how->bar>]
> 			[<jump table><open_how->bar->barbaz>]
> 		[<jump table><open_how->baz>]
> 
> The jump table could be as simple as a list of tuples (relative offset
> of field in this struct from end of jump table, relative offset of the
> copied structure at the end of the jump table). However, it's not
> possible to do for-loops in cBPF -- so maybe we'd need to represent the
> jump table differently (perhaps having it just be the same size as the
> struct).
> 
> The above is just a rough sketch, maybe there's a much simpler solution
> I'm not seeing.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
                   ` (3 preceding siblings ...)
  2020-05-19 10:26 ` Christian Brauner
@ 2020-05-19 14:07 ` Tycho Andersen
  2020-05-20  9:05 ` Sargun Dhillon
  2020-05-22 20:09 ` Sargun Dhillon
  6 siblings, 0 replies; 22+ messages in thread
From: Tycho Andersen @ 2020-05-19 14:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Sargun Dhillon, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, containers, linux-api, linux-kernel

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
> So, diving right in:
> 
> 
> ## fd passing
> 
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
> 
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
> 
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
> 	- get pidnr from user_notif_recv
> 	- open pidfd from pidnr
> 	- re-verify user_notif cookie is still valid
> 	- send new fd via pidfd
> 	- reply with user_notif_send
> 	- close pidfd

Yep, this looks safe.

> - how to deal with changing sizes of the user_notif structures to
>   include a pidnr. (Which will be its own topic below.)
> 
> 
> ## deep argument inspection
> 
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there were a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),
> openat2(), etc, which pass a pointer and a size:
> 
> long clone3(struct clone_args *cl_args, size_t size);
> 
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for
> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier the examination of using eBPF for user_notif
> bares this out).
> 
> In order for this to work, there are a number of prerequisites:
> 
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls needs to include awareness of potential seccomp
>     hooking. i.e. seccomp may have done the copy_from_user() already and
>     kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
>     hits these syscalls for a given filter, and put it somewhere for
>     later use by the syscall.
> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to be handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.
> 
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.
> 
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
> 
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
> 
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct with zeros to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)

This sounds pretty nice to me :)

> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.
> 
> 
> ## changing structure sizes
> 
> Background: there have been regular needs to add things to various
> seccomp structures. Each come with their own unique pains, and solving
> this as completely as possible in a future-proof way would be very nice.
> 
> As noted in "fd passing" above, there is a desire to add some useful
> things to the user_notif struct (e.g. thread group pid). Similarly,
> there have been requests in the past (though I can't find any good
> references right now, just my indirect comments[3]) to grow seccomp_data.
> Combined with the EA struct work above, there's a clear need for seccomp
> to reexamine how it deals with its API structures (and how this
> interacts with filters).
> 
> First, let's consider seccomp_data. If we grow it, the EA struct offset
> will move, based on the deep arg inspection design above. Alternatively,
> we could instead put seccomp_data offset 0, and EA struct at offset
> PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> the filter access whatever it thinks is there, with it being zero-filled
> by the kernel. For any values where 0 is valid, there will just need to
> be a "is that field valid?" bit before it:
> 
> 	unsigned long feature_bits;
> 	unsigned long interesting_thing_1;
> 	unsigned long interesting_thing_2;
> 	unsigned long interesting_thing_3;
> 	...
> 
> and the filter would check feature_bits...
> 
> (However, this needs to be carefully considered given that seccomp_data
> is embedded in user_notif... should the EA struct from userspace also be
> copied into user_notif? More thoughts on this below...)
> 
> For user_notif, I think we need something in and around these options:
> 
> - make a new API that explicitly follows EA struct design
>   (and while read()/write() might be easier[4], I tend to agree with
>   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
>   for data". Though I wonder if read() could be used for the notifications,
>   which ARE data, and use ioctl() for the responses?)
> - make a new API that is perf_event_open()-style where fields are
>   explicitly requested, as Sargun suggested[5]. (This looks like it
>   might be complex to construct, but would get us by far the most
>   extensible API.)

This already exists, and was my original intent with the current API.
You can add new ioctl()s:

struct seccomp_interesting_stuff {
    __u16 size;
    __u64 id;
    __u64 interesting_thing_1;
    __u64 interesting_thing_2;
};
ioctl(listener, SECCOMP_IOCTL_GET_INTERESTING_STUFF, &buf);

The original intent was that you could either grow one of these
structures, or add a new one. It's not clear to me really *why* we
need a whole new API where everything fits into one structure.

Or, we could build a requested fields approach that you discuss on
top. But I don't see a reason to add a whole new RECV2.

> - jam whatever we pick into the existing API (we'll be forced to do
>   SOMETHING to make the old API still work, so, I dunno what that will
>   look like until we finish the rest of the design).
> 
> If we did a requested-fields approach, what would the user_notif event
> block of bytes look like? Would it be entirely dynamic based on the
> initial ioctl()? Another design consideration here is that we don't want
> the kernel doing tons of work (especially copying) and tossing tons
> of stuff into a huge structure that the user doesn't care about. In
> addition to explicit fields, maybe the EA struct could be included,
> perhaps with specified offset/size, so only the portion the user_notif
> user wanted to inspect was copied?
> 
> The complexity of the per-field API is higher, but I think it might be
> the most robust and have the greatest chance at being performant.
> For example, "send me user_notif but I only care about the pid" would
> mean no syscall arguments are copied, etc.
> 
> 
> ## syscall bitmasks
> 
> Background: the number one bit of feedback on seccomp has been performance
> concerns, especially for fast syscalls like futex(). When looking at
> where time is spent, it is very clearly spent running the filters (which
> I found surprising, given that adding TIF_SECCOMP tended to trip the
> "slow path" syscall path (though most architectures these days just
> don't have a fast path any more thanks to Meltdown). It would be nice
> to make filtering faster without running BPF at all. :)
> 
> Nearly every thread on adding eBPF, for example, has been about trying to
> speed up the if/then nature of BPF for finding a syscall that the filter
> wants to always accept (or reject). The bulk of most seccomp filters
> are pretty simple, usually they're either "reject everything except
> $set-of-syscalls", or "accept everything except $set-of-syscalls". The
> stuff in between tends to be a mix, like "accept $some, process $these
> with argument checks, and reject $remaining".
> 
> In all three cases, the entire seccomp() path could be sped up by having
> a syscall bitmask that could be applied before the filters are ever run,
> with 3 (actually 2) syscall bitmasks: accept, reject, or process. If
> a syscall was in the "accept" bitmask, we immediately exit seccomp and
> continue. Otherwise, if it's in the "reject" bitmask, we mark it rejected
> and immediately exit seccomp. And finally, we run the filters. In all
> ways, doing bitmask math is going to be faster than running the BPF. ;)
> 
> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
> 
> - new API for "add this syscall to the reject bitmask". We can't really
>   do an "accept" bitmask addition without processing the attached
>   filters...
> - process attached filters! Each time a filter is added, have the
>   BPF verifier do an analysis to determine if there are any static
>   results, and set bits in the various bitmasks to represent it.
>   i.e. when seccomp is first enabled on a thread, the "accept"
>   bitmask is populated with all syscalls, and for each filter, do
>   [math,simulation,magic] and knock each syscall out of "accept" if

This sounds like a recipe for bugs :)

>   it isn't always accepted, and further see if there are any syscalls
>   that are always rejected, and mark those in the "reject" bitmask.

What if instead you have a reject bitmask which is always fine to add
to, and an accept bitmask which is only valid if there are no actual
filters attached. Is that enough for people?

Tycho

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19  2:48   ` Aleksa Sarai
  2020-05-19 10:35     ` Christian Brauner
@ 2020-05-19 16:18     ` Alexei Starovoitov
  2020-05-19 21:40       ` Kees Cook
  2020-05-20  1:20       ` Aleksa Sarai
  1 sibling, 2 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2020-05-19 16:18 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jann Horn, Kees Cook, Christian Brauner, Tycho Andersen,
	Sargun Dhillon, Matt Denton, Chris Palmer, Jeffrey Vander Stoep,
	Linux Containers, Linux API, kernel list, Daniel Borkmann, bpf

On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > ## deep argument inspection
> > >
> > > Background: seccomp users would like to write filters that traverse
> > > the user pointers passed into many syscalls, but seccomp can't do this
> > > dereference for a variety of reasons (mostly involving race conditions and
> > > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> >
> > Also, other than for syscall entry, it might be worth thinking about
> > whether we want to have a special hook into seccomp for io_uring.
> > io_uring is growing support for more and more syscalls, including
> > things like openat2, connect, sendmsg, splice and so on, and that list
> > is probably just going to grow in the future. If people start wanting
> > to use io_uring in software with seccomp filters, it might be
> > necessary to come up with some mechanism to prevent io_uring from
> > permitting access to almost everything else...
> >
> > Probably not a big priority for now, but something to keep in mind for
> > the future.
>
> Indeed. Quite a few people have raised concerns about io_uring and its
> debug-ability, but I agree that another less-commonly-mentioned concern
> should be how you restrict io_uring(2) from doing operations you've
> disallowed through seccomp. Though obviously user_notif shouldn't be
> allowed. :D
>
> > > The argument caching bit is, I think, rather mechanical in nature since
> > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > allocates seccomp_data (maybe going so far as to have it split across two
> > > pages with the syscall argument struct always starting on the 2nd page
> > > boundary), and copying the EA struct into that page, which will be both
> > > used by the filter and by the syscall.
> >
> > We could also do the same kind of thing the eBPF verifier does in
> > convert_ctx_accesses(), and rewrite the context accesses to actually
> > go through two different pointers depending on the (constant) offset
> > into seccomp_data.
>
> My main worry with this is that we'll need to figure out what kind of
> offset mathematics are necessary to deal with pointers inside the
> extensible struct. As a very ugly proposal, you could make it so that
> you multiply the offset by PAGE_SIZE each time you want to dereference
> the pointer at that offset (unless we want to add new opcodes to cBPF to
> allow us to represent this).

Please don't. cbpf is frozen.

>
> This might even be needed for seccomp user_notif -- given one of the
> recent proposals was basically to just add two (extensible) struct
> pointers inside the main user_notif struct.
>
> > > I imagine state tracking ("is
> > > there a cached EA?", "what is the address of seccomp_data?", "what is
> > > the address of the EA?") can be associated with the thread struct.
> >
> > You probably mean the task struct?
> >
> > > The growing size of the EA struct will need some API design. For filters
> > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > need to know how large seccomp_data is (more on this later), and how
> > > large the EA struct is. When the filter is written in userspace, it can
> > > do the math, point into the expected offsets, and get what it needs. For
> > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > needs to know the size of the EA struct as well, so it can correctly
> > > perform the offset checking (as it currently does for just the
> > > seccomp_data struct size).
> > >
> > > Since there is not really any caller-based "seccomp state" associated
> > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > the kernel doesn't track who "I" is besides just being "current", which
> > > doesn't take into account the thread lifetime -- if a process launcher
> > > knows about one size and the child knows about another, things will get
> > > confused. The sizes really are just associated with individual filters,
> > > based on the syscalls they're examining. So, I have thoughts on possible
> > > solutions:
> > >
> > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > >   EA style so we can pass in more than a filter and include also an
> > >   array of syscall to size mappings. (I don't like this...)
> > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > >   the meaning of the uarg from "filter" to a EA-style structure with
> > >   sizes and pointers to the filter and an array of syscall to size
> > >   mappings. (I like this slightly better, but I still don't like it.)
> > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > >   the "max offset" value seen during filter verification, and zero-fill
> > >   the EA struct with zeros to that size when constructing the
> > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > >   care what any of the sizes are. (I like this as it makes the problems
> > >   to solve contained entirely by the seccomp infrastructure and does not
> > >   touch user API, but I worry I'm missing some gotcha I haven't
> > >   considered.)
> >
> > We don't need to actually zero-fill memory for this beyond what the
> > kernel supports - AFAIK the existing APIs already say that passing a
> > short length is equivalent to passing zeroes, so we can just replace
> > all out-of-bounds loads with zeroing registers in the filter.
> > The tricky case is what should happen if the userspace program passes
> > in fields that the filter doesn't know about. The filter can see the
> > length field passed in by userspace, and then just reject anything
> > where the length field is bigger than the structure size the filter
> > knows about. But maybe it'd be slightly nicer if there was an
> > operation for "tell me whether everything starting at offset X is
> > zeroes", so that if someone compiles with newer kernel headers where
> > the struct is bigger, and leaves the new fields zeroed, the syscalls
> > still go through an outdated filter properly.
>
> I think the best way of handling this (without breaking programs
> senselessly) is to have filters essentially emulate
> copy_struct_from_user() semantics -- which is along the lines of what
> you've suggested.

and cpbf load instruction will become copy_from_user() underneath?
I don't see how that can work.
Have you considered implications to jits, register usage, etc ?

ebpf will become sleepable soon. It will be able to do copy_from_user()
and examine any level of user pointer dereference.
toctou is still going to be a concern though,
but such ebpf+copy_from_user analysis and syscall sandboxing
will not need to change kernel code base around syscalls at all.
No need to invent E-syscalls and all the rest I've seen in this thread.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19 16:18     ` Alexei Starovoitov
@ 2020-05-19 21:40       ` Kees Cook
  2020-05-20  1:20       ` Aleksa Sarai
  1 sibling, 0 replies; 22+ messages in thread
From: Kees Cook @ 2020-05-19 21:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aleksa Sarai, Jann Horn, Christian Brauner, Tycho Andersen,
	Sargun Dhillon, Matt Denton, Chris Palmer, Jeffrey Vander Stoep,
	Linux Containers, Linux API, kernel list, Daniel Borkmann, bpf

On Tue, May 19, 2020 at 09:18:47AM -0700, Alexei Starovoitov wrote:
> On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > > ## deep argument inspection
> > > >
> > > > The argument caching bit is, I think, rather mechanical in nature since
> > > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > > allocates seccomp_data (maybe going so far as to have it split across two
> > > > pages with the syscall argument struct always starting on the 2nd page
> > > > boundary), and copying the EA struct into that page, which will be both
> > > > used by the filter and by the syscall.
> > >
> > > We could also do the same kind of thing the eBPF verifier does in
> > > convert_ctx_accesses(), and rewrite the context accesses to actually
> > > go through two different pointers depending on the (constant) offset
> > > into seccomp_data.
> >
> > My main worry with this is that we'll need to figure out what kind of
> > offset mathematics are necessary to deal with pointers inside the
> > extensible struct. As a very ugly proposal, you could make it so that
> > you multiply the offset by PAGE_SIZE each time you want to dereference
> > the pointer at that offset (unless we want to add new opcodes to cBPF to
> > allow us to represent this).
> 
> Please don't. cbpf is frozen.

https://www.youtube.com/watch?v=L0MK7qz13bU

If the only workable design paths for deep arg inspection end up needing
BPF helpers, I would agree that it's time for seccomp to grow eBPF
language support. I'm still hoping there's a clean solution that doesn't
require a seccomp language extension.

> > > We don't need to actually zero-fill memory for this beyond what the
> > > kernel supports - AFAIK the existing APIs already say that passing a
> > > short length is equivalent to passing zeroes, so we can just replace
> > > all out-of-bounds loads with zeroing registers in the filter.
> > > The tricky case is what should happen if the userspace program passes
> > > in fields that the filter doesn't know about. The filter can see the
> > > length field passed in by userspace, and then just reject anything
> > > where the length field is bigger than the structure size the filter
> > > knows about. But maybe it'd be slightly nicer if there was an
> > > operation for "tell me whether everything starting at offset X is
> > > zeroes", so that if someone compiles with newer kernel headers where
> > > the struct is bigger, and leaves the new fields zeroed, the syscalls
> > > still go through an outdated filter properly.
> >
> > I think the best way of handling this (without breaking programs
> > senselessly) is to have filters essentially emulate
> > copy_struct_from_user() semantics -- which is along the lines of what
> > you've suggested.
> 
> and cpbf load instruction will become copy_from_user() underneath?

No, this was meaning internal checking about struct sizes needs to exist
(not the user copy parts).

> I don't see how that can work.
> Have you considered implications to jits, register usage, etc ?
> 
> ebpf will become sleepable soon. It will be able to do copy_from_user()
> and examine any level of user pointer dereference.
> toctou is still going to be a concern though,
> but such ebpf+copy_from_user analysis and syscall sandboxing
> will not need to change kernel code base around syscalls at all.
> No need to invent E-syscalls and all the rest I've seen in this thread.

To avoid the ToCToU, the seccomp infrastructure must do the
copy_from_user(), so there's not need for the sleepable stuff in seccomp
that I can see. The question is mainly one of flattening.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19 16:18     ` Alexei Starovoitov
  2020-05-19 21:40       ` Kees Cook
@ 2020-05-20  1:20       ` Aleksa Sarai
  2020-05-20  5:17         ` Alexei Starovoitov
  1 sibling, 1 reply; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-20  1:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aleksa Sarai, Daniel Borkmann, Kees Cook, Chris Palmer,
	Jann Horn, Jeffrey Vander Stoep, Linux Containers, kernel list,
	Matt Denton, Linux API, Christian Brauner, bpf

[-- Attachment #1: Type: text/plain, Size: 8756 bytes --]

On 2020-05-19, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > > ## deep argument inspection
> > > >
> > > > Background: seccomp users would like to write filters that traverse
> > > > the user pointers passed into many syscalls, but seccomp can't do this
> > > > dereference for a variety of reasons (mostly involving race conditions and
> > > > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> > >
> > > Also, other than for syscall entry, it might be worth thinking about
> > > whether we want to have a special hook into seccomp for io_uring.
> > > io_uring is growing support for more and more syscalls, including
> > > things like openat2, connect, sendmsg, splice and so on, and that list
> > > is probably just going to grow in the future. If people start wanting
> > > to use io_uring in software with seccomp filters, it might be
> > > necessary to come up with some mechanism to prevent io_uring from
> > > permitting access to almost everything else...
> > >
> > > Probably not a big priority for now, but something to keep in mind for
> > > the future.
> >
> > Indeed. Quite a few people have raised concerns about io_uring and its
> > debug-ability, but I agree that another less-commonly-mentioned concern
> > should be how you restrict io_uring(2) from doing operations you've
> > disallowed through seccomp. Though obviously user_notif shouldn't be
> > allowed. :D
> >
> > > > The argument caching bit is, I think, rather mechanical in nature since
> > > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > > allocates seccomp_data (maybe going so far as to have it split across two
> > > > pages with the syscall argument struct always starting on the 2nd page
> > > > boundary), and copying the EA struct into that page, which will be both
> > > > used by the filter and by the syscall.
> > >
> > > We could also do the same kind of thing the eBPF verifier does in
> > > convert_ctx_accesses(), and rewrite the context accesses to actually
> > > go through two different pointers depending on the (constant) offset
> > > into seccomp_data.
> >
> > My main worry with this is that we'll need to figure out what kind of
> > offset mathematics are necessary to deal with pointers inside the
> > extensible struct. As a very ugly proposal, you could make it so that
> > you multiply the offset by PAGE_SIZE each time you want to dereference
> > the pointer at that offset (unless we want to add new opcodes to cBPF to
> > allow us to represent this).
> 
> Please don't. cbpf is frozen.

I have an alternative proposal in another mail[1].

> > This might even be needed for seccomp user_notif -- given one of the
> > recent proposals was basically to just add two (extensible) struct
> > pointers inside the main user_notif struct.
> >
> > > > I imagine state tracking ("is
> > > > there a cached EA?", "what is the address of seccomp_data?", "what is
> > > > the address of the EA?") can be associated with the thread struct.
> > >
> > > You probably mean the task struct?
> > >
> > > > The growing size of the EA struct will need some API design. For filters
> > > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > > need to know how large seccomp_data is (more on this later), and how
> > > > large the EA struct is. When the filter is written in userspace, it can
> > > > do the math, point into the expected offsets, and get what it needs. For
> > > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > > needs to know the size of the EA struct as well, so it can correctly
> > > > perform the offset checking (as it currently does for just the
> > > > seccomp_data struct size).
> > > >
> > > > Since there is not really any caller-based "seccomp state" associated
> > > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > > the kernel doesn't track who "I" is besides just being "current", which
> > > > doesn't take into account the thread lifetime -- if a process launcher
> > > > knows about one size and the child knows about another, things will get
> > > > confused. The sizes really are just associated with individual filters,
> > > > based on the syscalls they're examining. So, I have thoughts on possible
> > > > solutions:
> > > >
> > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > > >   EA style so we can pass in more than a filter and include also an
> > > >   array of syscall to size mappings. (I don't like this...)
> > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > > >   the meaning of the uarg from "filter" to a EA-style structure with
> > > >   sizes and pointers to the filter and an array of syscall to size
> > > >   mappings. (I like this slightly better, but I still don't like it.)
> > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > > >   the "max offset" value seen during filter verification, and zero-fill
> > > >   the EA struct with zeros to that size when constructing the
> > > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > > >   care what any of the sizes are. (I like this as it makes the problems
> > > >   to solve contained entirely by the seccomp infrastructure and does not
> > > >   touch user API, but I worry I'm missing some gotcha I haven't
> > > >   considered.)
> > >
> > > We don't need to actually zero-fill memory for this beyond what the
> > > kernel supports - AFAIK the existing APIs already say that passing a
> > > short length is equivalent to passing zeroes, so we can just replace
> > > all out-of-bounds loads with zeroing registers in the filter.
> > > The tricky case is what should happen if the userspace program passes
> > > in fields that the filter doesn't know about. The filter can see the
> > > length field passed in by userspace, and then just reject anything
> > > where the length field is bigger than the structure size the filter
> > > knows about. But maybe it'd be slightly nicer if there was an
> > > operation for "tell me whether everything starting at offset X is
> > > zeroes", so that if someone compiles with newer kernel headers where
> > > the struct is bigger, and leaves the new fields zeroed, the syscalls
> > > still go through an outdated filter properly.
> >
> > I think the best way of handling this (without breaking programs
> > senselessly) is to have filters essentially emulate
> > copy_struct_from_user() semantics -- which is along the lines of what
> > you've suggested.
> 
> and cpbf load instruction will become copy_from_user() underneath? I
> don't see how that can work. Have you considered implications to jits,
> register usage, etc ?
> 
> ebpf will become sleepable soon. It will be able to do copy_from_user()
> and examine any level of user pointer dereference.
> toctou is still going to be a concern though,
> but such ebpf+copy_from_user analysis and syscall sandboxing
> will not need to change kernel code base around syscalls at all.
> No need to invent E-syscalls and all the rest I've seen in this thread.

No it won't become copy_from_user(), nor will there be a TOCTOU race.

The idea is that seccomp will proactively copy the struct (and
recursively any of the struct pointers inside) before the syscall runs
-- as this is done by seccomp it doesn't require any copy_from_user()
primitives in cBPF. We then run the cBPF filter on the copied struct,
just like how cBPF programs currently operate on seccomp_data (how this
would be exposed to the cBPF program as part of the seccomp ABI is the
topic of discussion here).

Then, when the actual syscall code runs, the struct will have already
been copied and the syscall won't copy it again.

eBPF being able to do copy_from_user() isn't really relevant to this
discussion because we need seccomp to be sure that the structure being
filtered by the program is the same structure that the syscall ends up
using. As you mentioned, there's a fundamental TOCTOU there.

[1]: https://lore.kernel.org/lkml/20200519134244.37bhucyram4n6sjk@yavin.dot.cyphar.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-20  1:20       ` Aleksa Sarai
@ 2020-05-20  5:17         ` Alexei Starovoitov
  2020-05-20  6:18           ` Aleksa Sarai
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2020-05-20  5:17 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Aleksa Sarai, Daniel Borkmann, Kees Cook, Chris Palmer,
	Jann Horn, Jeffrey Vander Stoep, Linux Containers, kernel list,
	Matt Denton, Linux API, Christian Brauner, bpf

On Wed, May 20, 2020 at 11:20:45AM +1000, Aleksa Sarai wrote:
> 
> No it won't become copy_from_user(), nor will there be a TOCTOU race.
> 
> The idea is that seccomp will proactively copy the struct (and
> recursively any of the struct pointers inside) before the syscall runs
> -- as this is done by seccomp it doesn't require any copy_from_user()
> primitives in cBPF. We then run the cBPF filter on the copied struct,
> just like how cBPF programs currently operate on seccomp_data (how this
> would be exposed to the cBPF program as part of the seccomp ABI is the
> topic of discussion here).
> 
> Then, when the actual syscall code runs, the struct will have already
> been copied and the syscall won't copy it again.

Let's take bpf syscall as an example.
Are you suggesting that all of syscall logic of conditionally parsing
the arguments will be copy-pasted into seccomp-syscall infra, then
it will do copy_from_user() all the data and replace all aligned_u64
in "union bpf_attr" with kernel copied pointers instead of user pointers
and make all of bpf syscall's copy_from_user() actions to be conditional ?
If seccomp is on, use kernel pointers... if seccomp is off, do copy_from_user ?
And the same idea will be replicated for all syscalls?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-20  5:17         ` Alexei Starovoitov
@ 2020-05-20  6:18           ` Aleksa Sarai
  0 siblings, 0 replies; 22+ messages in thread
From: Aleksa Sarai @ 2020-05-20  6:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Aleksa Sarai, Daniel Borkmann, Kees Cook, Chris Palmer,
	Jann Horn, Jeffrey Vander Stoep, Linux Containers, kernel list,
	Matt Denton, Linux API, Christian Brauner, bpf

[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]

On 2020-05-19, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Wed, May 20, 2020 at 11:20:45AM +1000, Aleksa Sarai wrote:
> > No it won't become copy_from_user(), nor will there be a TOCTOU race.
> > 
> > The idea is that seccomp will proactively copy the struct (and
> > recursively any of the struct pointers inside) before the syscall runs
> > -- as this is done by seccomp it doesn't require any copy_from_user()
> > primitives in cBPF. We then run the cBPF filter on the copied struct,
> > just like how cBPF programs currently operate on seccomp_data (how this
> > would be exposed to the cBPF program as part of the seccomp ABI is the
> > topic of discussion here).
> > 
> > Then, when the actual syscall code runs, the struct will have already
> > been copied and the syscall won't copy it again.
> 
> Let's take bpf syscall as an example.
> Are you suggesting that all of syscall logic of conditionally parsing
> the arguments will be copy-pasted into seccomp-syscall infra, then
> it will do copy_from_user() all the data and replace all aligned_u64
> in "union bpf_attr" with kernel copied pointers instead of user pointers
> and make all of bpf syscall's copy_from_user() actions to be conditional ?
> If seccomp is on, use kernel pointers... if seccomp is off, do copy_from_user ?
> And the same idea will be replicated for all syscalls?

This would be done optionally per-syscall. Only syscalls which want to
opt-in to such a mechanism (such as clone3 and openat2) would be
affected. Also, bpf is possibly the least-friendly syscall to pick as an
example of these types of filters -- openat2/clone3 is much simpler to
consider.

The point is that if we both agree that seccomp needs to have a way to
do "deep argument inspection" (filtering based on the struct argument to
a syscall), then some sort of caching mechanism is simply necessary to
solve the problem. Otherwise there's a trivial TOCTOU and seccomp
filtering for such syscalls would be rendered almost useless.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-19 10:26 ` Christian Brauner
@ 2020-05-20  8:23   ` Sargun Dhillon
  0 siblings, 0 replies; 22+ messages in thread
From: Sargun Dhillon @ 2020-05-20  8:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Tycho Andersen, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, Linux Containers, Linux API, LKML



On Tue, May 19, 2020 at 3:26 AM Christian Brauner <christian.brauner@ubuntu.com> wrote:
>
> On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> > Hi!
> >
> > This is my attempt at a brain-dump on my plans for nearish-term seccomp
> > features. Welcome to my TED talk... ;)
> >
> > These are the things I've been thinking about:
> >
> > - fd passing
> > - deep argument inspection
> > - changing structure sizes
> > - syscall bitmasks
> >
> > ## changing structure sizes
> >
> > Background: there have been regular needs to add things to various
> > seccomp structures. Each come with their own unique pains, and solving
> > this as completely as possible in a future-proof way would be very nice.
> >
> > As noted in "fd passing" above, there is a desire to add some useful
> > things to the user_notif struct (e.g. thread group pid). Similarly,
> > there have been requests in the past (though I can't find any good
> > references right now, just my indirect comments[3]) to grow seccomp_data.
> > Combined with the EA struct work above, there's a clear need for seccomp
> > to reexamine how it deals with its API structures (and how this
> > interacts with filters).
> >
> > First, let's consider seccomp_data. If we grow it, the EA struct offset
> > will move, based on the deep arg inspection design above. Alternatively,
> > we could instead put seccomp_data offset 0, and EA struct at offset
> > PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> > the filter access whatever it thinks is there, with it being zero-filled
> > by the kernel. For any values where 0 is valid, there will just need to
> > be a "is that field valid?" bit before it:
>
> That's exactly what clone3() - and openat2() too - I think is doing.
> (Case in point, CLONE_INTO_CGROUP with the .cgroup member referencing a
>  file descriptor for which 0 is obviously valid.)
>
> >
> >       unsigned long feature_bits;
> >       unsigned long interesting_thing_1;
> >       unsigned long interesting_thing_2;
> >       unsigned long interesting_thing_3;
> >       ...
> >
> > and the filter would check feature_bits...
> >
> > (However, this needs to be carefully considered given that seccomp_data
> > is embedded in user_notif... should the EA struct from userspace also be
> > copied into user_notif? More thoughts on this below...)
> >
> > For user_notif, I think we need something in and around these options:
> >
> > - make a new API that explicitly follows EA struct design
> >   (and while read()/write() might be easier[4], I tend to agree with
> >   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> >   for data". Though I wonder if read() could be used for the notifications,
> >   which ARE data, and use ioctl() for the responses?)
>
> I'm still pretty convinced that my original suggestion here to switch
> user_notif over to esyscall design is the way to go and probably the
> least complex.
>
I think that the big difference in the esyscall design is that the
nested structures don't require evolution. In addition, the kernel
deals with the burden of old versions of userspace structs without
requiring the user to do dynamic allocation.

> > - make a new API that is perf_event_open()-style where fields are
> >   explicitly requested, as Sargun suggested[5]. (This looks like it
> >   might be complex to construct, but would get us by far the most
> >   extensible API.)
> > - jam whatever we pick into the existing API (we'll be forced to do
> >   SOMETHING to make the old API still work, so, I dunno what that will
> >   look like until we finish the rest of the design).
> >
> > If we did a requested-fields approach, what would the user_notif event
> > block of bytes look like? Would it be entirely dynamic based on the
> > initial ioctl()? Another design consideration here is that we don't want
> > the kernel doing tons of work (especially copying) and tossing tons
> > of stuff into a huge structure that the user doesn't care about. In
> > addition to explicit fields, maybe the EA struct could be included,
> > perhaps with specified offset/size, so only the portion the user_notif
> > user wanted to inspect was copied?
> >
> > The complexity of the per-field API is higher, but I think it might be
> > the most robust and have the greatest chance at being performant.
> > For example, "send me user_notif but I only care about the pid" would
> > mean no syscall arguments are copied, etc.
>
> Before we go with such a flexible design we should ask who's going to
> use it that way? At least for our use-cass we always want everything.
> Maybe I'm missing other uses though. But most use-cases seem kinda
> pointless, e.g. just getting the pid seems not interesting. I could see
> a use-case for pid + syscall number without arguments and the notifier
> always replying with _CONTINUE but even that seems a little odd to me.
>
I think it's worth being clear about the perf_event_attr style API [1][2].
That's here there are two steps:
1. Configure what / how / versions of fields you want to receive. This looks
something like this:
struct seccomp_notif_config {
	/*
	 * Size of this data structure, indicating which 
	 * member fields are included.
	 */
	__u32	size;
	/* Bitmask of optional fields */
	__u64	optional_fixed_size_fields;
	/* Size indicates version. 0 means omit */
	__u32	size_of_variable_member1;
	__u32	size_of_variable_member2;
	/* New varible sized members are added at the end */
}
2. Then at read time, the user builds a structure like this on the stack:
struct notif_flat {
	/* Existing variable members cannot have their sizes changed */
	struct variable_member1 ...;
	struct variable_member2 ...;
	__u32 optional_field1;
	__u32 optional_field2;
}
And you can read (or ioctl) the notification into the buffer in one-go.

There is a slightly different alternative where the data structure
you share with the kernel looks more like:
struct notif_submembers {
	__u32 optional_field1;
	__u32 optional_field2;
}
struct notif_ptrs {
	/* Points are empty buffers the size of variable member */
	struct *variable_member1;
	struct *variable_member2;
	// You cannot interleave fixed fields, they may be at the end
	struct notif_submembers members;
}

I think that userspace signaling up to kernelspace what it wants is a better
design. I'm not sure how the kernel would tell userspace to lay out a structure
of mixed pointers, and flat fields, so I think you'd have something like this:

/* this datastructure is returned by the kernel */
struct seccom_sizes {
	__u32	num_members;
	/* Ack, VLA */
	__u32	member_sizes[];
	__u32	submember_size;;	
}

If we go down the route of notif_ptrs, it requires the kernel read in
the datastructure, find the pointers in userspace and copy out to a bunch of
other buffers. In addition, it means the userspace has to deal with managing
all these buffers (And have a mechanism like seccomp_notif_sizes to find the
number of structures, and their respective buffer size). Because this is
dynamic and not static (can vary based on kernel version), the user can no
longer rely on stack allocations / static programming, and has to do at least
some basic casting back and forth to void.

If we go down the route of notif_flat, it makes thing slightly nicer, unless
we want to upgrade (extend) a field without moving it to the bottom of the
datastructure. If we try to do in-place growth it will move around all
the other fields.

I think that perf_event_attr style API not only solves this problem more
cleanly (at least from the perspective of userspace), but it also means
less extra memory copies if we don't need to. I think that this is interesting
from the perspective of performance, and simpler userspace code without
having to do dynamic allocations for fields that I do not care about.
Although the fields we've discussed adding so far (pidfd, tgid) are tiny,
cheap, and easy to work with, they still mean that userspace has to
deal with dynamic memory allocation.

Mostly, I'd like to get clarity on:
1. Do we want to go down the route of perf_event_attr or continue to have
a mechanism where the kernel tells userspace how to layout a struct at runtime

If we want to do the latter, can someone counterpropose an API if they
have a better idea than that I've described aboe.
2. Do we want to want to open the door to read(2) for receiving events?
3. Do we want to *ever* send other things with events? (FDs) -- it sounds
   like we've answered this one with a no
4. Should the kernel be able to handle userspace applications without
requiring them to dynamic allocations?
5. Where does this become a performance concern?
6. Is there *anything* that is a good idea to pass back with the notification
(FDs, locks, etc...?)

-Sargun

[1]: https://lore.kernel.org/linux-api/20200518213734.GA25216@ircssh-2.c.rugged-nimbus-611.internal/
[2]: https://lore.kernel.org/linux-api/20200518083224.GA16270@ircssh-2.c.rugged-nimbus-611.internal/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
                   ` (4 preceding siblings ...)
  2020-05-19 14:07 ` Tycho Andersen
@ 2020-05-20  9:05 ` Sargun Dhillon
  2020-05-22 20:09 ` Sargun Dhillon
  6 siblings, 0 replies; 22+ messages in thread
From: Sargun Dhillon @ 2020-05-20  9:05 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Tycho Andersen, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, containers, linux-api, linux-kernel

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
> So, diving right in:
> 
> 
> ## fd passing
> 
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
> 
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
> 
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
> 	- get pidnr from user_notif_recv
> 	- open pidfd from pidnr
> 	- re-verify user_notif cookie is still valid
> 	- send new fd via pidfd
> 	- reply with user_notif_send
> 	- close pidfd
> 
> - how to deal with changing sizes of the user_notif structures to
>   include a pidnr. (Which will be its own topic below.)
> 
> 
> ## deep argument inspection
> 
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there were a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),
> openat2(), etc, which pass a pointer and a size:
> 
> long clone3(struct clone_args *cl_args, size_t size);
> 
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for
> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier the examination of using eBPF for user_notif
> bares this out).
> 
> In order for this to work, there are a number of prerequisites:
> 
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls needs to include awareness of potential seccomp
>     hooking. i.e. seccomp may have done the copy_from_user() already and
>     kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
>     hits these syscalls for a given filter, and put it somewhere for
>     later use by the syscall.
> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to be handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.
> 
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.
> 
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
> 
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
> 
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct with zeros to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)
> 
I may be ridiculed for suggesting this approach, and maybe it's a bit
mad-science. I'll be honest, my familiarity with mm is low, and although
I think what I'm describing can be done, I'm unsure of the complexity,
and performance impact.

I was playing with userfaultfd a while ago for somewhat related reasons
and I found a patchset supporting write protection [1]. This got me
interested in wondering if we could leverage this. What if we had a
mechanism to read a process's memory, and simultaneously mark it
as read-only.

#define memstruct_flag_mark_ro	1
#define notify_on_mod	2
struct memstruct {
	__u32 flags;
	/* We might even support multiple iovecs... */
	struct iovec *local_iov;
	struct iovec *remote_iov;
}
Flow:
1. Supervisor (s1) launches child (c1)
2. c1 makes syscall that triggers seccomp_notif -- let's say clone3 --
a call where we can't do injection or such on behalf of the child.
3. s1 receives notification + seccomp data
4. s1 calls ioctl(..., SECCOMP_READ_MEM, memstruct) (where memstruct is
above)
5. When s1 calls this, the pages which are accessed by the iovec get
marked as read only, so a fault is triggered if the user process tries
to change things.
6. s1 says, continue syscall
7. Syscall returns, and pages are reverted to pre-notification state.

Upon fault, if the range being changed lies outside of what was copied
back via the SECCOMP_READ_MEM ioctl, then it is passed through as-is.
If the range is inside, and notify_on_mod is unset, it will SIGSEGV.
If the range is inside, and notify_on_mod is set, we will have to do
a userfaultfd-like notification back down to userspace.

I realize this approach is hacky, and ugly, and potentially goes
into the same territory as userfaultfd, which has proven to be a security
thing, but the benefits are that no kernel, cBPF, etc.. need to change.

The only downsides I see are complexity, performance, and potentially
there may be some userspace programs innocously manipulating memory
that might get caught in such an iovec.

Thoughts?
 
> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.
> 
> 
I have no idea how to solve this if the process can do stuff like move
files around on disk, or switch out namespaces while in the user notification.


[1]: https://lwn.net/Articles/786896/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 21:04 seccomp feature development Kees Cook
                   ` (5 preceding siblings ...)
  2020-05-20  9:05 ` Sargun Dhillon
@ 2020-05-22 20:09 ` Sargun Dhillon
  6 siblings, 0 replies; 22+ messages in thread
From: Sargun Dhillon @ 2020-05-22 20:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Tycho Andersen, Matt Denton, Chris Palmer,
	Jeffrey Vander Stoep, containers, linux-api, linux-kernel

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
What's your take on enabling multiple filters with listeners being attached,
so that different seccomp interceptors can operate together. I'm wondering
how this would work.

One idea that I had is adding a new flag to the seccomp filter
installation -- something like NEXT_FILTER_COMPATIBLE. When a filter is
installed with a listener, it will check if all previous filters were
instaled with NEXT_FILTER_COMPATIBLE.

If the call is intercepted by a listener, and the return is overriden,
then it short-circuits, and the subsequent filters are not evaluated.

On the other hand, if the continue response is send, then the
subsequent filters are called.

What do you think?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: seccomp feature development
  2020-05-18 22:39 ` Jann Horn
  2020-05-19  2:48   ` Aleksa Sarai
  2020-05-19  7:24   ` Sargun Dhillon
@ 2020-06-03 19:09   ` Kees Cook
  2 siblings, 0 replies; 22+ messages in thread
From: Kees Cook @ 2020-06-03 19:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Tycho Andersen, Sargun Dhillon, Matt Denton,
	Chris Palmer, Jeffrey Vander Stoep, Linux Containers, Linux API,
	kernel list

[trying to get back to this thread -- I've been distracted]

On Tue, May 19, 2020 at 12:39:39AM +0200, Jann Horn wrote:
> On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > ## deep argument inspection
> >
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> Also, other than for syscall entry, it might be worth thinking about
> whether we want to have a special hook into seccomp for io_uring.
> io_uring is growing support for more and more syscalls, including
> things like openat2, connect, sendmsg, splice and so on, and that list
> is probably just going to grow in the future. If people start wanting
> to use io_uring in software with seccomp filters, it might be
> necessary to come up with some mechanism to prevent io_uring from
> permitting access to almost everything else...

/me perks up. Oh my, I hadn't been paying attention -- I thought this
was strictly for I/O ... like it's named. I will go read up.

> [...]
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall.
> 
> We could also do the same kind of thing the eBPF verifier does in
> convert_ctx_accesses(), and rewrite the context accesses to actually
> go through two different pointers depending on the (constant) offset
> into seccomp_data.

Ah, as in "for seccomp_data accesses, add offset $foo and for EA struct
add offset $bar"? Yeah, though my preference is to avoid rewriting the
filters as much as possible. But yes, that's a good point about not
requiring them be strictly contiguous.

> > I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> You probably mean the task struct?

Yup; think-o.

> > ## syscall bitmasks
> 
> YES PLEASE

I've got a working PoC for this now. It's sneaky.

> Other options:
>  - add a "load from read-only memory" opcode and permit specifying the
> data that should be in that memory when loading the filter

I think you've mentioned something like this before to me, but can you
remind me the details? If you mean RO userspace memory, don't we still
run the risk of racing mprotect, etc?

>  - make the seccomp API take an array of (syscall-number,
> instruction-offset) tuples, and start evaluation of the filter at an
> offset if it's one of those syscalls

To avoid making cBPF changes, yeah, perhaps have a way to add per-syscall
filters.

> One more thing that would be really nice: Is there a way we can have
> 64-bit registers in our seccomp filters? At the moment, every
> comparison has to turn into three ALU ops, which is pretty messy;
> libseccomp got that wrong (<https://crbug.com/project-zero/1769>), and
> it contributes to the horrific code Chrome's BPF generator creates.
> Here's some pseudocode from my hacky BPF disassembler, which shows
> pretty much just the filter code for filtering getpriority() and
> setpriority() in a Chrome renderer, with tons of useless dead code:
> 
> 0139         if args[0].high == 0x00000000: [true +3, false +0]
> 013e           if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140           if args[1].high == 0x00000000: [true +3, false +0]
> 0145             if args[1].low == 0x00000000: [true +149, false +0]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 0147             if args[1].high == 0x00000000: [true +3, false +0]
> 014c               if args[1].low == 0x00000001: [true +142, false
> +141] -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da               ret ERRNO
> 0148             if args[1].high != 0xffffffff: [true +142, false +0]
> -> ret TRAP
> 014a             if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0141           if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143           if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 013a         if args[0].high != 0xffffffff: [true +156, false +0] -> ret TRAP
> 013c         if args[0].low NO-COMMON-BITS 0x80000000: [true +154,
> false +0] -> ret TRAP
> 013e         if args[0].low != 0x00000000: [true +157, false +0] -> ret TRAP
> 0140         if args[1].high == 0x00000000: [true +3, false +0]
> 0145           if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147           if args[1].high == 0x00000000: [true +3, false +0]
> 014c             if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da             ret ERRNO
> 0148           if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a           if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0141         if args[1].high != 0xffffffff: [true +149, false +0] -> ret TRAP
> 0143         if args[1].low NO-COMMON-BITS 0x80000000: [true +147,
> false +0] -> ret TRAP
> 0145         if args[1].low == 0x00000000: [true +149, false +0] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 0147         if args[1].high == 0x00000000: [true +3, false +0]
> 014c           if args[1].low == 0x00000001: [true +142, false +141]
> -> ret ALLOW (syscalls: getpriority, setpriority)
> 01da           ret ERRNO
> 0148         if args[1].high != 0xffffffff: [true +142, false +0] -> ret TRAP
> 014a         if args[1].low NO-COMMON-BITS 0x80000000: [true +140,
> false +0] -> ret TRAP
> 014c         if args[1].low == 0x00000001: [true +142, false +141] ->
> ret ALLOW (syscalls: getpriority, setpriority)
> 01da         ret ERRNO
> 
> which is generated by this little snippet of C++ code:
> 
> ResultExpr RestrictGetSetpriority(pid_t target_pid) {
>   const Arg<int> which(0);
>   const Arg<int> who(1);
>   return If(which == PRIO_PROCESS,
>             Switch(who).CASES((0, target_pid), Allow()).Default(Error(EPERM)))
>       .Else(CrashSIGSYS());
> }

What would this look like in eBPF?

> On anything other than 32-bit MIPS, 32-bit powerpc and 32-bit sparc,
> we're actually already using the eBPF backend infrastructure... so
> maybe it would be an option to keep enforcing basically the same rules
> that we currently have for cBPF, but use the eBPF instruction format?

Yeah, I think this might be a good idea just for the reduction in
complexity for these things. The "unpriv BPF" problem needs to be solved
for this still, though, yes?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-06-03 19:09 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-18 21:04 seccomp feature development Kees Cook
2020-05-18 22:39 ` Jann Horn
2020-05-19  2:48   ` Aleksa Sarai
2020-05-19 10:35     ` Christian Brauner
2020-05-19 16:18     ` Alexei Starovoitov
2020-05-19 21:40       ` Kees Cook
2020-05-20  1:20       ` Aleksa Sarai
2020-05-20  5:17         ` Alexei Starovoitov
2020-05-20  6:18           ` Aleksa Sarai
2020-05-19  7:24   ` Sargun Dhillon
2020-05-19  8:30     ` Christian Brauner
2020-06-03 19:09   ` Kees Cook
2020-05-18 22:55 ` Andy Lutomirski
2020-05-19  7:09 ` Aleksa Sarai
2020-05-19 11:01   ` Christian Brauner
2020-05-19 13:42     ` Aleksa Sarai
2020-05-19 13:53       ` Aleksa Sarai
2020-05-19 10:26 ` Christian Brauner
2020-05-20  8:23   ` Sargun Dhillon
2020-05-19 14:07 ` Tycho Andersen
2020-05-20  9:05 ` Sargun Dhillon
2020-05-22 20:09 ` Sargun Dhillon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).