From: Christian Brauner
Subject: [Ksummit-discuss] [TECH TOPIC] seccomp
Date: Fri, 19 Jul 2019 11:35:54 +0200

Hey everyone,

I would like to discuss approaches to enabling deep argument inspection
with seccomp. If we reach an agreement, I am also happy to do the work
and implement it.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF, which
enables a process (watchee) to retrieve an fd for its seccomp filter.
This fd can then be handed to another (usually more privileged)
process (watcher).
The watcher will then be able to receive seccomp messages about the
syscalls performed by the watchee.

I have integrated this feature into userspace. We currently make heavy
use of this to intercept mknod() syscalls in user namespaces aka in
If the mknod() syscall matches a device in a pre-determined whitelist,
the privileged watcher will perform the mknod() syscall in lieu of the
unprivileged watchee and report back to the watchee on the success or
failure of its attempt. If the syscall does not match a device in the
whitelist, we simply report an error.

We recently also started to intercept the setxattr() syscall to allow
the creation of various, well-known xattrs including

The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so accidentally intercepting and denying syscalls that are not in the
whitelist is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), setxattr() and many other syscalls that we would
like to intercept suffer from two major problems:
1. they are not easily filterable like mknod() because they have pointer
   arguments
2. some of them might actually succeed in user namespaces already (e.g.
   fscaps etc.)
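
To illustrate problem 1., here is a small userspace sketch. A seccomp
filter only sees the raw register values of a syscall, so for
setxattr() it sees the address of the name string, not its contents.
The struct mock_seccomp_data type mirrors the layout of struct
seccomp_data from <linux/seccomp.h>. The mock_setxattr() and
same_name_by_register() helpers are hypothetical names used for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Mock mirroring the layout of struct seccomp_data in <linux/seccomp.h>. */
struct mock_seccomp_data {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/*
 * Record what a filter would see for
 * setxattr(path, name, value, size, flags): pointers are stored as
 * plain addresses, the memory behind them is not part of the data.
 */
static struct mock_seccomp_data mock_setxattr(int nr, const char *path,
					      const char *name,
					      const void *value,
					      size_t size, int flags)
{
	struct mock_seccomp_data sd;

	assert(name != NULL);
	memset(&sd, 0, sizeof(sd));
	sd.nr = nr;
	sd.args[0] = (uint64_t)(uintptr_t)path;
	sd.args[1] = (uint64_t)(uintptr_t)name;
	sd.args[2] = (uint64_t)(uintptr_t)value;
	sd.args[3] = (uint64_t)size;
	sd.args[4] = (uint64_t)flags;
	return sd;
}

/* Compare the name argument the only way a register-based filter can. */
static bool same_name_by_register(const struct mock_seccomp_data *a,
				  const struct mock_seccomp_data *b)
{
	return a->args[1] == b->args[1];
}
```

Two calls that pass the same xattr name from different buffers look
different at the register level, which is why a filter cannot match on
the name itself.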

The first problem is not specific to SECCOMP_RET_USER_NOTIF but
apparently also affects future system call design.
We recently merged the clone3() syscall into mainline, which moves the
flags from a register argument into a dedicated extensible struct
clone_args to lift the flag limit of legacy clone() and to allow for
extensions while supporting all legacy workloads.

One of the counterarguments leveled against my design early on was
that this means clone3() cannot be easily filtered by seccomp because
of problem 1.
This argument was fortunately not seen as decisive.
I would argue that there sure is value in trying to design syscalls that
can be handled by seccomp nicely but that seccomp can't become a burden
on designing extensible syscalls.
The currently proposed openat2() syscall also uses a dedicated
argument struct which contains flags, and the seccomp argument popped
back up again.

In light of all this, I would argue that we should seriously look into
extending seccomp to allow filtering on pointer arguments.

There is a close connection between problems 1. and 2. When a watcher
intercepts a syscall from a watchee and starts to inspect its arguments,
it can - depending on the syscall, rather often actually - determine
whether the syscall would succeed or fail. If it knows that the syscall will
succeed it currently still has to perform it in lieu of the watchee
since there is no way to tell the kernel to "resume" or actually perform
the syscall. It would be nice if we could discuss approaches to enabling
this feature as well.

I'm happy to lead this session and can also illustrate how this feature
is heavily used and how we run into its limitations.

