All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christian Brauner <christian.brauner@ubuntu.com>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: Kees Cook <keescook@chromium.org>,
	Tycho Andersen <tycho@tycho.ws>,
	Sargun Dhillon <sargun@sargun.me>,
	Matt Denton <mpdenton@google.com>,
	Chris Palmer <palmer@google.com>,
	Jeffrey Vander Stoep <jeffv@google.com>,
	Daniel Vetter <daniel@ffwll.ch>,
	containers@lists.linux-foundation.org, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: seccomp feature development
Date: Tue, 19 May 2020 13:01:28 +0200	[thread overview]
Message-ID: <20200519110128.oot4zzxhivtmk7us@wittgenstein> (raw)
In-Reply-To: <20200519070929.55r3xvybjg6nnbsj@yavin.dot.cyphar.com>

On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> > ## fd passing
> > 
> > Background: seccomp users want to be able to install an fd in a
> > monitored process during a user_notif to emulate "open" calls (or
> > similar), possibly across security boundaries, etc.
> > 
> > On the fd passing front, it seems that gaining pidfd_addfd() is the way
> > to go as it allows for generic use not tied to seccomp in particular.
> > I expect this feature will be developed orthogonally to seccomp (where
> > does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> > be friendlier to help with using it. Things that need to be resolved:
> > 
> > - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
> >   if we're going to step back and make some design choices here, is
> >   there a place for pidfds in seccomp user_notif, in order to avoid
> >   needing the user_notif cookie? I think probably not: it's a rather lot
> >   of overhead for notifications. It seems it's safe to perform an fd
> >   installation with these steps:
> > 	- get pidnr from user_notif_recv
> > 	- open pidfd from pidnr
> > 	- re-verify user_notif cookie is still valid
> > 	- send new fd via pidfd
> > 	- reply with user_notif_send
> > 	- close pidfd
> 
> I agree that while it would be conceptually nicer to not need the
> user_notif cookie, creating a new pidfd for every notification isn't a
> good idea (not to mention that the program may not care about the pid or
> pidfd when determining its response -- making the pidfd more of a pain
> than anything else).
> 
> You'd also need to have special-casing based on the userspace length of
> user_notif because old programs wouldn't know they'd have to close the
> pidfd returned (which then brings us back to the fact that user_notif
> doesn't specify what the usize of the struct is a-la
> copy_struct_from_user).
> 
> > ## deep argument inspection
> > 
> > Background: seccomp users would like to write filters that traverse
> > the user pointers passed into many syscalls, but seccomp can't do this
> > dereference for a variety of reasons (mostly involving race conditions and
> > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> > 
> > During the last plumbers and in conversations since, the grudging
> > consensus was reached that having seccomp do this for ALL syscalls was
> > likely going to be extremely disruptive for very little gain (i.e.
> > many things, like pathnames, have differing lifetimes, aliases, unstable
> > kernel object references, etc[6]), but that there were a small subset of
> > syscalls for which this WOULD be beneficial, and those are the newly
> > created "Extensible Argument" syscalls (is there a better name for this
> > design? I'm calling it "EA" for the rest of the email), like clone3(),
> > openat2(), etc, which pass a pointer and a size:
> 
> I called them "extensible struct" syscalls, but "extensible arguments"
> is probably a better name. Though if you want to call them EA, make sure
> to include the ™. ;)
> 
> > long clone3(struct clone_args *cl_args, size_t size);
> > 
> > I think it should be possible to extend seccomp to examine this structure
> > by appending it to seccomp_data, and allowing filters to examine the
> > contents. This means that no BPF language extensions are required for
> > seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> > seccomp's design principles work well with maps, kernel helpers, etc,
> > and I think the earlier the examination of using eBPF for user_notif
> > bares this out).
> 
> While I agree that we don't need to jump to eBPF for this, I do think
> we'll need an opcode for is_zeroed() (or bpf_check_uarg_tail_zero()) --
> see below.
> 
> > In order for this to work, there are a number of prerequisites:
> > 
> > - argument caching, in two halves: syscall side and seccomp side:
> >   - the EA syscalls needs to include awareness of potential seccomp
> >     hooking. i.e. seccomp may have done the copy_from_user() already and
> >     kept a cached copy.
> >   - seccomp needs to potentially DO the copy_from_user() itself when it
> >     hits these syscalls for a given filter, and put it somewhere for
> >     later use by the syscall.
> > 
> > The argument caching bit is, I think, rather mechanical in nature since
> > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > allocates seccomp_data (maybe going so far as to have it split across two
> > pages with the syscall argument struct always starting on the 2nd page
> > boundary), and copying the EA struct into that page, which will be both
> > used by the filter and by the syscall. I imagine state tracking ("is
> > there a cached EA?", "what is the address of seccomp_data?", "what is
> > the address of the EA?") can be associated with the thread struct.
> 
> I agree that while this is an important thing to handle properly, it's
> internal and probably doesn't deserve the lions share of design
> discussion.
> 
> > - the sizes of these EA structs are, by design, growable over time.
> >   seccomp and its users need to be handle this in a forward and backward
> >   compatible way, similar to the design of the EA syscall interface
> >   itself.
> > 
> > The growing size of the EA struct will need some API design. For filters
> > to operate on the contiguous seccomp_data+EA struct, the filter will
> > need to know how large seccomp_data is (more on this later), and how
> > large the EA struct is. When the filter is written in userspace, it can
> > do the math, point into the expected offsets, and get what it needs. For
> > this to work correctly in the kernel, though, the seccomp BPF verifier
> > needs to know the size of the EA struct as well, so it can correctly
> > perform the offset checking (as it currently does for just the
> > seccomp_data struct size).
> > 
> > Since there is not really any caller-based "seccomp state" associated
> > across seccomp(2) calls, I don't think we can add a new command to tell
> > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > the kernel doesn't track who "I" is besides just being "current", which
> > doesn't take into account the thread lifetime -- if a process launcher
> > knows about one size and the child knows about another, things will get
> > confused. The sizes really are just associated with individual filters,
> > based on the syscalls they're examining. So, I have thoughts on possible
> > solutions:
> > 
> > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> >   EA style so we can pass in more than a filter and include also an
> >   array of syscall to size mappings. (I don't like this...)
> > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> >   the meaning of the uarg from "filter" to a EA-style structure with
> >   sizes and pointers to the filter and an array of syscall to size
> >   mappings. (I like this slightly better, but I still don't like it.)
> > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> >   the "max offset" value seen during filter verification, and zero-fill
> >   the EA struct with zeros to that size when constructing the
> >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> >   filter doesn't care what any of the sizes are, and userspace doesn't
> >   care what any of the sizes are. (I like this as it makes the problems
> >   to solve contained entirely by the seccomp infrastructure and does not
> >   touch user API, but I worry I'm missing some gotcha I haven't
> >   considered.)
> 
> Okay, so here is my view on this. I think that the third option is
> closest to what I'd like to see. Based on Jann's email, I think we're on
> the same page but I'd just like to elaborate it a bit further:
> 
> First of all -- ideally, the backward and forward compatibility that EA
> syscalls give us should be reflected with seccomp filters being
> similarly compatible. Otherwise we're going to run into issues where all
> of the hard work with ensuring EA syscalls behave when extended will be
> less valuable if seccomp cannot handle it sufficiently. This means that
> I would hope that every combination of {old,new} filter/kernel/program
> would work on a best-effort (but fail-safe) basis.
> 
> In my view, the simplest way (from the kernel side) would be to simply
> do what you outlined in (3) -- make all accesses past usize (and even
> ksize) be zeroed.
> 
> However in order to make an old filter fail-safe on a new kernel with a
> new program, we'd need a new opcode which basically does
> bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> would punt the fail-safe problem to userspace (libseccomp would need to
> generate a check that any unknown-to-the-filter fields are zero). I
> don't think this is a decision we can make in-kernel because it might be
> that the filter doesn't care about the last field in a struct (and thus
> doesn't access it) but we don't know the difference between a field the
> filter doesn't care about and a field it doesn't know about.
> 
> Another issue which you haven't mentioned here is how do we deal with
> pointers inside an EA struct. clone3() already has this with set_tid,

It's also passed with a set_tid_size argument so we'd know how large
set_tid is.

> and the thread on user_notif seems to be converging towards having the
> user_notif2 struct just contain two pointers to EA structs. We could
> punt on this for now, so long as we leave room for us to deal with this
> problem in the future. However, I think it would be misguided to enable
> deep argument inspection on syscalls which have such entries (such as
> clone3) until we've figured out how we want to handle nested pointers --

We can't really punt on this and should figure this out now. I've
pointed to this problem in my other mail as well.
Though I currently fail to see why this should be a huge problem to get
that working as long as each pointer comes with a known size.

> at the very least because the bpf_check_uarg_tail_zero() opcode
> semantics need to consider this problem.
> 
> A silly proposal I had was that because the EA would never be bigger
> than PAGE_SIZE, we could take the offset into the EA and multiply it by
> PAGE_SIZE each time we wanted to do a "dereference" of a pointer in EA.

If you have a size to it it should be ok? I'm probably missing
something.

Christian

  reply	other threads:[~2020-05-19 11:01 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-18 21:04 seccomp feature development Kees Cook
2020-05-18 22:39 ` Jann Horn
2020-05-19  2:48   ` Aleksa Sarai
2020-05-19 10:35     ` Christian Brauner
2020-05-19 16:18     ` Alexei Starovoitov
2020-05-19 21:40       ` Kees Cook
2020-05-20  1:20       ` Aleksa Sarai
2020-05-20  5:17         ` Alexei Starovoitov
2020-05-20  6:18           ` Aleksa Sarai
2020-05-19  7:24   ` Sargun Dhillon
2020-05-19  8:30     ` Christian Brauner
2020-06-03 19:09   ` Kees Cook
2020-05-18 22:55 ` Andy Lutomirski
2020-05-19  7:09 ` Aleksa Sarai
2020-05-19 11:01   ` Christian Brauner [this message]
2020-05-19 13:42     ` Aleksa Sarai
2020-05-19 13:53       ` Aleksa Sarai
2020-05-19 10:26 ` Christian Brauner
2020-05-20  8:23   ` Sargun Dhillon
2020-05-19 14:07 ` Tycho Andersen
2020-05-20  9:05 ` Sargun Dhillon
2020-05-22 20:09 ` Sargun Dhillon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200519110128.oot4zzxhivtmk7us@wittgenstein \
    --to=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=cyphar@cyphar.com \
    --cc=daniel@ffwll.ch \
    --cc=jeffv@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpdenton@google.com \
    --cc=palmer@google.com \
    --cc=sargun@sargun.me \
    --cc=tycho@tycho.ws \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.