On 2020-05-19, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-05-19, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> > On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > > On 2020-05-18, Kees Cook <keescook@chromium.org> wrote:
> > > > - the sizes of these EA structs are, by design, growable over time.
> > > >   seccomp and its users need to be handle this in a forward and backward
> > > >   compatible way, similar to the design of the EA syscall interface
> > > >   itself.
> > > > 
> > > > The growing size of the EA struct will need some API design. For filters
> > > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > > need to know how large seccomp_data is (more on this later), and how
> > > > large the EA struct is. When the filter is written in userspace, it can
> > > > do the math, point into the expected offsets, and get what it needs. For
> > > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > > needs to know the size of the EA struct as well, so it can correctly
> > > > perform the offset checking (as it currently does for just the
> > > > seccomp_data struct size).
> > > > 
> > > > Since there is not really any caller-based "seccomp state" associated
> > > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > > the kernel doesn't track who "I" is besides just being "current", which
> > > > doesn't take into account the thread lifetime -- if a process launcher
> > > > knows about one size and the child knows about another, things will get
> > > > confused. The sizes really are just associated with individual filters,
> > > > based on the syscalls they're examining. So, I have thoughts on possible
> > > > solutions:
> > > > 
> > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > > >   EA style so we can pass in more than a filter and include also an
> > > >   array of syscall to size mappings. (I don't like this...)
> > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > > >   the meaning of the uarg from "filter" to a EA-style structure with
> > > >   sizes and pointers to the filter and an array of syscall to size
> > > >   mappings. (I like this slightly better, but I still don't like it.)
> > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > > >   the "max offset" value seen during filter verification, and zero-fill
> > > >   the EA struct with zeros to that size when constructing the
> > > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > > >   care what any of the sizes are. (I like this as it makes the problems
> > > >   to solve contained entirely by the seccomp infrastructure and does not
> > > >   touch user API, but I worry I'm missing some gotcha I haven't
> > > >   considered.)
> > > 
> > > Okay, so here is my view on this. I think that the third option is
> > > closest to what I'd like to see. Based on Jann's email, I think we're on
> > > the same page but I'd just like to elaborate it a bit further:
> > > 
> > > First of all -- ideally, the backward and forward compatibility that EA
> > > syscalls give us should be reflected with seccomp filters being
> > > similarly compatible. Otherwise we're going to run into issues where all
> > > of the hard work with ensuring EA syscalls behave when extended will be
> > > less valuable if seccomp cannot handle it sufficiently. This means that
> > > I would hope that every combination of {old,new} filter/kernel/program
> > > would work on a best-effort (but fail-safe) basis.
> > > 
> > > In my view, the simplest way (from the kernel side) would be to simply
> > > do what you outlined in (3) -- make all accesses past usize (and even
> > > ksize) be zeroed.
> > > 
> > > However in order to make an old filter fail-safe on a new kernel with a
> > > new program, we'd need a new opcode which basically does
> > > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > > would punt the fail-safe problem to userspace (libseccomp would need to
> > > generate a check that any unknown-to-the-filter fields are zero). I
> > > don't think this is a decision we can make in-kernel because it might be
> > > that the filter doesn't care about the last field in a struct (and thus
> > > doesn't access it) but we don't know the difference between a field the
> > > filter doesn't care about and a field it doesn't know about.
> > > 
> > > Another issue which you haven't mentioned here is how do we deal with
> > > pointers inside an EA struct. clone3() already has this with set_tid,
> > 
> > It's also passed with a set_tid_size argument so we'd know how large
> > set_tid is.
> 
> Right, of course -- that's a given. See below for the point I was
> making.
> 
> > > and the thread on user_notif seems to be converging towards having the
> > > user_notif2 struct just contain two pointers to EA structs. We could
> > > punt on this for now, so long as we leave room for us to deal with this
> > > problem in the future. However, I think it would be misguided to enable
> > > deep argument inspection on syscalls which have such entries (such as
> > > clone3) until we've figured out how we want to handle nested pointers --
> > 
> > We can't really punt on this and should figure this out now. I've
> > pointed to this problem in my other mail as well.
> > Though I currently fail to see why this should be a huge problem to get
> > that working as long as each pointer comes with a known size.
> 
> It's not a huge problem necessarily, but we would need to figure out how
> we would represent the "nested pointers" in the flattened version which
> the seccomp filter will actually offset into. Specifically, my point is
> that if we imagine that the proposed new seccomp_data API is:
> 
>   [<seccomp_data>][<EA struct contents>]
> 
> then any pointers in the EA struct will just be numerical values when
> copied. How would write a filter on clone3() that requires the first
> entry of set_tid to be 34? You couldn't with this simplified setup. And
> unless I'm mistaken, BPF doesn't have pointer dereferences.
> 
> So we need to embed the "nested pointers" somehow in the flattened
> version of the EA struct. I don't think we can just append them to the
> main EA struct -- while you might have the length, that'd mean that
> seccomp would generate an array of zeroes whose length is a
> user-specified value? And then what about extensible structs that have
> their size embedded inside them (as you or someone else suggested for
> seccomp user_notif)? How would the eBPF program know the length of the
> struct?
> 
> Maybe we could have some kind of jump table which the filter could use
> (and just have there be a jump table for each embedded struct -- so
> multiple embeddings have their own jump tables). So if we imagine some
> user-extensible struct like
> 
> 	struct open_how {
> 		u64 flags, mode, resolve;
> 		struct foo *foo;
> 		struct bar *bar;
> 		struct baz *baz;
> 	};
> 
> 	struct foo {
> 		u64 foo;
> 	};
> 
> 	struct bar {
> 		u64 bar;
> 		struct barbaz *barbaz;
> 	};
> 
> 	struct barbaz {
> 		u64 barbaz;
> 	};
> 
> 	struct baz {
> 		u64 baz;
> 	};

Sorry, I omitted all of the _size fields for extensibility. Just imagine
I added them (or that the u64s represent the size).

> Then we would lay out the seccomp_data as:
> 
> 	[<seccomp data>][<jump table><open_how>]
> 		[<jump table><open_how->foo>]
> 		[<jump table><open_how->bar>]
> 			[<jump table><open_how->bar->barbaz>]
> 		[<jump table><open_how->baz>]
> 
> The jump table could be as simple as a list of tuples (relative offset
> of field in this struct from end of jump table, relative offset of the
> copied structure at the end of the jump table). However, it's not
> possible to do for-loops in cBPF -- so maybe we'd need to represent the
> jump table differently (perhaps having it just be the same size as the
> struct).
> 
> The above is just a rough sketch, maybe there's a much simpler solution
> I'm not seeing.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>