On 2020-05-19, Aleksa Sarai wrote: > On 2020-05-19, Christian Brauner wrote: > > On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote: > > > On 2020-05-18, Kees Cook wrote: > > > > - the sizes of these EA structs are, by design, growable over time. > > > > seccomp and its users need to be handle this in a forward and backward > > > > compatible way, similar to the design of the EA syscall interface > > > > itself. > > > > > > > > The growing size of the EA struct will need some API design. For filters > > > > to operate on the contiguous seccomp_data+EA struct, the filter will > > > > need to know how large seccomp_data is (more on this later), and how > > > > large the EA struct is. When the filter is written in userspace, it can > > > > do the math, point into the expected offsets, and get what it needs. For > > > > this to work correctly in the kernel, though, the seccomp BPF verifier > > > > needs to know the size of the EA struct as well, so it can correctly > > > > perform the offset checking (as it currently does for just the > > > > seccomp_data struct size). > > > > > > > > Since there is not really any caller-based "seccomp state" associated > > > > across seccomp(2) calls, I don't think we can add a new command to tell > > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since > > > > the kernel doesn't track who "I" is besides just being "current", which > > > > doesn't take into account the thread lifetime -- if a process launcher > > > > knows about one size and the child knows about another, things will get > > > > confused. The sizes really are just associated with individual filters, > > > > based on the syscalls they're examining. So, I have thoughts on possible > > > > solutions: > > > > > > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the > > > > EA style so we can pass in more than a filter and include also an > > > > array of syscall to size mappings. (I don't like this...) > > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes > > > > the meaning of the uarg from "filter" to a EA-style structure with > > > > sizes and pointers to the filter and an array of syscall to size > > > > mappings. (I like this slightly better, but I still don't like it.) > > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record > > > > the "max offset" value seen during filter verification, and zero-fill > > > > the EA struct with zeros to that size when constructing the > > > > seccomp_data + EA struct that the filter will examine. Then the seccomp > > > > filter doesn't care what any of the sizes are, and userspace doesn't > > > > care what any of the sizes are. (I like this as it makes the problems > > > > to solve contained entirely by the seccomp infrastructure and does not > > > > touch user API, but I worry I'm missing some gotcha I haven't > > > > considered.) > > > > > > Okay, so here is my view on this. I think that the third option is > > > closest to what I'd like to see. Based on Jann's email, I think we're on > > > the same page but I'd just like to elaborate it a bit further: > > > > > > First of all -- ideally, the backward and forward compatibility that EA > > > syscalls give us should be reflected with seccomp filters being > > > similarly compatible. Otherwise we're going to run into issues where all > > > of the hard work with ensuring EA syscalls behave when extended will be > > > less valuable if seccomp cannot handle it sufficiently. This means that > > > I would hope that every combination of {old,new} filter/kernel/program > > > would work on a best-effort (but fail-safe) basis. > > > > > > In my view, the simplest way (from the kernel side) would be to simply > > > do what you outlined in (3) -- make all accesses past usize (and even > > > ksize) be zeroed. > > > > > > However in order to make an old filter fail-safe on a new kernel with a > > > new program, we'd need a new opcode which basically does > > > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This > > > would punt the fail-safe problem to userspace (libseccomp would need to > > > generate a check that any unknown-to-the-filter fields are zero). I > > > don't think this is a decision we can make in-kernel because it might be > > > that the filter doesn't care about the last field in a struct (and thus > > > doesn't access it) but we don't know the difference between a field the > > > filter doesn't care about and a field it doesn't know about. > > > > > > Another issue which you haven't mentioned here is how do we deal with > > > pointers inside an EA struct. clone3() already has this with set_tid, > > > > It's also passed with a set_tid_size argument so we'd know how large > > set_tid is. > > Right, of course -- that's a given. See below for the point I was > making. > > > > and the thread on user_notif seems to be converging towards having the > > > user_notif2 struct just contain two pointers to EA structs. We could > > > punt on this for now, so long as we leave room for us to deal with this > > > problem in the future. However, I think it would be misguided to enable > > > deep argument inspection on syscalls which have such entries (such as > > > clone3) until we've figured out how we want to handle nested pointers -- > > > > We can't really punt on this and should figure this out now. I've > > pointed to this problem in my other mail as well. > > Though I currently fail to see why this should be a huge problem to get > > that working as long as each pointer comes with a known size. > > It's not a huge problem necessarily, but we would need to figure out how > we would represent the "nested pointers" in the flattened version which > the seccomp filter will actually offset into. Specifically, my point is > that if we imagine that the proposed new seccomp_data API is: > > [][] > > then any pointers in the EA struct will just be numerical values when > copied. How would write a filter on clone3() that requires the first > entry of set_tid to be 34? You couldn't with this simplified setup. And > unless I'm mistaken, BPF doesn't have pointer dereferences. > > So we need to embed the "nested pointers" somehow in the flattened > version of the EA struct. I don't think we can just append them to the > main EA struct -- while you might have the length, that'd mean that > seccomp would generate an array of zeroes whose length is a > user-specified value? And then what about extensible structs that have > their size embedded inside them (as you or someone else suggested for > seccomp user_notif)? How would the eBPF program know the length of the > struct? > > Maybe we could have some kind of jump table which the filter could use > (and just have there be a jump table for each embedded struct -- so > multiple embeddings have their own jump tables). So if we imagine some > user-extensible struct like > > struct open_how { > u64 flags, mode, resolve; > struct foo *foo; > struct bar *bar; > struct baz *baz; > }; > > struct foo { > u64 foo; > }; > > struct bar { > u64 bar; > struct barbaz *barbaz; > }; > > struct barbaz { > u64 barbaz; > }; > > struct baz { > u64 baz; > }; Sorry, I omitted all of the _size fields for extensibility. Just imagine I added them (or that the u64s represent the size). > Then we would lay out the seccomp_data as: > > [][] > [foo>] > [bar>] > [bar->barbaz>] > [baz>] > > The jump table could be as simple as a list of tuples (relative offset > of field in this struct from end of jump table, relative offset of the > copied structure at the end of the jump table). However, it's not > possible to do for-loops in cBPF -- so maybe we'd need to represent the > jump table differently (perhaps having it just be the same size as the > struct). > > The above is just a rough sketch, maybe there's a much simpler solution > I'm not seeing. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH