* [Ksummit-discuss] [TECH TOPIC] seccomp feature development @ 2020-05-20 16:17 Kees Cook 2020-05-20 16:31 ` Al Viro 0 siblings, 1 reply; 16+ messages in thread From: Kees Cook @ 2020-05-20 16:17 UTC (permalink / raw) To: ksummit-discuss As recently outlined[1], there are are a number of seccomp topics that need discussion: - fd passing - deep argument inspection - changing structure sizes - syscall bitmasks Specifically, seccomp needs to grow the ability to inspect Extensible Argument syscalls, which requires that it inspect userspace memory without Time-of-Check/Time-of-Use races and without double-copying. Additionally, since the structures can grow and be nested, there needs to be a way to deal with flattening the arguments into a linear buffer that can be examined by seccomp's BPF dialect. All of this also needs to be handled by the USER_NOTIF implementation. Finally, fd passing needs to be finished, and there needs to be an exploration of syscall bitmasks to augment the existing filters to gain back some performance. -Kees (This has been submitted to the LPC site as well[2].) [1] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/ [2] https://linuxplumbersconf.org/event/7/abstracts/596/ -- Kees Cook _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 16:17 [Ksummit-discuss] [TECH TOPIC] seccomp feature development Kees Cook @ 2020-05-20 16:31 ` Al Viro 2020-05-20 18:05 ` Kees Cook 0 siblings, 1 reply; 16+ messages in thread From: Al Viro @ 2020-05-20 16:31 UTC (permalink / raw) To: Kees Cook; +Cc: ksummit-discuss On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote: > As recently outlined[1], there are are a number of seccomp topics that > need discussion: > > - fd passing > - deep argument inspection > - changing structure sizes > - syscall bitmasks > > Specifically, seccomp needs to grow the ability to inspect Extensible > Argument syscalls, which requires that it inspect userspace memory > without Time-of-Check/Time-of-Use races and without double-copying. > Additionally, since the structures can grow and be nested, there needs > to be a way to ... catch and kill the bullshit ABI "enhancements" that would attempt that kind of garbage. I'm not sure why that is seccomp-related, though... _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 16:31 ` Al Viro @ 2020-05-20 18:05 ` Kees Cook 2020-05-20 18:16 ` Al Viro 2020-05-20 18:27 ` Linus Torvalds 0 siblings, 2 replies; 16+ messages in thread From: Kees Cook @ 2020-05-20 18:05 UTC (permalink / raw) To: Al Viro; +Cc: ksummit-discuss On Wed, May 20, 2020 at 05:31:02PM +0100, Al Viro wrote: > On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote: > > As recently outlined[1], there are are a number of seccomp topics that > > need discussion: > > > > - fd passing > > - deep argument inspection > > - changing structure sizes > > - syscall bitmasks > > > > Specifically, seccomp needs to grow the ability to inspect Extensible > > Argument syscalls, which requires that it inspect userspace memory > > without Time-of-Check/Time-of-Use races and without double-copying. > > Additionally, since the structures can grow and be nested, there needs > > to be a way to > > ... catch and kill the bullshit ABI "enhancements" that would attempt that > kind of garbage. I'm not sure why that is seccomp-related, though... We already have structs passed to syscalls that contain pointers to yet more structs. Do you mean those should be disallowed? (Personally, I would love that, but this doesn't seem to match the reality of the design considerations of those syscalls.) -- Kees Cook _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 18:05 ` Kees Cook @ 2020-05-20 18:16 ` Al Viro 2020-05-20 18:27 ` Linus Torvalds 1 sibling, 0 replies; 16+ messages in thread From: Al Viro @ 2020-05-20 18:16 UTC (permalink / raw) To: Kees Cook; +Cc: ksummit-discuss On Wed, May 20, 2020 at 11:05:58AM -0700, Kees Cook wrote: > On Wed, May 20, 2020 at 05:31:02PM +0100, Al Viro wrote: > > On Wed, May 20, 2020 at 09:17:41AM -0700, Kees Cook wrote: > > > As recently outlined[1], there are are a number of seccomp topics that > > > need discussion: > > > > > > - fd passing > > > - deep argument inspection > > > - changing structure sizes > > > - syscall bitmasks > > > > > > Specifically, seccomp needs to grow the ability to inspect Extensible > > > Argument syscalls, which requires that it inspect userspace memory > > > without Time-of-Check/Time-of-Use races and without double-copying. > > > Additionally, since the structures can grow and be nested, there needs > > > to be a way to > > > > ... catch and kill the bullshit ABI "enhancements" that would attempt that > > kind of garbage. I'm not sure why that is seccomp-related, though... > > We already have structs passed to syscalls that contain pointers to yet > more structs. Do you mean those should be disallowed? (Personally, I > would love that, but this doesn't seem to match the reality of the > design considerations of those syscalls.) We have no chance to kill the existing ones off, but we certainly can stop accepting new ones. I would make an exception for struct iovec arrays passed as an argument, provided that the data they refer to is opaque, and while pselect6() kind of kludges are bloody revolting, they might be inevitable in some cases - not without a very good explanation from their authors, though, and "I wanna have 18 arguments; whaddya mean, don't?!" is not one. _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 18:05 ` Kees Cook 2020-05-20 18:16 ` Al Viro @ 2020-05-20 18:27 ` Linus Torvalds 2020-05-20 19:04 ` Kees Cook 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2020-05-20 18:27 UTC (permalink / raw) To: Kees Cook; +Cc: ksummit On Wed, May 20, 2020 at 11:06 AM Kees Cook <keescook@chromium.org> wrote: > We already have structs passed to syscalls that contain pointers to yet > more structs. Give real examples of where this matters for security, please, and where somebody would want to control this. Yes, we have things like clone3() that pass a struct with pointers to user space (eg the wait location etc). Yes, we have tons of things like ioctl's that have random struct pointer arguments with random contents. Yes, we have iovec's etc that have arrays of pointers to user space. But no, none of these seem to be things that seccomp should care about. So I am not in the least interested in some kind of general discussion about system calls with "pointers to pointers". They exist. Deal with it. It's not in the least an interesting issue, and no, we shouldn't make seccomp and friends incredibly more complicated for it. If you want to do sandboxing, you disallow those things entirely if you don't trust them, or make the case-by-case argument for why they don't matter. If you want to do something fancier (special compat emulation using seccomp and a ptrace fallback? I dunno) you are going to just have to deal with it. It's not simple, but it's not the kernels problem. You may have to emulate it *entirely* in the ptracer (ie instead of "check the arguments and let it continue" you _actually_ emulate it to avoid any races) And if you have some actual and imminent real security issue, you mention _that_ and explain _that_, and accept that maybe you need to do that expensive emulation (because the kernel people just don't care about your private hang-ups) or you need to explain why it's a real issue and why the kernel should help with your odd special case. Don't make this some kind of abstract conceptual problem thing. Because it's not. Some computer scientists think that everythinig should be really generic and solutions that solve some problem for every possible case are the only good solutions. But those people are wrong. The thing that _really_ matters is the details. Not the nebulous theory. So details, please. Linus _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 18:27 ` Linus Torvalds @ 2020-05-20 19:04 ` Kees Cook 2020-05-20 19:08 ` Linus Torvalds 2020-05-20 22:12 ` Alexei Starovoitov 0 siblings, 2 replies; 16+ messages in thread From: Kees Cook @ 2020-05-20 19:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: ksummit On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote: > Don't make this some kind of abstract conceptual problem thing. > Because it's not. I have no intention of making this abstract (the requests for expanding seccomp coverage have been for only a select class of syscalls, and specifically clone3 and openat2) nor more complicated than it needs to be (I regularly resist expanding the seccomp BPF dialect into eBPF). > So details, please. We've been discussing it all here: https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/ The example given in the thread was dealing with things like clone3's struct clone_args's set_tid member, which is a pointer to a dynamically sized array. Things seccomp is NOT expected to introspect due to complexity would be stuff like the bpf() syscall. Perhaps the question is "how deeply does seccomp need to inspect?" and maybe it does not get to see anything beyond just the "top level" struct (i.e. struct clone_args) and all pointers within THAT become opaque? That certainly simplifies the design. -- Kees Cook _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 19:04 ` Kees Cook @ 2020-05-20 19:08 ` Linus Torvalds 2020-05-20 20:24 ` Christian Brauner 2020-05-20 22:12 ` Alexei Starovoitov 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2020-05-20 19:08 UTC (permalink / raw) To: Kees Cook; +Cc: ksummit On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > Things seccomp is NOT expected to introspect due to complexity would be > stuff like the bpf() syscall. Right. I don't dispute at all that those kinds of pointer-to-pointer things exist all over. But: > Perhaps the question is "how deeply does seccomp need to inspect?" > and maybe it does not get to see anything beyond just the "top level" > struct (i.e. struct clone_args) and all pointers within THAT become > opaque? That certainly simplifies the design. Exactly. I think that's the most common situation by far. Does anybody really really need to care at a deep level, and why? Linus _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 19:08 ` Linus Torvalds @ 2020-05-20 20:24 ` Christian Brauner 2020-05-20 20:52 ` Kees Cook 0 siblings, 1 reply; 16+ messages in thread From: Christian Brauner @ 2020-05-20 20:24 UTC (permalink / raw) To: Linus Torvalds, Kees Cook; +Cc: ksummit On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > > Things seccomp is NOT expected to introspect due to complexity would be > > stuff like the bpf() syscall. > > Right. > > I don't dispute at all that those kinds of pointer-to-pointer things > exist all over. > > But: > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > and maybe it does not get to see anything beyond just the "top level" > > struct (i.e. struct clone_args) and all pointers within THAT become > > opaque? That certainly simplifies the design. > > Exactly. I think that's the most common situation by far. Does anybody > really really need to care at a deep level, and why? We mostly don't and making all second-level pointers opaque is ok imho. First, I don't think we need to really nest structs. (We have netlink for that.) Second, features for such syscall that require other pointers can and usually will be placed under a flag in the first-level struct. If that's filterable you have the option to turn that of based on the flag. As long as the flag identifies one feature and not one feature that can have other features it's no different from filtering a simple flag anyway. Even for clone3() it only has one feature that has a pointer in the struct and that's for checkpoint/restore and lets them select a specific pid and it comes with a size argument that is capped by the maximum nesting depth of pid namespaces in the kernel. So if you see that the size argument is not 0 in the first level struct you can disallow that too same as if it were placed under the flag. So no second-level nesting required. Probably the first level pointer is alreay making some people vomit but it's useful and for some syscalls almost cannot be avoided. But I think that we need some documented consensus on all that stuff which I stressed in other mails before. I'll hand something in about this, if that's ok than we can hash this out. Christian _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 20:24 ` Christian Brauner @ 2020-05-20 20:52 ` Kees Cook 2020-05-20 21:02 ` Christian Brauner 2020-05-22 4:06 ` Aleksa Sarai 0 siblings, 2 replies; 16+ messages in thread From: Kees Cook @ 2020-05-20 20:52 UTC (permalink / raw) To: Aleksa Sarai, Christian Brauner; +Cc: ksummit On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote: > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > > and maybe it does not get to see anything beyond just the "top level" > > > struct (i.e. struct clone_args) and all pointers within THAT become > > > opaque? That certainly simplifies the design. > > > > Exactly. I think that's the most common situation by far. Does anybody > > really really need to care at a deep level, and why? > > We mostly don't and making all second-level pointers opaque is ok imho. That'll make things MUCH easier. :) > But I think that we need some documented consensus on all that stuff > which I stressed in other mails before. I'll hand something in about > this, if that's ok than we can hash this out. Aleksa, I know you had an entire presentation[1] on the extensible argument syscalls, but was there any text-based design doc that you made? It would be really nice to update Documentation/process/adding-syscalls.rst with the specifics[2], and to (now) include the "no nested flags" requirement. What do you think? -Kees [1] https://github.com/cyphar/talks/tree/master/2020/01-linux-conf-au/syscall-extensions https://www.youtube.com/watch?v=ggD-eb3yPVs [2] https://www.kernel.org/doc/html/latest/process/adding-syscalls.html?highlight=syscall#designing-the-api-planning-for-extension -- Kees Cook _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 20:52 ` Kees Cook @ 2020-05-20 21:02 ` Christian Brauner 2020-05-22 4:06 ` Aleksa Sarai 1 sibling, 0 replies; 16+ messages in thread From: Christian Brauner @ 2020-05-20 21:02 UTC (permalink / raw) To: Kees Cook; +Cc: Aleksa Sarai, ksummit On Wed, May 20, 2020 at 01:52:30PM -0700, Kees Cook wrote: > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote: > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > > > and maybe it does not get to see anything beyond just the "top level" > > > > struct (i.e. struct clone_args) and all pointers within THAT become > > > > opaque? That certainly simplifies the design. > > > > > > Exactly. I think that's the most common situation by far. Does anybody > > > really really need to care at a deep level, and why? > > > > We mostly don't and making all second-level pointers opaque is ok imho. > > That'll make things MUCH easier. :) > > > But I think that we need some documented consensus on all that stuff > > which I stressed in other mails before. I'll hand something in about > > this, if that's ok than we can hash this out. > > Aleksa, I know you had an entire presentation[1] on the extensible > argument syscalls, but was there any text-based design doc that you made? > > It would be really nice to update Documentation/process/adding-syscalls.rst > with the specifics[2], and to (now) include the "no nested flags" > requirement. What do you think? I've sent out: https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/ a while back which we drafted together specifically to documented extensible syscalls. I wanted to resend it this week. I'll update it with all register based flag arguments should be unsigned int which I already droned about this week in another thread. The no nested flags requirement can go in there as well. Aleksa and I also had a joint session for Plumbers/Ksummit planned hence the "handing something in". Christian _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 20:52 ` Kees Cook 2020-05-20 21:02 ` Christian Brauner @ 2020-05-22 4:06 ` Aleksa Sarai 2020-05-22 7:35 ` Christian Brauner 1 sibling, 1 reply; 16+ messages in thread From: Aleksa Sarai @ 2020-05-22 4:06 UTC (permalink / raw) To: Kees Cook; +Cc: Christian Brauner, ksummit [-- Attachment #1.1: Type: text/plain, Size: 2790 bytes --] On 2020-05-20, Kees Cook <keescook@chromium.org> wrote: > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote: > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > > > and maybe it does not get to see anything beyond just the "top level" > > > > struct (i.e. struct clone_args) and all pointers within THAT become > > > > opaque? That certainly simplifies the design. > > > > > > Exactly. I think that's the most common situation by far. Does anybody > > > really really need to care at a deep level, and why? > > > > We mostly don't and making all second-level pointers opaque is ok imho. > > That'll make things MUCH easier. :) To be clear, my insistence on the second-level pointers topic is coming from the view that we should make sure whatever model we use for the first iteration of deep argument inspection can be expanded to second-level pointers if we need them. The jump-table proposal I had was just an example of how we could plan out a design that could be implemented piece-meal (heck, we don't even need jump-tables in the first iteration -- so long as we have an idea for how they'd work). I also hasten to point out that if we make all second-level pointers opaque then you won't be able to filter clone3() based on ->set_tid. Now, maybe that's something nobody cares about, but it should be taken into consideration that one of the handful of "obvious" syscalls will already not be completely-filterable with second-level pointers being opaque. But if that's fine (at least for a first iteration), then I'm also okay with that. > > But I think that we need some documented consensus on all that stuff > > which I stressed in other mails before. I'll hand something in about > > this, if that's ok than we can hash this out. > > Aleksa, I know you had an entire presentation[1] on the extensible > argument syscalls, but was there any text-based design doc that you made? > > It would be really nice to update Documentation/process/adding-syscalls.rst > with the specifics[2], and to (now) include the "no nested flags" > requirement. What do you think? Christian and I wrote a patch for adding-syscalls last year[1], but Jon felt that it should require greater community consensus before it gets put into adding-syscalls. But yes, I'm definitely in favour of having this be a properly-documented aspect of new syscall design. [1]: https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/ -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] [-- Attachment #2: Type: text/plain, Size: 186 bytes --] _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-22 4:06 ` Aleksa Sarai @ 2020-05-22 7:35 ` Christian Brauner 2020-05-22 11:27 ` Christian Brauner 0 siblings, 1 reply; 16+ messages in thread From: Christian Brauner @ 2020-05-22 7:35 UTC (permalink / raw) To: Aleksa Sarai; +Cc: ksummit On Fri, May 22, 2020 at 02:06:06PM +1000, Aleksa Sarai wrote: > On 2020-05-20, Kees Cook <keescook@chromium.org> wrote: > > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote: > > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > > > > and maybe it does not get to see anything beyond just the "top level" > > > > > struct (i.e. struct clone_args) and all pointers within THAT become > > > > > opaque? That certainly simplifies the design. > > > > > > > > Exactly. I think that's the most common situation by far. Does anybody > > > > really really need to care at a deep level, and why? > > > > > > We mostly don't and making all second-level pointers opaque is ok imho. > > > > That'll make things MUCH easier. :) > > To be clear, my insistence on the second-level pointers topic is coming > from the view that we should make sure whatever model we use for the > first iteration of deep argument inspection can be expanded to > second-level pointers if we need them. The jump-table proposal I had was > just an example of how we could plan out a design that could be > implemented piece-meal (heck, we don't even need jump-tables in the > first iteration -- so long as we have an idea for how they'd work). > > I also hasten to point out that if we make all second-level pointers > opaque then you won't be able to filter clone3() based on ->set_tid. That's not an interesting second-level case. Either turn it on or off; base it on set_tid_size which tells you whether someone requested it or not. There's absolutely no reason to filter around in set_tid size ([1]). That was considered when adding this for checkpoint restore. You either allow someone that is sufficiently capable in the owning user namespace of each pid namespace it wants to select specific pids in or you simply deny it. That's not a great strawman. Interesting second level pointers are where you have second level pointers that can point to differnet things or multiple features at once based on an opaque switch. If you're looking for interesting second level pointers look at bpf(). One example, is just the BPF_OBJ_GET_INFO_BY_FD command wich passes a struct info which contains an fd and depending on what type of fd that is, info can be either struct bpf_prog_info, struct bpf_map_info, or struct bpf_btf_info some of which can have other third level pointers in there. [1]: There already wouldn't be any point to this if it were a first level pointer because you always need to determine the pid namespace hierarchy of the caller first to know whether or not you want to deny choosing a specific pid in a given namespace. That's nonsense. > Now, maybe that's something nobody cares about, but it should be taken > into consideration that one of the handful of "obvious" syscalls will > already not be completely-filterable with second-level pointers being > opaque. > > But if that's fine (at least for a first iteration), then I'm also okay > with that. > > > > But I think that we need some documented consensus on all that stuff > > > which I stressed in other mails before. I'll hand something in about > > > this, if that's ok than we can hash this out. > > > > Aleksa, I know you had an entire presentation[1] on the extensible > > argument syscalls, but was there any text-based design doc that you made? > > > > It would be really nice to update Documentation/process/adding-syscalls.rst > > with the specifics[2], and to (now) include the "no nested flags" > > requirement. What do you think? > > Christian and I wrote a patch for adding-syscalls last year[1], but Jon > felt that it should require greater community consensus before it gets > put into adding-syscalls. But yes, I'm definitely in favour of having > this be a properly-documented aspect of new syscall design. > > [1]: https://lore.kernel.org/linux-doc/20191002151437.5367-1-christian.brauner@ubuntu.com/ > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/> _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-22 7:35 ` Christian Brauner @ 2020-05-22 11:27 ` Christian Brauner 0 siblings, 0 replies; 16+ messages in thread From: Christian Brauner @ 2020-05-22 11:27 UTC (permalink / raw) To: Aleksa Sarai; +Cc: ksummit On Fri, May 22, 2020 at 09:35:35AM +0200, Christian Brauner wrote: > On Fri, May 22, 2020 at 02:06:06PM +1000, Aleksa Sarai wrote: > > On 2020-05-20, Kees Cook <keescook@chromium.org> wrote: > > > On Wed, May 20, 2020 at 10:24:01PM +0200, Christian Brauner wrote: > > > > On Wed, May 20, 2020 at 12:08:52PM -0700, Linus Torvalds wrote: > > > > > On Wed, May 20, 2020 at 12:04 PM Kees Cook <keescook@chromium.org> wrote: > > > > > > Perhaps the question is "how deeply does seccomp need to inspect?" > > > > > > and maybe it does not get to see anything beyond just the "top level" > > > > > > struct (i.e. struct clone_args) and all pointers within THAT become > > > > > > opaque? That certainly simplifies the design. > > > > > > > > > > Exactly. I think that's the most common situation by far. Does anybody > > > > > really really need to care at a deep level, and why? > > > > > > > > We mostly don't and making all second-level pointers opaque is ok imho. > > > > > > That'll make things MUCH easier. :) > > > > To be clear, my insistence on the second-level pointers topic is coming > > from the view that we should make sure whatever model we use for the > > first iteration of deep argument inspection can be expanded to > > second-level pointers if we need them. The jump-table proposal I had was > > just an example of how we could plan out a design that could be > > implemented piece-meal (heck, we don't even need jump-tables in the > > first iteration -- so long as we have an idea for how they'd work). > > > > I also hasten to point out that if we make all second-level pointers > > opaque then you won't be able to filter clone3() based on ->set_tid. > > That's not an interesting second-level case. Either turn it on or off; > base it on set_tid_size which tells you whether someone requested it or > not. There's absolutely no reason to filter around in set_tid size ([1]). > That was considered when adding this for checkpoint restore. You either > allow someone that is sufficiently capable in the owning user namespace > of each pid namespace it wants to select specific pids in or you simply > deny it. That's not a great strawman. > > Interesting second level pointers are where you have second level > pointers that can point to differnet things or multiple features at once > based on an opaque switch. If you're looking for interesting second > level pointers look at bpf(). One example, is just the > BPF_OBJ_GET_INFO_BY_FD command wich passes a struct info which contains > an fd and depending on what type of fd that is, info can be either > struct bpf_prog_info, struct bpf_map_info, or struct bpf_btf_info some > of which can have other third level pointers in there. Other examples include (possibly epoll_ctl's struct epoll_event), struct iovec in general, {get,set}_robust_list(), kexec_load()'s struct kexec_segment, and sendmsg()'s and recvmsg()'s sruct msghdr with a large number of additional substructs passed through passing a struct iovec. Most of these I reckon are uninteresting and will just in general be not allowed if there's a problem. > > [1]: There already wouldn't be any point to this if it were a first > level pointer because you always need to determine the pid > namespace hierarchy of the caller first to know whether or not you > want to deny choosing a specific pid in a given namespace. That's > nonsense. _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 19:04 ` Kees Cook 2020-05-20 19:08 ` Linus Torvalds @ 2020-05-20 22:12 ` Alexei Starovoitov 2020-05-20 23:39 ` Kees Cook 1 sibling, 1 reply; 16+ messages in thread From: Alexei Starovoitov @ 2020-05-20 22:12 UTC (permalink / raw) To: Kees Cook; +Cc: bpf, ksummit On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote: > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote: > > Don't make this some kind of abstract conceptual problem thing. > > Because it's not. > > I have no intention of making this abstract (the requests for expanding > seccomp coverage have been for only a select class of syscalls, and > specifically clone3 and openat2) nor more complicated than it needs to be > (I regularly resist expanding the seccomp BPF dialect into eBPF). Kees, since you've forked the thread I'm adding bpf mailing list back and re-iterating my point: ** Nack to cBPF extensions ** How that is relevant? You're proposing to add copy_from_user() to selected syscalls, like clone3, and present large __u32 array to cBPF program. In other words existing fixed sized 'struct seccomp_data' will become either variable length or jumbo fixed size like one page. In the fomer case it would mean that cBPF would need to be extended with variable length logic. Which in turn means it will suffer from spectre v1 issues. We've spent a lot of time fixing spectre v1 issues with eBPF. Including teaching the verifier to recognize speculative patterns inside the programs so that malicious bpf progs trying to exploit spec v1 will be caught at load time. There is no other tool (compiler or static analysis) that can do similar analysis. I suggest that you look into what eBPF is actually doing instead of trying to reinvent the wheel. If you go with latter approach of presenting cBPF with giant 'struct seccomp_data + page' that extra page would need to be zeroed out before invocation of bpf program which will make seccomp even less usable that it is today. Currently it's slow and unusable in production datacenter. People suggested for years to adopt eBPF in seccomp to accelerate it, but, as you confessed, you resisted and sounds like now you want to implement seccomp specific syscall bitmask? Which means more kernel code, more bugs, more security issues. imo that's another reinvented wheel when eBPF can do it already. I don't think it's a good idea to add kernel code when eBPF-based solution exists and capable of examining any level of nested args. > Perhaps the question is "how deeply does seccomp need to inspect?" > and maybe it does not get to see anything beyond just the "top level" > struct (i.e. struct clone_args) and all pointers within THAT become > opaque? That certainly simplifies the design. clone3's 'struct clone_args' has set_tid pointer as a second level. I don't think that sticking to first level of pointers for this particular syscall will make seccomp filtering any more practical. _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 22:12 ` Alexei Starovoitov @ 2020-05-20 23:39 ` Kees Cook 2020-05-21 0:43 ` Alexei Starovoitov 0 siblings, 1 reply; 16+ messages in thread From: Kees Cook @ 2020-05-20 23:39 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: bpf, ksummit On Wed, May 20, 2020 at 03:12:56PM -0700, Alexei Starovoitov wrote: > On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote: > > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote: > > > Don't make this some kind of abstract conceptual problem thing. > > > Because it's not. > > > > I have no intention of making this abstract (the requests for expanding > > seccomp coverage have been for only a select class of syscalls, and > > specifically clone3 and openat2) nor more complicated than it needs to be > > (I regularly resist expanding the seccomp BPF dialect into eBPF). > > Kees, since you've forked the thread I'm adding bpf mailing list back and > re-iterating my point: > ** Nack to cBPF extensions ** Yes, I know. I agreed[1] with you on this point. > How that is relevant? > You're proposing to add copy_from_user() to selected syscalls, like clone3, > and present large __u32 array to cBPF program. > In other words existing fixed sized 'struct seccomp_data' will become > either variable length or jumbo fixed size like one page. > In the fomer case it would mean that cBPF would need to be extended > with variable length logic. Which in turn means it will suffer from > spectre v1 issues. I don't expect to need to do anything with variable lengths in the seccomp BPF dialect. As I said in the other thread, if we are faced with design trade-offs that require extending the seccomp filter language, we would switch to eBPF. > If you go with latter approach of presenting cBPF with giant > 'struct seccomp_data + page' that extra page would need to be zeroed out > before invocation of bpf program which will make seccomp even less usable > that it is today. Currently it's slow and unusable in production datacenter. Making universal declarations based on your opinion does not help convince people of your position. Saying it's "unusable in production datacenter" is perhaps true for you, but hardly true for the many datacenters that do use it. Additionally, we're obviously not interested in making seccomp _slower_. The entire point of an investigation of the design is to examine our options and find the right solution. > People suggested for years to adopt eBPF in seccomp to accelerate it, > but, as you confessed, you resisted and sounds like now you want to > implement seccomp specific syscall bitmask? Yes -- because it's an order of magnitude faster than even a single instruction BPF seccomp filter. The vast majority of seccomp filters need nothing more than a single yes/no, and right now the bulk of processing time is spent running the BPF filter. I would prefer to avoid BPF entirely where possible for seccomp. > Which means more kernel code, more bugs, more security issues. Right. This is a solid design principle, and one I agree with: avoid adding code, keep things simple, everything will have bugs. And, as it stands, seccomp has had a significantly safer history than eBPF, largely due to its goal of staying as utterly small and simple as possible. I don't intend to discard that stance, and it's why I would rather continue to shield seccomp from the regularly occurring eBPF flaws. > imo that's another reinvented wheel when eBPF can do it already. I don't think > it's a good idea to add kernel code when eBPF-based solution exists and capable > of examining any level of nested args. Thanks to the neighboring thread here, the requirements no longer[2] include nested args. Also, you're mixing bitmasks (to accelerate the overwhelmingly common case) with the deep argument inspection (which is a rare but needed case). > > Perhaps the question is "how deeply does seccomp need to inspect?" > > and maybe it does not get to see anything beyond just the "top level" > > struct (i.e. struct clone_args) and all pointers within THAT become > > opaque? That certainly simplifies the design. > > clone3's 'struct clone_args' has set_tid pointer as a second level. > I don't think that sticking to first level of pointers for this particular > syscall will make seccomp filtering any more practical. Yup, we all agree. :) -Kees [1] https://lore.kernel.org/lkml/202005191434.57253AD@keescook/ [2] https://lore.kernel.org/ksummit-discuss/20200520221256.tzqkjpeswv3d6ne2@ast-mbp.dhcp.thefacebook.com/T/#m01a045c8715cfff8399ba86171039110befecbcf -- Kees Cook _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Ksummit-discuss] [TECH TOPIC] seccomp feature development 2020-05-20 23:39 ` Kees Cook @ 2020-05-21 0:43 ` Alexei Starovoitov 0 siblings, 0 replies; 16+ messages in thread From: Alexei Starovoitov @ 2020-05-21 0:43 UTC (permalink / raw) To: Kees Cook; +Cc: bpf, ksummit On Wed, May 20, 2020 at 04:39:20PM -0700, Kees Cook wrote: > On Wed, May 20, 2020 at 03:12:56PM -0700, Alexei Starovoitov wrote: > > On Wed, May 20, 2020 at 12:04:04PM -0700, Kees Cook wrote: > > > On Wed, May 20, 2020 at 11:27:03AM -0700, Linus Torvalds wrote: > > > > Don't make this some kind of abstract conceptual problem thing. > > > > Because it's not. > > > > > > I have no intention of making this abstract (the requests for expanding > > > seccomp coverage have been for only a select class of syscalls, and > > > specifically clone3 and openat2) nor more complicated than it needs to be > > > (I regularly resist expanding the seccomp BPF dialect into eBPF). > > > > Kees, since you've forked the thread I'm adding bpf mailing list back and > > re-iterating my point: > > ** Nack to cBPF extensions ** > > Yes, I know. I agreed[1] with you on this point. > > > How that is relevant? > > You're proposing to add copy_from_user() to selected syscalls, like clone3, > > and present large __u32 array to cBPF program. > > In other words existing fixed sized 'struct seccomp_data' will become > > either variable length or jumbo fixed size like one page. > > In the fomer case it would mean that cBPF would need to be extended > > with variable length logic. Which in turn means it will suffer from > > spectre v1 issues. > > I don't expect to need to do anything with variable lengths in the > seccomp BPF dialect. As I said in the other thread, if we are faced with > design trade-offs that require extending the seccomp filter language, we > would switch to eBPF. > > > If you go with latter approach of presenting cBPF with giant > > 'struct seccomp_data + page' that extra page would need to be zeroed out > > before invocation of bpf program which will make seccomp even less usable > > that it is today. Currently it's slow and unusable in production datacenter. > > Making universal declarations based on your opinion does not help > convince people of your position. Saying it's "unusable in production > datacenter" is perhaps true for you, but hardly true for the many > datacenters that do use it. The datacenter that went with full bypass of kernel storage and networking subsystems where application don't do many syscalls per second any more ? Sure. In such cases extreme seccomp overhead is irrelevant. Just like kpti and retpoline overhead. > Additionally, we're obviously not interested in making seccomp _slower_. > The entire point of an investigation of the design is to examine our > options and find the right solution. > > > People suggested for years to adopt eBPF in seccomp to accelerate it, > > but, as you confessed, you resisted and sounds like now you want to > > implement seccomp specific syscall bitmask? > > Yes -- because it's an order of magnitude faster than even a single > instruction BPF seccomp filter. The vast majority of seccomp filters need > nothing more than a single yes/no, and right now the bulk of processing > time is spent running the BPF filter. I would prefer to avoid BPF > entirely where possible for seccomp. Are you running in interpreted mode? Otherwise above statement is nonsense or you haven't done eBPF benchmarking in long time. JITed eBPF has exact same speed as kernel C code. Even extreme use cases of bpf programs with single 'return 0' instruction are being optimized into minimal number of native instructions. So with above you're saying that giant bitmask of syscalls in C code is faster than 'int foo() { return 0; }' in C code ? Simply absurd. You realize that eBPF is processing tens of millions of packets per second and folks measure the overhead in nanoseconds. Indirect call and retpoline overhead was removed from a lot bpf use cases in networking specifically because every nanosecond counts. > > > Which means more kernel code, more bugs, more security issues. > > Right. This is a solid design principle, and one I agree with: avoid > adding code, keep things simple, everything will have bugs. And, as it > stands, seccomp has had a significantly safer history than eBPF, largely > due to its goal of staying as utterly small and simple as possible. I > don't intend to discard that stance, and it's why I would rather continue > to shield seccomp from the regularly occurring eBPF flaws. It's subjectively safer. I argue it's enjoying smaller bug rate because syzbots are not looking into it and it's not being actively developed. In the year 2020 there were three verifier bugs that could have been exploited through unpriv. All three were found by new kBdysch fuzzer. In 2019 there was nothing. Not because people didn't try, but because syzbot reached its limit. The pace of bpf development is accelerating. There will be more bugs found and introduced in the verifier. Yet that doesn't stop folks to use eBPF to secure the datacenters. The JITs are also being developed. There are bugs in them and they affect both cBPF and eBPF. If you're worried about that it's probably time to get rid of cBPF from seccomp. Invent your own syscall processing thingy, fix bugs and stop adding new features. That's the only way to reduce the bug rate. We're not going to slow down JITs development because of seccomp. _______________________________________________ Ksummit-discuss mailing list Ksummit-discuss@lists.linuxfoundation.org https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2020-05-22 11:27 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-20 16:17 [Ksummit-discuss] [TECH TOPIC] seccomp feature development Kees Cook 2020-05-20 16:31 ` Al Viro 2020-05-20 18:05 ` Kees Cook 2020-05-20 18:16 ` Al Viro 2020-05-20 18:27 ` Linus Torvalds 2020-05-20 19:04 ` Kees Cook 2020-05-20 19:08 ` Linus Torvalds 2020-05-20 20:24 ` Christian Brauner 2020-05-20 20:52 ` Kees Cook 2020-05-20 21:02 ` Christian Brauner 2020-05-22 4:06 ` Aleksa Sarai 2020-05-22 7:35 ` Christian Brauner 2020-05-22 11:27 ` Christian Brauner 2020-05-20 22:12 ` Alexei Starovoitov 2020-05-20 23:39 ` Kees Cook 2020-05-21 0:43 ` Alexei Starovoitov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).