On 2020-07-15, Kees Cook wrote:
> Earlier Andy Lutomirski wrote:
> > Let’s add some seccomp folks. We probably also want to be able to run
> > seccomp-like filters on io_uring requests. So maybe io_uring should call into
> > seccomp-and-tracing code for each action.
>
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
>
> It *seems* like there is a really clean mapping of SQE OPs to syscalls.
> To that end, yes, it should be trivial to add ptrace and seccomp support
> (sort of). The trouble comes for doing _interception_, which is how both
> ptrace and seccomp are designed.
>
> In the basic case of seccomp, various syscalls are just being checked
> for accept/reject. It seems like that would be easy to wire up. For the
> more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
> I think any such results would need to be "upgraded" to "reject". Things
> are a bit complex in that seccomp's form of "reject" can be "return
> errno" (easy) or it can be "kill thread (or thread_group)" which ...
> becomes less clear. (More on this later.)
>
> In the basic case of "I want to run strace", this is really just a
> creative use of ptrace in that interception is being used only for
> reporting. Does ptrace need to grow a way to create/attach an io_uring
> eventfd? Or should there be an entirely different tool for
> administrative analysis of io_uring events (kind of how disk IO can be
> monitored)?

I would hope that we wouldn't introduce ptrace to io_uring, because
unless we plan to attach to io_uring events via GDB it's simply the
wrong tool for the job. strace does use ptrace, but that's mostly
because Linux's dynamic tracing was still in its infancy at the time
(and even today it requires more privileges than ptrace) -- but you can
emulate strace using bpftrace these days fairly easily.

So really what is being asked here is "can we make it possible to debug
io_uring programs as easily as traditional I/O programs". And this does
not require ptrace, nor should ptrace be part of this discussion IMHO. I
believe this issue (along with seccomp-style filtering) has been
mentioned informally in the past, but I am happy to finally see a thread
about this appear.

> For io_uring generally, I have a few comments/questions:
>
> - Why did a new syscall get added that couldn't be extended? All new
>   syscalls should be using Extended Arguments. :(

io_uring was introduced in Linux 5.1, predating clone3() and openat2().
My larger concern is that io_uring operations aren't extensible-structs
-- but we can resolve that issue with some slight ugliness if we ever
run into the problem.

> - Why aren't the io_uring syscalls in the man-page git? (It seems like
>   they're in liburing, but that's should document the _library_ not the
>   syscalls, yes?)

I imagine because using the syscall requires specific memory barriers
which we probably don't want most C programs to be fiddling with
directly. Sort of similar to how iptables doesn't have a syscall-style
man page.

> Speaking to Stefano's proposal[1]:
>
> - There appear to be three classes of desired restrictions:
>   - opcodes for io_uring_register() (which can be enforced entirely with
>     seccomp right now).
>   - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
>     not currently written)
>   - opcodes of the types of restrictions to restrict... for making sure
>     things can't be changed after being set? seccomp already enforces
>     that kind of "can only be made stricter"

Unless I misunderstood the patch cover-letter, Stefano's proposal is to
have a mechanism for adding restrictions to individual io_urings -- so
we still need a separate mechanism (or an extended version of Stefano's
proposal) to allow for the "reduce attack surface" usecase of seccomp.

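For illustration, the first class in the list quoted above (filtering
io_uring_register() opcodes with plain seccomp) might look roughly like
the sketch below. It is not taken from Stefano's patches: it assumes
x86-64, hard-codes the values of IORING_REGISTER_BUFFERS and
IORING_UNREGISTER_BUFFERS, picks an arbitrary example policy, and the
names register_filter and install_register_filter are made up for this
example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427      /* x86-64 syscall number */
#endif

/* Assumed values, copied from the enum in <linux/io_uring.h>. */
#define OP_REGISTER_BUFFERS   0         /* IORING_REGISTER_BUFFERS */
#define OP_UNREGISTER_BUFFERS 1         /* IORING_UNREGISTER_BUFFERS */
#define X32_BIT 0x40000000U             /* set in nr for x32 ABI calls */

/*
 * Example policy: io_uring_register() may only (un)register buffers;
 * every other opcode fails with EPERM. All other syscalls are allowed.
 * The opcode is the second argument, i.e. args[1]; only its low 32 bits
 * are loaded, which is enough for an unsigned int on little-endian.
 */
struct sock_filter register_filter[] = {
    /* 0-2: kill anything that is not a native x86-64 syscall */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* 3-5: kill x32 calls so they cannot sidestep the number check */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* 6-7: anything other than io_uring_register() is allowed */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_register, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* 8-12: allow the two buffer opcodes, EPERM for the rest */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_REGISTER_BUFFERS, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, OP_UNREGISTER_BUFFERS, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define REGISTER_FILTER_LEN (sizeof(register_filter) / sizeof(register_filter[0]))

/* The relative jump offsets above assume exactly this many instructions. */
static_assert(REGISTER_FILTER_LEN == 13, "jump offsets need updating");

/* Install the filter on the calling thread; returns 0 on success. */
int install_register_filter(void)
{
    struct sock_fprog prog = {
        .len = REGISTER_FILTER_LEN,
        .filter = register_filter,
    };

    /* Required for unprivileged callers before attaching a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A process would call install_register_filter() once before handing
control to untrusted code. Note that nothing here sees the opcodes
carried in SQEs, since those never pass through the syscall entry path
that seccomp hooks; that is the second class in the quoted list.
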
It seems to me like Stefano's proposal is more related to cases where
you might SCM_RIGHTS-send an io_uring to an unprivileged process.

> Solving the mapping of seccomp interception types into CQEs (or anything
> more severe) will likely inform what it would mean to map ptrace events
> to CQEs. So, I think they're related, and we should get seccomp hooked
> up right away, and that might help us see how (if) ptrace should be
> attached.

We could just emulate the seccomp-bpf API with the pseudo-syscalls done
as a result of CQEs, though I'm not sure how happy folks will be with
this kind of glue code in "seccomp-uring" (though in theory it would
allow us to attach existing filters to io_uring...).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH