On 2019-05-25, Linus Torvalds wrote: > In fact, I think resolveat() as a model is fundamentally wrong for yet > another reason: O_CREAT. If you want to _create_ a new file, and you > want to still have the path resolution modifiers in place, the > resolveat() model is broken, because it only gives you path resolution > for the lookup, and then when you do openat(O_CREAT) for the final > component, you now don't have any way to limit that last component. > > Sure, you can probably effectively hack around it with resolveat() on > everything but the last component, and then > openat(O_CREAT|O_EXCL|O_NOFOLLOW) on the last component, because the > O_EXCL|O_NOFOLLOW should mean that it won't do any of the unsafe > things. And then (if you didn't actually want the O_EXCL), you handle > the race between "somebody else got there first" by re-trying etc. So > I suspect the O_CREAT thing could be worked around with extra > complexity, but it's an example of how the O_PATH model really screws > you over. > > End result: I really think resolveat() is broken. It absolutely > *needs* to be a full-fledged "openat()" system call, just with added > path resolution flags. That's a very good point. I was starting to work on O_CREAT via resolveat(2) and it definitely was much harder than most people would be happy to deal with -- I'm not even sure that I handled all the cases. I'll go for an openat2(2) then. Thinking about it some more -- since it's a new syscall, I could actually implement the O_PATH link-mode as being just the regular mode argument (since openat(2) ignores the mode for non-O_CREAT anyway). The openat(2) wrapper might be more than one-line as a result but it should avoid polluting the resolution flags with mode flags (since openat(O_PATH) needs to have a g+rwx mode for backwards-compatibility). > > openat2(rootfd, "path/to/file", O_PATH, RESOLVE_IN_ROOT, &statbuf); > > Note that for O_CREAT, it either needs the 'mode' parameter too, or > the statbuf would need to be an in-out thing. I think the in-out model > might be nice (the varargs model with a conditional 'mode' parameter > is horrid, I think), but at some point it's just bike-shedding. > > Also, I'm not absolutely married to the statbuf, but I do think it > might be a useful extension. A *lot* of users need the size of the > file for subsequent mmap() calls (or for buffer management for > read/write interface) or for things like just headers (ie > "Content-length:" in html etc). > > I'm not sure people actually want a full 'struct stat', but people > historically also do st_ino/st_dev for indexing into existing > user-space caches (or to check permissions like that it's on the right > filesystem, but the resolve flags kind of make that less of an issue). > And st_mode to verify that it's a proper regular file etc etc. > > You can always just leave it as NULL if you don't care, there's almost > no downside to adding it, and I do think that people who want a "walk > pathname carefully" model almost always would want to also check the > end result anyway. Yeah, I agree -- most folks would want to double-check what they've opened or O_PATH'd. Though I'm still not clear what is the best way of doing the "stat" argument is -- especially given how much fun architecture-specific shenanigans are in fs/stat.c (should I only use cp_new_stat64 or have a separate 64-bit syscall). > > Is there a large amount of overhead or downside to the current > > open->fstat way of doing things, or is this more of a "if we're going to > > add more ways of opening we might as well add more useful things"? > > Right now, system calls are sadly very expensive on a lot of hardware. > We used to be very proud of the fact that Linux system calls were > fast, but with meltdown and retpoline etc, we're back to "system calls > can be several thousand cycles each, just in overhead, on commonly > available hardware". > > Is "several thousand cycles" fatal? Not necessarily. But I think that > if we do add a new system call, particularly a fancy one for special > security-conscious models, we should look at what people need and use, > and want. And performance should always be a concern. > > I realize that people who come at this from primarily just a security > issue background may think that security is the primary goal. But no. > Security always needs to realize that the _primary_ goal is to have > people _use_ it. Without users, security is entirely pointless. And > the users part is partly performance, but mostly "it's convenient". Yup, I agree. I was hoping to shunt most of the convenience to userspace to avoid ruffling feathers in VFS-land, but I'm more than happy to make the kernel ABI more convenient. > The whole "this is Linux-specific" is a big inconvenience point I hope that other OSes will take our lead and have a similar interface so that the particular inconvenience can go away eventually (this was one of the arguments for resolveat(2) -- it's a clear "this is a new idea" interface rather than mixing it with other O_* flags). > Talking about securely opening things - another flag that we may want > to avoid issues is a "don't open device nodes" flag. Sure, O_NONBLOCK > plus checking the st_mode of the result is usually sufficient, but > it's actually fairly easy to get that wrong. Things like /dev/tty and > /dev/zero often need to be available for various reasons, and have > been used to screw careless "open and read" users up that missed a > check. Sure, this could be added -- though I'm sure folks would have disagreements over whether it should be a resolution flag on an open flag. > I also do wonder that if the only actual user-facing interface for the > resolution flags is a new system call, should we not make the > *default* value be "don't open anything odd at all". > > So instead of saying RESOLVE_XDEV for "don't cross mount points", > maybe the flags should be the other way around, and say "yes, allow > mount point crossings", and "yes, explicitly allow device node > opening", and "yes, allow DOTDOT" etc. This seems like a reasonable default, except for the cases of RESOLVE_BENEATH and RESOLVE_IN_ROOT (that would make using AT_FDCWD more cumbersome than necessary). But other than that, I'd be happy to invert all the other flags. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH