On 2018-11-23, Jürg Billeter wrote:
> On Tue, 2018-11-13 at 01:26 +1100, Aleksa Sarai wrote:
> > * O_BENEATH: Disallow "escapes" from the starting point of the
> >   filesystem tree during resolution (you must stay "beneath" the
> >   starting point at all times). Currently this is done by disallowing
> >   ".." and absolute paths (either in the given path or found during
> >   symlink resolution) entirely, as well as all "magic link" jumping.
>
> With open_tree(2) and OPEN_TREE_CLONE, will O_BENEATH still be
> necessary? As I understand it, O_BENEATH could be replaced by a much
> simpler flag that only disallows absolute paths (incl. absolute
> symlinks). And it would have the benefit that you can actually pass the
> tree/directory fd to another process and escaping would not be possible
> even if that other process doesn't use O_BENEATH (after calling
> mount_setattr(2) to make sure it's locked down).
>
> This approach would also make it easy to restrict writes via a cloned
> tree/directory fd by marking it read-only via mount_setattr(2) (and
> locking down the read-only flag). This would again be especially useful
> when passing tree/directory fds across processes, or for voluntary
> self-lockdown within a process for robustness against security bugs.
>
> This wouldn't affect any of the other flags in this patch. And for full
> equivalence to O_BENEATH you'd have to use O_NOMAGICLINKS in addition
> to O_NOABSOLUTE, or whatever that new flag would be called.
>
> Or is OPEN_TREE_CLONE too expensive for this use case? Or is there
> anything else I'm missing?

OPEN_TREE_CLONE currently requires CAP_SYS_ADMIN in mnt_ns->user_ns,
which requires a fork and unshare -- or at least a vfork and some other
magic -- at which point we're back to just doing a pivot_root(2) for
most operations.

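
For reference, the lexical part of the O_BENEATH rule quoted above (no
absolute paths, no ".." components) can be sketched in userspace roughly
as follows. This is only an illustration: path_is_lexically_beneath() is
a hypothetical helper, not something from the patch set. It cannot see
symlinks or magic links, and it is racy against concurrent renames, so it
is no substitute for enforcing the restriction during in-kernel path
resolution.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch set): a purely lexical
 * approximation of the O_BENEATH rule described in the quoted text.
 * It rejects absolute paths and any ".." component. Symlinks and
 * magic links are invisible at this level, so this only illustrates
 * the rule; it does not enforce it.
 */
static bool path_is_lexically_beneath(const char *path)
{
    if (path == NULL || path[0] == '\0')
        return false;               /* treat empty input as invalid */
    if (path[0] == '/')
        return false;               /* absolute path */

    const char *p = path;
    while (*p != '\0') {
        /* Skip any run of separators ("a//b" is the same as "a/b"). */
        while (*p == '/')
            p++;

        /* Length of the next component, up to the next '/' or NUL. */
        size_t len = strcspn(p, "/");

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;           /* ".." component */

        p += len;
    }
    return true;
}
```

Note that a name such as "..b" is accepted, since only a component that
is exactly ".." is rejected.
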
I think open_tree(2) -- which I really should sit down and play around
with -- would be an interesting way of doing O_BENEATH, but I think
you'd still need to have the same race protections we have in the
current O_BENEATH proposal to handle "..". So really you'd be using
open_tree(OPEN_TREE_CLONE) just so that you can use the "path.mnt"
setting code, which I'm not sure is the best way of doing it (plus the
other interesting ideas which you get with the other mount API changes).

But I am quite hopeful for what cool things we'll be able to make using
the new mount API.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH