* shiftfs status and future development @ 2018-06-14 18:44 Seth Forshee 2018-06-15 13:56 ` Serge E. Hallyn ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-14 18:44 UTC (permalink / raw) To: James Bottomley, containers, linux-fsdevel Cc: Stéphane Graber, Christian Brauner, Tyler Hicks I wanted to inquire about the current status of shiftfs and the plans for it moving forward. We'd like to have this functionality available for use in lxd, and I'm interested in helping with development (or picking up development if it's stalled). To start, is anyone still working on shiftfs or similar functionality? I haven't found it in any git tree on kernel.org, and as far as mailing list activity the last submission I can find is [1]. Is there anything newer than this? Based on past mailing list discussions, it seems like there was still debate as to whether this feature should be an overlay filesystem or something supported at the vfs level. Was this ever resolved? Thanks, Seth [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-14 18:44 shiftfs status and future development Seth Forshee @ 2018-06-15 13:56 ` Serge E. Hallyn 2018-06-15 14:59 ` Seth Forshee 2018-06-15 14:54 ` Aleksa Sarai 2018-06-15 15:28 ` James Bottomley 2 siblings, 1 reply; 33+ messages in thread From: Serge E. Hallyn @ 2018-06-15 13:56 UTC (permalink / raw) To: Seth Forshee Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks, Christian Brauner Quoting Seth Forshee (seth.forshee@canonical.com): > I wanted to inquire about the current status of shiftfs and the plans > for it moving forward. We'd like to have this functionality available > for use in lxd, and I'm interesetd in helping with development (or > picking up development if it's stalled). > > To start, is anyone still working on shiftfs or similar functionality? I > haven't found it in any git tree on kernel.org, and as far as mailing > list activity the last submission I can find is [1]. Is there anything > newer than this? > > Based on past mailing list discussions, it seems like there was still > debate as to whether this feature should be an overlay filesystem or > something supported at the vfs level. Was this ever resolved? > > Thanks, > Seth > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com Hey Seth, I haven't heard anything in a long time. But if this is going to pick back up, can we come up with a detailed set of goals and requirements? I don't recall whether the last version still worked like this, but I'm still not comfortable with the idea of a system where after a reboot, container-created root-owned files are owned by host root until a path is specially marked. Enforcing that the "source" directory is itself uid-shifted would greatly ease my mind. -serge ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 13:56 ` Serge E. Hallyn @ 2018-06-15 14:59 ` Seth Forshee 2018-06-15 15:25 ` Matthew Wilcox 2018-06-16 3:03 ` James Bottomley 0 siblings, 2 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-15 14:59 UTC (permalink / raw) To: Serge E. Hallyn Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote: > Quoting Seth Forshee (seth.forshee@canonical.com): > > I wanted to inquire about the current status of shiftfs and the plans > > for it moving forward. We'd like to have this functionality available > > for use in lxd, and I'm interested in helping with development (or > > picking up development if it's stalled). > > > > To start, is anyone still working on shiftfs or similar functionality? I > > haven't found it in any git tree on kernel.org, and as far as mailing > > list activity the last submission I can find is [1]. Is there anything > > newer than this? > > > > Based on past mailing list discussions, it seems like there was still > > debate as to whether this feature should be an overlay filesystem or > > something supported at the vfs level. Was this ever resolved? > > > > Thanks, > > Seth > > > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com > > Hey Seth, > > I haven't heard anything in a long time. But if this is going to pick > back up, can we come up with a detailed set of goals and requirements? I was planning to follow up later with some discussion of requirements. Here are some of ours: - Supports any id maps possible for a user namespace - Does not break inotify - Passes accurate disk usage and source information from the "underlay" - Works with a variety of filesystems (ext4, xfs, btrfs, etc.) - Works with nested containers I'm also interested in collecting any requirements others might have. 
> I don't recall whether the last version still worked like this, but I'm > still not comfortable with the idea of a system where after a reboot, > container-created root-owned files are owned by host root until a path > is specially marked. Enforcing that the "source" directory is itself > uid-shifted would greatly ease my mind. I understand the concern and share the discomfort to some degree, but I'm not convinced that requiring the source subtree be shifted is the right approach. First, let's address the marking question. As you stated, an approach that leaves the subtree unmarked for a period of time is problematic, and imo this is a fatal flaw with marking as a protection for e.g. execing some suid root file written by a container. Writing some such mark to the filesystem would make it persistent, but it could also restrict support to a limited set of filesystems. However, I do think it's necessary for a user with sufficient capabilities to "bless" a subtree for mounting in a less privileged context, so this is a feature of marking that I would like to keep. I think the new mount apis in David Howells' filesystem context patches [1] might give us a nicer way to do this. For example, root in init_user_ns could set up a mount fd which specifies the source subtree for the id shift. At that time the kernel could check for ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem containing the source subtree. Then the fd could be passed to a container in a user namespace, who could use it to attach the mount to its filesystem tree. The same concept could be extended to nested containers, as long as the user setting the source subtree has CAP_SYS_ADMIN towards sb->s_user_ns for the subtree. Now back to requiring that the source subtree be id shifted. I understand the motivation for wanting this, but I'm not sure I'm in favor of it. 
To start, there are other ways to ensure that id shifted mounts don't lead to problems, such as putting the subtree under a directory accessible only by root or putting it in a nosuid or noexec mount. For some implementations those sorts of protections are going to make sense. Having this requirement may also add significant time to mounting, as I assume it would involve iterating through all filesystem objects. Additionally, that requirement is likely to significantly complicate the implementation. The simplest implementation would just translate the k[ug]ids in the inodes to a target user ns. A slightly more complicated approach might translate them based on a source and destination user ns. If it's implemented based on passing in an arbitrary id map at mount time it will be more complex and duplicate functionality that user namespaces already give us. Thanks, Seth [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.stgit@warthog.procyon.org.uk ^ permalink raw reply [flat|nested] 33+ messages in thread
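[Editorial illustration] The "simplest implementation" Seth sketches above — translating the k[ug]ids in the inodes into a target user namespace — follows the same arithmetic the kernel applies to /proc/<pid>/uid_map entries. A minimal sketch of that translation, assuming an illustrative map format of (inside_start, outside_start, count) triples; the function name is hypothetical, not shiftfs code:

```python
# Translate a host-side (kernel) uid into the uid a user namespace
# would see, using uid_map-style entries: (inside_start, outside_start,
# count), as written to /proc/<pid>/uid_map.

def map_uid_into_ns(uid_map, host_uid):
    """Return the in-namespace uid for host_uid, or None if unmapped.

    (The kernel surfaces an unmapped id as the overflow uid, usually
    65534, rather than None; None keeps the sketch simple.)
    """
    for inside_start, outside_start, count in uid_map:
        if outside_start <= host_uid < outside_start + count:
            return inside_start + (host_uid - outside_start)
    return None

# Example: a container whose root (uid 0) is host uid 100000, with a
# 65536-id range -- a typical lxd-style allocation.
uid_map = [(0, 100000, 65536)]
print(map_uid_into_ns(uid_map, 100000))  # container root -> 0
print(map_uid_into_ns(uid_map, 123456))  # -> 23456
print(map_uid_into_ns(uid_map, 0))       # host root is unmapped -> None
```

The "source and destination user ns" variant Seth mentions would apply one such map in reverse and then another forward, rather than a single lookup.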
* Re: shiftfs status and future development 2018-06-15 14:59 ` Seth Forshee @ 2018-06-15 15:25 ` Matthew Wilcox 2018-06-15 15:56 ` Aleksa Sarai 2018-06-15 16:09 ` James Bottomley 2018-06-16 3:03 ` James Bottomley 1 sibling, 2 replies; 33+ messages in thread From: Matthew Wilcox @ 2018-06-15 15:25 UTC (permalink / raw) To: Seth Forshee Cc: Serge E. Hallyn, James Bottomley, containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Fri, Jun 15, 2018 at 09:59:17AM -0500, Seth Forshee wrote: > - Supports any id maps possible for a user namespace Have we already ruled out storing the container's UID/GID/perms in an extended attribute, and having all the files owned by the owner of the container from the perspective of the unshifted fs? Then shiftfs reads the xattr and presents the files with the container's idea of what the UID is? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 15:25 ` Matthew Wilcox @ 2018-06-15 15:56 ` Aleksa Sarai 2018-06-15 16:09 ` James Bottomley 1 sibling, 0 replies; 33+ messages in thread From: Aleksa Sarai @ 2018-06-15 15:56 UTC (permalink / raw) To: Matthew Wilcox Cc: Seth Forshee, containers, James Bottomley, Tyler Hicks, Christian Brauner, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1310 bytes --] On 2018-06-15, Matthew Wilcox <willy@infradead.org> wrote: > > - Supports any id maps possible for a user namespace > > Have we already ruled out storing the container's UID/GID/perms in an > extended attribute, and having all the files owned by the owner of the > container from the perspective of the unshifted fs. Then shiftfs reads > the xattr and presents the files with the container's idea of what the > UID is? I think, while simple, this idea has the problem that you couldn't really have a single directory be shifted more than once without copying it (or using an overlayfs which is then shiftfs'd). So for the usecase of giving each container on a system a unique allocation of host uids and gids (while using the same image storage) you would run into some issues. It does remind me of something similar we do as part of the "rootless containers" project -- we have "user.rootlesscontainers" which contains a protobuf payload with the "owner" information. Though in rootless containers we are using this xattr for something quite different: faking chown(2) and similar operations to make it look as though an unprivileged user namespace contains more than one user. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
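[Editorial illustration] The "faking chown(2)" trick Aleksa describes can be pictured with a toy model. The real rootless-containers implementation persists a protobuf payload in the "user.rootlesscontainers" xattr; the sketch below instead records the pretended ownership in an in-memory dict (all names are illustrative, none of this is the actual implementation):

```python
# Toy model of emulated chown in a single-uid user namespace: the
# process cannot really change ownership, so an interceptor records
# what the caller *asked* for and replays it when ownership is queried.
# (rootlesscontainers stores this out-of-band data in an xattr, not a
# dict; the dict keeps the sketch self-contained.)

fake_owners = {}  # path -> (uid, gid) the "container" believes it set

def emulated_chown(path, uid, gid):
    # A real interceptor would shadow the chown(2) syscall (e.g. via
    # seccomp or ptrace); here we only record the request.
    fake_owners[path] = (uid, gid)

def emulated_owner(path, real_uid, real_gid):
    # Report the faked owner if one was recorded, else the real one.
    return fake_owners.get(path, (real_uid, real_gid))

emulated_chown("/etc/shadow", 0, 0)
print(emulated_owner("/etc/shadow", 1000, 1000))  # -> (0, 0)
print(emulated_owner("/etc/passwd", 1000, 1000))  # -> (1000, 1000)
```

This also makes Aleksa's objection concrete: because the ownership lives with the file (or, in this toy, with its path), two containers sharing one image directory cannot each see a different shift without copying the data.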
* Re: shiftfs status and future development 2018-06-15 15:25 ` Matthew Wilcox 2018-06-15 15:56 ` Aleksa Sarai @ 2018-06-15 16:09 ` James Bottomley 2018-06-15 17:04 ` Aleksa Sarai 1 sibling, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-15 16:09 UTC (permalink / raw) To: Matthew Wilcox, Seth Forshee Cc: containers, Tyler Hicks, Christian Brauner, linux-fsdevel On Fri, 2018-06-15 at 08:25 -0700, Matthew Wilcox wrote: > On Fri, Jun 15, 2018 at 09:59:17AM -0500, Seth Forshee wrote: > > - Supports any id maps possible for a user namespace > > Have we already ruled out storing the container's UID/GID/perms in an > extended attribute, and having all the files owned by the owner of > the container from the perspective of the unshifted fs. Then shiftfs > reads the xattr and presents the files with the container's idea of > what the UID is? I've got an experimental patch set that does the *mark* as an xattr. However the shift is still done through s_userns, which allows for multiple shifts. James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 16:09 ` James Bottomley @ 2018-06-15 17:04 ` Aleksa Sarai 2018-06-15 17:22 ` James Bottomley 0 siblings, 1 reply; 33+ messages in thread From: Aleksa Sarai @ 2018-06-15 17:04 UTC (permalink / raw) To: James Bottomley Cc: Matthew Wilcox, Seth Forshee, Tyler Hicks, linux-fsdevel, containers, Christian Brauner [-- Attachment #1: Type: text/plain, Size: 975 bytes --] On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > - Supports any id maps possible for a user namespace > > > > Have we already ruled out storing the container's UID/GID/perms in an > > extended attribute, and having all the files owned by the owner of > > the container from the perspective of the unshifted fs. Then shiftfs > > reads the xattr and presents the files with the container's idea of > > what the UID is? > > I've got an experimental patch set that does the *mark* as an xattr. I forgot to ask you about this when we all met face-to-face -- can you go over what the purpose of marking the mounts before being able to shifts is? When I saw your demo at LPC I was quite confused about what it was doing (I think you mentioned it was a security feature, but I must admit I didn't follow the explanation). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 17:04 ` Aleksa Sarai @ 2018-06-15 17:22 ` James Bottomley 2018-06-15 20:47 ` Seth Forshee 0 siblings, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-15 17:22 UTC (permalink / raw) To: Aleksa Sarai Cc: containers, Matthew Wilcox, Seth Forshee, Christian Brauner, Tyler Hicks, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 2798 bytes --] On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote: > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com > > wrote: > > > > - Supports any id maps possible for a user namespace > > > > > > Have we already ruled out storing the container's UID/GID/perms > > > in an extended attribute, and having all the files owned by the > > > owner of the container from the perspective of the unshifted > > > fs. Then shiftfs reads the xattr and presents the files with the > > > container's idea of what the UID is? > > > > I've got an experimental patch set that does the *mark* as an > > xattr. > > I forgot to ask you about this when we all met face-to-face -- can > you go over what the purpose of marking the mounts before being able > to shifts is? When I saw your demo at LPC I was quite confused about > what it was doing (I think you mentioned it was a security feature, > but I must admit I didn't follow the explanation). OK, so the basic security problem is that an unprivileged tenant cannot be allowed arbitrary access to both the shifted and underlying unshifted locations because they can do writes to the shifted mount that appear at real uid/gid 0 in the underlying unshifted location, setting up all sorts of unpleasant threats of which suid execution is just the most obvious one. 
My mount marking solution, which the v2 (and forthcoming v3) has is the idea that the admin buries the real underlying location deep in a path inaccessible (to the tenant) part of the filesystem and then exposes a marked mount point to the tenant by doing mount -t shiftfs -o mark <underlying location> <tenant visible> Then in the <tenant visible> location we can block the potential exploits. When the tenant is building an unprivileged container, it can do mount -t shiftfs <tenant visible> <container location> And the <container location> will now have the shifting in place. This scheme is ephemeral (the marked mount has to be recreated on every boot) and rather complex, so the alternative is to add a permanent mark to <underlying location> so that regular tenant access can be secured (or even prohibited) but the tenant can still do mount -t shiftfs <underlying location> <container location> To get the shifting properties in the container. In this version of the scheme, the shift mountable directory is marked with a security xattr that is permanent (survives reboot) but requires that the filesystem support xattrs, of course. The down side of the xattr scheme is that the securing against the tenant part becomes an xattr enforced thing rather than a shiftfs enforced thing, so it has to be an additional patch to the kernel itself rather than being inside a self contained module. James [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 17:22 ` James Bottomley @ 2018-06-15 20:47 ` Seth Forshee 2018-06-15 21:09 ` James Bottomley 0 siblings, 1 reply; 33+ messages in thread From: Seth Forshee @ 2018-06-15 20:47 UTC (permalink / raw) To: James Bottomley Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner, Tyler Hicks, linux-fsdevel On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote: > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote: > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com > > > wrote: > > > > > - Supports any id maps possible for a user namespace > > > > > > > > Have we already ruled out storing the container's UID/GID/perms > > > > in an extended attribute, and having all the files owned by the > > > > owner of the container from the perspective of the unshifted > > > > fs. Then shiftfs reads the xattr and presents the files with the > > > > container's idea of what the UID is? > > > > > > I've got an experimental patch set that does the *mark* as an > > > xattr. > > > > I forgot to ask you about this when we all met face-to-face -- can > > you go over what the purpose of marking the mounts before being able > > to shifts is? When I saw your demo at LPC I was quite confused about > > what it was doing (I think you mentioned it was a security feature, > > but I must admit I didn't follow the explanation). > > OK, so the basic security problem is that an unprivileged tenant cannot > be allowed arbitrary access to both the shifted and underlying > unshifted locations because they can do writes to the shifted mount > that appear at real uid/gid 0 in the underlying unshifted location, > setting up all sorts of unpleasant threats of which suid execution is > just the most obvious one. 
> > My mount marking solution, which the v2 (and forthcoming v3) has is the > idea that the admin buries the real underlying location deep in a path > inaccessible (to the tenant) part of the filesystem and then exposes a > marked mount point to the tenant by doing > > mount -t shiftfs -o mark <underlying location> <tenant visible> > > Then in the <tenant visible> location we can block the potential > exploits. When the tenant is building an unprivileged container, it > can do > > mount -t shiftfs <tenant visible> <container location> > > And the <container location> will now have the shifting in place. More generally, we can't allow an unprivileged user ns to mount any subtree with an id shift unless the context that controls that subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it would be a simple matter for any user to create a user ns and make an id shifted mount of /. The marking in shiftfs is one way of solving this problem. I don't know if you saw my comments about marking earlier in the thread. Tl;dr, I think that the new mount apis in the filesystem context patches could allow an alternative to marking. I think we should be able to arrange it so that the "host" context sets up a mount fd for shiftfs mounting a specific subtree then passes that fd into the container. The container can then use the fd to attach the mount to its filesystem tree. This will provide all the benefits of marking without that awkward intermediate mount point. Of course those patches haven't been merged yet, but based on the discussion I've seen their prospects look good. 
> This scheme is ephemeral (the marked mount has to be recreated on every > boot) and rather complex, so the alternative is to add a permanent mark > to <underlying location> so that regular tenant access can be secured > (or even prohibited) but the tenant can still do > > mount -t shiftfs <underlying location> <container location> This of course would not be possible in my proposed mount fd scheme. > To get the shifting properties in the container. In this version of > the scheme, the shift mountable directory is marked with a security > xattr that is permanent (survives reboot) but requires that the > filesystem support xattrs, of course. > > The down side of the xattr scheme is that the securing against the > tenant part becomes an xattr enforced thing rather than a shiftfs > enforced thing, so it has to be an additional patch to the kernel > itself rather than being inside a self contained module. Would this work for nested containers? I guess it should be fine to allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so probably so. Thanks, Seth ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 20:47 ` Seth Forshee @ 2018-06-15 21:09 ` James Bottomley 2018-06-15 21:35 ` Seth Forshee 0 siblings, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-15 21:09 UTC (permalink / raw) To: Seth Forshee Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner, Tyler Hicks, linux-fsdevel On Fri, 2018-06-15 at 15:47 -0500, Seth Forshee wrote: > On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote: > > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote: > > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership > > > .com > > > > wrote: > > > > > > - Supports any id maps possible for a user namespace > > > > > > > > > > Have we already ruled out storing the container's > > > > > UID/GID/perms in an extended attribute, and having all the > > > > > files owned by the owner of the container from the > > > > > perspective of the unshifted fs. Then shiftfs reads the > > > > > xattr and presents the files with the container's idea of > > > > > what the UID is? > > > > > > > > I've got an experimental patch set that does the *mark* as an > > > > xattr. > > > > > > I forgot to ask you about this when we all met face-to-face -- > > > can you go over what the purpose of marking the mounts before > > > being able to shifts is? When I saw your demo at LPC I was quite > > > confused about what it was doing (I think you mentioned it was a > > > security feature, but I must admit I didn't follow the > > > explanation). > > > > OK, so the basic security problem is that an unprivileged tenant > > cannot be allowed arbitrary access to both the shifted and > > underlying unshifted locations because they can do writes to the > > shifted mount that appear at real uid/gid 0 in the underlying > > unshifted location, setting up all sorts of unpleasant threats of > > which suid execution is just the most obvious one. 
> > > > My mount marking solution, which the v2 (and forthcoming v3) has is > > the idea that the admin buries the real underlying location deep in > > a path inaccessible (to the tenant) part of the filesystem and then > > exposes a marked mount point to the tenant by doing > > > > mount -t shiftfs -o mark <underlying location> <tenant visible> > > > > Then in the <tenant visible> location we can block the potential > > exploits. When the tenant is building an unprivileged container, > > it can do > > > > mount -t shiftfs <tenant visible> <container location> > > > > And the <container location> will now have the shifting in place. > > More generally, we can't allow an unprivileged user ns to mount any > subtree with an id shift unless the context that controls that > subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it > would be a simple matter for any user to create a user ns and make an > id shifted mount of /. The marking in shiftfs is one way of solving > this problem. > > I don't know if you saw my comments about marking earlier in the > thread. Tl;dr, I think that the new mount apis in the filesystem > context patches could allow an alternative to marking. Yes, I read it and I have a draft reply except that I need to investigate the new mount API. My suspicion is that it won't work over standard mount(8) so now we'll need a tool to set it up. The other point is that it's a temporary mark (again) so it has to be renewed over every reboot. The impression I picked up was this was the least liked part of the current marking scheme (well, possibly first equally disliked with the complexity) so the xattr (or some other) scheme might be better received. > I think we should be able to arrange it so that the "host" context > sets up a mount fd for shiftfs mounting a specific subtree then > passes that fd into the container. The container can then use the fd > to attach the mount to its filesystem tree. 
This will provide all the > benefits of marking without that awkward intermediate mount point. Doesn't this mean that whatever sits on the host end has to be privileged enough to open the fd in the first place, meaning either a suid/CAP_SYS_<something> binary or root itself? > Of course those patches haven't been merged yet, but based on the > discussion I've seen their prospects look good. > > > This scheme is ephemeral (the marked mount has to be recreated on > > every boot) and rather complex, so the alternative is to add a > > permanent mark to <underlying location> so that regular tenant > > access can be secured (or even prohibited) but the tenant can still > > do > > > > mount -t shiftfs <underlying location> <container location> > > This of course would not be possible in my proposed mount fd scheme. > > > To get the shifting properties in the container. In this version > > of the scheme, the shift mountable directory is marked with a > > security xattr that is permanent (survives reboot) but requires > > that the filesystem support xattrs, of course. > > > > The down side of the xattr scheme is that the securing against the > > tenant part becomes an xattr enforced thing rather than a shiftfs > > enforced thing, so it has to be an additional patch to the kernel > > itself rather than being inside a self contained module. > > Would this work for nested containers? I guess it should be fine to > allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so > probably so. Yes, the s_user_ns is simply a copy of the user_ns in operation when the mount was created, so if user_ns is nested then s_user_ns will be a reverse nesting. This should work independently of the marking scheme. I'm just in the middle of reposting. The cc list is getting unwieldy; is it OK if I keep the fsdevel, kernel, security and containers lists and drop everyone except for a couple of containers people who aren't on kernel lists? 
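[Editorial illustration] The "reverse nesting" of s_user_ns that James describes can be pictured as map composition: resolving what a host id looks like inside a nested container applies each namespace's id map in turn. A hedged sketch, assuming illustrative uid_map-style entries of (inside_start, outside_start, count); the names and map values are made up for the example, not kernel code:

```python
# Compose nested user-namespace id shifts: container A maps host uids,
# and a nested container B inside A maps A's uids. A host uid seen from
# B goes through both maps, host -> A -> B.

def map_uid(uid_map, outside_uid):
    """Map an outer-namespace uid through one uid_map; None if unmapped."""
    for inside_start, outside_start, count in uid_map:
        if outside_start <= outside_uid < outside_start + count:
            return inside_start + (outside_uid - outside_start)
    return None

host_to_a = [(0, 100000, 65536)]  # A's uid_map, relative to the host
a_to_b = [(0, 10000, 1000)]       # B's uid_map, relative to A

def host_uid_seen_by_b(host_uid):
    in_a = map_uid(host_to_a, host_uid)
    return None if in_a is None else map_uid(a_to_b, in_a)

print(host_uid_seen_by_b(110000))  # host 110000 -> A 10000 -> B 0 (B's root)
print(host_uid_seen_by_b(100000))  # host 100000 -> A 0, unmapped in B -> None
```

This is why a shift performed relative to s_user_ns composes naturally: each level only needs its own map, and the kernel walks the chain.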
James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 21:09 ` James Bottomley @ 2018-06-15 21:35 ` Seth Forshee 0 siblings, 0 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-15 21:35 UTC (permalink / raw) To: James Bottomley Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner, Tyler Hicks, linux-fsdevel On Fri, Jun 15, 2018 at 02:09:52PM -0700, James Bottomley wrote: > On Fri, 2018-06-15 at 15:47 -0500, Seth Forshee wrote: > > On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote: > > > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote: > > > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership > > > > .com > > > > > wrote: > > > > > > > - Supports any id maps possible for a user namespace > > > > > > > > > > > > Have we already ruled out storing the container's > > > > > > UID/GID/perms in an extended attribute, and having all the > > > > > > files owned by the owner of the container from the > > > > > > perspective of the unshifted fs. Then shiftfs reads the > > > > > > xattr and presents the files with the container's idea of > > > > > > what the UID is? > > > > > > > > > > I've got an experimental patch set that does the *mark* as an > > > > > xattr. > > > > > > > > I forgot to ask you about this when we all met face-to-face -- > > > > can you go over what the purpose of marking the mounts before > > > > being able to shifts is? When I saw your demo at LPC I was quite > > > > confused about what it was doing (I think you mentioned it was a > > > > security feature, but I must admit I didn't follow the > > > > explanation). 
> > > > > > OK, so the basic security problem is that an unprivileged tenant > > > cannot be allowed arbitrary access to both the shifted and > > > underlying unshifted locations because they can do writes to the > > > shifted mount that appear at real uid/gid 0 in the underlying > > > unshifted location, setting up all sorts of unpleasant threats of > > > which suid execution is just the most obvious one. > > > > > > My mount marking solution, which the v2 (and forthcoming v3) has is > > > the idea that the admin buries the real underlying location deep in > > > a path inaccessible (to the tenant) part of the filesystem and then > > > exposes a marked mount point to the tenant by doing > > > > > > mount -t shiftfs -o mark <underlying location> <tenant visible> > > > > > > Then in the <tenant visible> location we can block the potential > > > exploits. When the tenant is building an unprivileged container, > > > it can do > > > > > > mount -t shiftfs <tenant visible> <container location> > > > > > > And the <container location> will now have the shifting in place. > > > > More generally, we can't allow an unprivileged user ns to mount any > > subtree with an id shift unless the context that controls that > > subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it > > would be a simple matter for any user to create a user ns and make an > > id shifted mount of /. The marking in shiftfs is one way of solving > > this problem. > > > > I don't know if you saw my comments about marking earlier in the > > thread. Tl;dr, I think that the new mount apis in the filesystem > > context patches could allow an alternative to marking. > > Yes, I read it an I have a draft reply except that I need to > investigate the new mount API. My suspicion is that it won't work over > standard mount(8) so now we'll need a tool to set it up. > > The other point is that it's a temporary mark (again) so it has to be > renewed over every reboot. 
The impression I picked up was this was the > least liked part of the current marking scheme (well, possibly first > equally disliked with the complexity) so the xattr (or some other) > scheme might be better received. From my perspective the primary benefit of a permanent mark is that the kernel could implement some protections for marked subtrees, and presumably we could do away with the intermediate mount. That sounds much nicer than the current marking scheme, imo. I do understand that for others less admin-side setup work is also a benefit. > > I think we should be able to arrange it so that the "host" context > > sets up a mount fd for shiftfs mounting a specific subtree then > > passes that fd into the container. The container can then use the fd > > to attach the mount to its filesystem tree. This will provide all the > > benefits of marking without that awkward intermediate mount point. > > Doesn't this mean that whatever sits on the host end has to be > privileged enough to open the fd in the first place, meaning either a > suid/CAP_SYS_<something> binary or root itself? Yes to the "something privileged enough" question. It should not be a suid/setcap executable; an unprivileged user must not be able to pick what paths are eligible for id shifting. That has to be an admin operation. > > Of course those patches haven't been merged yet, but based on the > > discussion I've seen their prospects look good. > > > > > This scheme is ephemeral (the marked mount has to be recreated on > > > every boot) and rather complex, so the alternative is to add a > > > permanent mark to <underlying location> so that regular tenant > > > access can be secured (or even prohibited) but the tenant can still > > > do > > > > > > mount -t shiftfs <underlying location> <container location> > > > > This of course would not be possible in my proposed mount fd scheme. > > > > > To get the shifting properties in the container. 
In this version > > > of the scheme, the shift mountable directory is marked with a > > > security xattr that is permanent (survives reboot) but requires > > > that the filesystem support xattrs, of course. > > > > > > The down side of the xattr scheme is that the securing against the > > > tenant part becomes an xattr enforced thing rather than a shiftfs > > > enforced thing, so it has to be an additional patch to the kernel > > > itself rather than being inside a self contained module. > > > > Would this work for nested containers? I guess it should be fine to > > allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so > > probably so. > > Yes, the s_user_ns is simply a copy of the user_ns in operation when > the mount was created, so if user_ns is nested then s_user_ns will be a > reverse nesting. This should work independently of the marking scheme. > > I'm just in the middle of reposting. The cc list is getting unwieldy; > is it OK if I keep the fsdevel, kernel, security and containers lists > and drop everyone except for a couple of containers people who aren't on > kernel lists? You don't have to include me on the Cc. Thanks, Seth ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 14:59 ` Seth Forshee 2018-06-15 15:25 ` Matthew Wilcox @ 2018-06-16 3:03 ` James Bottomley 2018-06-18 13:40 ` Seth Forshee 1 sibling, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-16 3:03 UTC (permalink / raw) To: Seth Forshee, Serge E. Hallyn Cc: linux-fsdevel, containers, Tyler Hicks, Christian Brauner On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote: > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote: > > Quoting Seth Forshee (seth.forshee@canonical.com): > > > I wanted to inquire about the current status of shiftfs and the > > > plans for it moving forward. We'd like to have this functionality > > > available for use in lxd, and I'm interesetd in helping with > > > development (or picking up development if it's stalled). > > > > > > To start, is anyone still working on shiftfs or similar > > > functionality? I haven't found it in any git tree on kernel.org, > > > and as far as mailing list activity the last submission I can > > > find is [1]. Is there anything newer than this? > > > > > > Based on past mailing list discussions, it seems like there was > > > still debate as to whether this feature should be an overlay > > > filesystem or something supported at the vfs level. Was this ever > > > resolved? > > > > > > Thanks, > > > Seth > > > > > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn > > > ership.com > > > > Hey Seth, > > > > I haven't heard anything in a long time. But if this is going to > > pick back up, can we come up with a detailed set of goals and > > requirements? That would actually help. > I was planning to follow up later with some discussion of > requirements. 
> Here are some of ours: > > - Supports any id maps possible for a user namespace Could you clarify: right at the moment, it basically reverses the namespace ID mapping when it goes on to the filesystem using the superblock user namespace, so, in theory you can have an arbitrary mapping simply by changing the s_userns. The problem here is that you don't have a lot of tools for manipulating the s_userns. > - Does not break inotify I don't expect it does, but I haven't checked. > - Passes accurate disk usage and source information from the > "underlay" mounts of this type don't currently show up in df > - Works with a variety of filesystems (ext4, xfs, btrfs, etc.) yes > - Works with nested containers yes > I'm also interested in collecting any requirements others might have. > > > I don't recall whether the last version still worked like this, but > > I'm still not comfortable with the idea of a system where after a > > reboot, container-created root-owned files are owned by host root > > until a path is specially marked. Enforcing that the "source" > > directory is itself uid-shifted would greatly ease my mind. And I believe we're discussing everything below in a different subthread. James > I understand the concern and share the discomfort to some degree, but > I'm not convinced that requiring the source subtree be shifted is the > right approach. > > First, let's address the marking question. As you stated, an approach > that leaves the subtree unmarked for a period of time is problematic, > and imo this is a fatal flaw with marking as a protection for e.g. > execing some suid root file written by a container. Writing some such > mark to the filesystem would make it persistent, but it could also > limit the support to a limited set of filesystems. > > However, I do think it's necessary for a user with sufficient > capabilities to "bless" a subtree for mounting in a less privileged > context, so this is a feature of marking that I would like to keep. 
I > think the new mount apis in David Howells' filesystem context patches > [1] might give us a nicer way to do this. For example, root in > init_user_ns could set up a mount fd which specifies the source > subtree for the id shift. At that time the kernel could check for > ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem > containing the source subtree. Then the fd could be passed to a > container in a user namespace, who could use it to attach the mount > to its filesystem tree. The same concept could be extended to nested > containers, as long as the user setting the source subtree has > CAP_SYS_ADMIN towards sb->s_user_ns for the subtree. > > Now back to requiring the source subtree be id shifted. I understand > the motivation for wanting this, but I'm not sure I'm in favor of it. > To start, there are other ways to ensure that id shifted mounts don't > lead to problems, such as putting the subtree under a directory > accessible only by root or putting it in a nosuid or noexec mount. > For some implementations those sorts of protections are going to make > sense. > > Having this requirement may also add significant time to mounting, as > I assume it would involve iterating through all filesystem objects. > > Additionally, that requirement is likely to significantly complicate > the implementation. The simplest implementation would just translate > the k[ug]ids in the inodes to a target user ns. A slightly more > complicated approach might translate them based on a source and > destination user ns. If it's implemented based on passing in an > arbitrary id map at mount time it will be more complex and duplicate > functionality that user namespaces already give us. 
> > Thanks, > Seth > > [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.st > git@warthog.procyon.org.uk > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-16 3:03 ` James Bottomley @ 2018-06-18 13:40 ` Seth Forshee 2018-06-18 13:49 ` Amir Goldstein 2018-06-18 14:56 ` James Bottomley 0 siblings, 2 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-18 13:40 UTC (permalink / raw) To: James Bottomley Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks, Christian Brauner On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote: > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote: > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote: > > > Quoting Seth Forshee (seth.forshee@canonical.com): > > > > I wanted to inquire about the current status of shiftfs and the > > > > plans for it moving forward. We'd like to have this functionality > > > > available for use in lxd, and I'm interesetd in helping with > > > > development (or picking up development if it's stalled). > > > > > > > > To start, is anyone still working on shiftfs or similar > > > > functionality? I haven't found it in any git tree on kernel.org, > > > > and as far as mailing list activity the last submission I can > > > > find is [1]. Is there anything newer than this? > > > > > > > > Based on past mailing list discussions, it seems like there was > > > > still debate as to whether this feature should be an overlay > > > > filesystem or something supported at the vfs level. Was this ever > > > > resolved? > > > > > > > > Thanks, > > > > Seth > > > > > > > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn > > > > ership.com > > > > > > Hey Seth, > > > > > > I haven't heard anything in a long time. But if this is going to > > > pick back up, can we come up with a detailed set of goals and > > > requirements? > > That would actually help. > > > I was planning to follow up later with some discussion of > > requirements. 
> > Here are some of ours: > > > > - Supports any id maps possible for a user namespace > > Could you clarify: right at the moment, it basically reverses the > namespace ID mapping when it does on to the filesystem using the > superblock user namespace, so, in theory you can have an arbitrary > mapping simply by changing the s_userns. The problem here is that you > don't have a lot of tools for manipulating the s_userns. For our purposes the way you're shifting with s_user_ns works fine. I know that Serge would prefer a more arbitrary shift so that an arbitrary, unprivileged range in the source fs could be used (e.g. use ids 100000 - 101000 in the source instead of 0 - 1000), and my thoughts on that are quoted below. > > - Does not break inotify > > I don't expect it does, but I haven't checked. I haven't checked either; I'm planning to do so soon. This is a concern that was expressed to me by others, I think because inotify doesn't work with overlayfs. > > - Passes accurate disk usage and source information from the > > "underlay" > > mounts of this type don't currently show up in df > > > - Works with a variety of filesystems (ext4, xfx, btrfs, etc.) > > yes > > > - Works with nested containers > > yes I'd say not so much: /* to mark a mount point, must be real root */ if (ssi->mark && !capable(CAP_SYS_ADMIN)) goto out; So within a container I cannot mark a range to be shiftfs-mountable within a container I create. I'd argue that as long as a user has CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it should be safe to allow this as it implies privileges wrt all ids found in the source mount. This will likely lead to stacked shiftfs mounts, not sure yet whether or not this works in the current code. > > I'm also interested in collecting any requirements others might have. 
> > > > > I don't recall whether the last version still worked like this, but > > > I'm still not comfortable with the idea of a system where after a > > > reboot, container-created root-owned files are owned by host root > > > until a path is specially marked. Enforcing that the "source" > > > directory is itself uid-shifted would greatly ease my mind. > > And I believe we're discussing everything below in a different > subthread. > > James > > > > I understand the concern and share the discomfort to some degree, but > > I'm not convinced that requiring the source subtree be shifted is the > > right approach. > > > > First, let's address the marking question. As you stated, an approach > > that leaves the subree unmarked for a period of time is problematic, > > and imo this is a fatal flaw with marking as a protection for e.g. > > execing some suid root file written by a container. Writing some such > > mark to the filesystem would make it persistent, but it could also > > limit the support to a limited set of filesystems. > > > > However, I do think it's necessary for a user with sufficient > > capabilities to "bless" a subtree for mounting in a less privileged > > context, so this is a feature of marking that I would like to keep. I > > think the new mount apis in David Howells' filesystem context patches > > [1] might give us a nicer way to do this. For example, root in > > init_user_ns could set up a mount fd which specifies the source > > subtree for the id shift. At that time the kernel could check for > > ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem > > containing the source subtree. Then the fd could be passed to a > > container in a user namespace, who could use it to attach the mount > > to its filesystem tree. The same concept could be extended to nested > > containers, as long as the user setting the source subtree has > > CAP_SYS_ADMIN towards sb->s_user_ns for the subtree. > > > > Now back to reuiring the srouce subtree be id shifted. 
I understand > > the motivation for wanting this, but I'm not sure I'm in favor of it. > > To start, there are other ways to ensure that id shifted mounts don't > > lead to problems, such as putting the subtree under a directory > > accessible only by root or putting it in a nosuid or noexec mount. > > For some implementations those sorts of protections are going to make > > sense. > > > > Having this requirement may also add significant time to mounting, as > > I assume it would involve iterating through all filesystem objects. > > > > Additionally, that requirement is likely to significantly complicate > > the implementation. The simplest implementation would just translate > > the k[ug]ids in the inodes to a target user ns. A slightly more > > complicated approach might translate them based on a source and > > destination user ns. If it's implemented based on passing in an > > arbitrary id map at mount time it will be more complex and duplicate > > functionality that user namespaces already give us. > > > > Thanks, > > Seth > > > > [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.st > > git@warthog.procyon.org.uk > > _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linuxfoundation.org/mailman/listinfo/containers > > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 13:40 ` Seth Forshee @ 2018-06-18 13:49 ` Amir Goldstein 2018-06-18 14:56 ` James Bottomley 1 sibling, 0 replies; 33+ messages in thread From: Amir Goldstein @ 2018-06-18 13:49 UTC (permalink / raw) To: Seth Forshee Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner On Mon, Jun 18, 2018 at 4:40 PM, Seth Forshee <seth.forshee@canonical.com> wrote: >> > - Does not break inotify >> >> I don't expect it does, but I haven't checked. > > I haven't checked either; I'm planning to do so soon. This is a concern > that was expressed to me by others, I think because inotify doesn't work > with overlayfs. > Should be working these days except for corner cases (copy up of hardlink). There are two regression tests in LTP: https://github.com/linux-test-project/ltp/pull/246 Thanks, Amir. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 13:40 ` Seth Forshee 2018-06-18 13:49 ` Amir Goldstein @ 2018-06-18 14:56 ` James Bottomley 2018-06-18 16:03 ` Seth Forshee 2018-06-18 17:11 ` Amir Goldstein 1 sibling, 2 replies; 33+ messages in thread From: James Bottomley @ 2018-06-18 14:56 UTC (permalink / raw) To: Seth Forshee Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks, Christian Brauner On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote: > On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote: > > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote: > > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote: > > > > Quoting Seth Forshee (seth.forshee@canonical.com): > > > > > I wanted to inquire about the current status of shiftfs and > > > > > the plans for it moving forward. We'd like to have this > > > > > functionality available for use in lxd, and I'm interesetd in > > > > > helping with development (or picking up development if it's > > > > > stalled). > > > > > > > > > > To start, is anyone still working on shiftfs or similar > > > > > functionality? I haven't found it in any git tree on > > > > > kernel.org, and as far as mailing list activity the last > > > > > submission I can find is [1]. Is there anything newer than > > > > > this? > > > > > > > > > > Based on past mailing list discussions, it seems like there > > > > > was still debate as to whether this feature should be an > > > > > overlay filesystem or something supported at the vfs level. > > > > > Was this ever resolved? > > > > > > > > > > Thanks, > > > > > Seth > > > > > > > > > > [1] > > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn > > > > > ership.com > > > > > > > > Hey Seth, > > > > > > > > I haven't heard anything in a long time. But if this is going > > > > to pick back up, can we come up with a detailed set of goals > > > > and requirements? > > > > That would actually help. 
> > > > > I was planning to follow up later with some discussion of > > > requirements. Here are some of ours: > > > > > > - Supports any id maps possible for a user namespace > > > > Could you clarify: right at the moment, it basically reverses the > > namespace ID mapping when it does on to the filesystem using the > > superblock user namespace, so, in theory you can have an arbitrary > > mapping simply by changing the s_userns. The problem here is that > > you don't have a lot of tools for manipulating the s_userns. > > For our purposes the way you're shifting with s_user_ns works fine. I > know that Serge would prefer a more arbitrary shift so that an > arbitrary, unprivileged range in the source fs could be use (e.g. use > ids 100000 - 101000 in the source instead of 0 - 1000), and my > thoughts on that are quoted below. The original (v1) shiftfs did simply take a range of ids to shift as an argument. However, that one could only be set up by root and Eric expressed a desire that it use the s_user_ns. > > > - Does not break inotify > > > > I don't expect it does, but I haven't checked. > > I haven't checked either; I'm planning to do so soon. This is a > concern that was expressed to me by others, I think because inotify > doesn't work with overlayfs. I think shiftfs does work simply because it doesn't really do overlays, so lots of stuff that doesn't work with overlays does work with it. > > > - Passes accurate disk usage and source information from the > > > "underlay" > > > > mounts of this type don't currently show up in df > > > > > - Works with a variety of filesystems (ext4, xfx, btrfs, etc.) > > > > yes > > > > > - Works with nested containers > > > > yes > > I'd say not so much: > > /* to mark a mount point, must be real root */ > if (ssi->mark && !capable(CAP_SYS_ADMIN)) > goto out; > > So within a container I cannot mark a range to be shiftfs-mountable > within a container I create. 
I'd argue that as long as a user has > CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it > should be safe to allow this as it implies privleges wrt all ids > found in the source mount. This will likely lead to stacked shiftfs > mounts, not sure yet whether or not this works in the current code. Um, I think we have different definitions of "works with nested containers". Recall that for a nested container the s_user_ns is also nested, so we shift all the way back to the uid in the root. That means if the check for marking is not capable(CAP_SYS_ADMIN) then an unprivileged user would be able to gain root write access by setting up a nested shift. If your definition of nested means we only shift back one level of user_ns nesting then this could become ns_capable(), so I think we need to add "what is the desired nesting behaviour?" to the questions to be answered by the requirements. James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 14:56 ` James Bottomley @ 2018-06-18 16:03 ` Seth Forshee 2018-06-18 17:11 ` Amir Goldstein 1 sibling, 0 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-18 16:03 UTC (permalink / raw) To: James Bottomley Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks, Christian Brauner On Mon, Jun 18, 2018 at 07:56:59AM -0700, James Bottomley wrote: > On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote: > > On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote: > > > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote: > > > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote: > > > > > Quoting Seth Forshee (seth.forshee@canonical.com): > > > > > > I wanted to inquire about the current status of shiftfs and > > > > > > the plans for it moving forward. We'd like to have this > > > > > > functionality available for use in lxd, and I'm interesetd in > > > > > > helping with development (or picking up development if it's > > > > > > stalled). > > > > > > > > > > > > To start, is anyone still working on shiftfs or similar > > > > > > functionality? I haven't found it in any git tree on > > > > > > kernel.org, and as far as mailing list activity the last > > > > > > submission I can find is [1]. Is there anything newer than > > > > > > this? > > > > > > > > > > > > Based on past mailing list discussions, it seems like there > > > > > > was still debate as to whether this feature should be an > > > > > > overlay filesystem or something supported at the vfs level. > > > > > > Was this ever resolved? > > > > > > > > > > > > Thanks, > > > > > > Seth > > > > > > > > > > > > [1] > > > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn > > > > > > ership.com > > > > > > > > > > Hey Seth, > > > > > > > > > > I haven't heard anything in a long time. But if this is going > > > > > to pick back up, can we come up with a detailed set of goals > > > > > and requirements? 
> > > > > > That would actually help. > > > > > > > I was planning to follow up later with some discussion of > > > > requirements. Here are some of ours: > > > > > > > > - Supports any id maps possible for a user namespace > > > > > > Could you clarify: right at the moment, it basically reverses the > > > namespace ID mapping when it does on to the filesystem using the > > > superblock user namespace, so, in theory you can have an arbitrary > > > mapping simply by changing the s_userns. The problem here is that > > > you don't have a lot of tools for manipulating the s_userns. > > > > For our purposes the way you're shifting with s_user_ns works fine. I > > know that Serge would prefer a more arbitrary shift so that an > > arbitrary, unprivileged range in the source fs could be use (e.g. use > > ids 100000 - 101000 in the source instead of 0 - 1000), and my > > thoughts on that are quoted below. > > The original (v1) shiftfs did simply take a range of ids to shift as an > argument. However, that one could only be set up by root and Eric > expressed a desire that it use the s_user_ns. I like using s_user_ns too, just pointing out that it does likely preclude using a shifted source. > > > > - Does not break inotify > > > > > > I don't expect it does, but I haven't checked. > > > > I haven't checked either; I'm planning to do so soon. This is a > > concern that was expressed to me by others, I think because inotify > > doesn't work with overlayfs. > > I think shiftfs does work simply because it doesn't really do overlays, > so lots of stuff that doesn't work with overlays does work with it. > > > > > - Passes accurate disk usage and source information from the > > > > "underlay" > > > > > > mounts of this type don't currently show up in df > > > > > > > - Works with a variety of filesystems (ext4, xfx, btrfs, etc.) 
> > > > > > yes > > > > > > > - Works with nested containers > > > > > > yes > > > > I'd say not so much: > > > > /* to mark a mount point, must be real root */ > > if (ssi->mark && !capable(CAP_SYS_ADMIN)) > > goto out; > > > > So within a container I cannot mark a range to be shiftfs-mountable > > within a container I create. I'd argue that as long as a user has > > CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it > > should be safe to allow this as it implies privleges wrt all ids > > found in the source mount. This will likely lead to stacked shiftfs > > mounts, not sure yet whether or not this works in the current code. > > Um, I think we have different definitions of "works with nested > containers". Ultimately what I mean is that it should work the same way in a container as in the host, given that the container has the necessary capabilities towards the source subtree that it wants to id shift. So my container should be able to mark a subtree over which I have ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) and then create a container that can shiftfs-mount that subtree. > Recall that for a nested container the s_user_ns is also > nested, so we shift all the way back to the uid in the root. That > means if the check for marking is not capable(CAP_SYS_ADMIN) then an > unprivileged user would be able to gain root write access by setting up > a nested shift. This is true. However, real root already delegated the ability to write as root within a subtree to the container by marking that subtree. Since that container can already write as root to that subtree, what problem is created by letting it create a nested container that can also write to that subtree as root? Either the host already set things up so that writes as root to that subtree are not an issue, or it didn't and you already have a problem. 
For non-shiftfs filesystems, inodes from the filesystem cannot have any ownership by ids not in s_user_ns, so the nested container could not write any ids not already under the control of the first-level container. > If your definition of nested means we only shift back > one level of user_ns nesting then this could become ns_capable(), so I > think we need to add "what is the desired nesting behaviour?" to the > questions to be answered by the requirements. No, I think in the shiftfs-over-shiftfs use case it does shift all the way back to the host. My argument is that this isn't actually a problem. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 14:56 ` James Bottomley 2018-06-18 16:03 ` Seth Forshee @ 2018-06-18 17:11 ` Amir Goldstein 2018-06-18 19:53 ` Phil Estes ` (3 more replies) 1 sibling, 4 replies; 33+ messages in thread From: Amir Goldstein @ 2018-06-18 17:11 UTC (permalink / raw) To: James Bottomley Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: [...] >> > > - Does not break inotify >> > >> > I don't expect it does, but I haven't checked. >> >> I haven't checked either; I'm planning to do so soon. This is a >> concern that was expressed to me by others, I think because inotify >> doesn't work with overlayfs. > > I think shiftfs does work simply because it doesn't really do overlays, > so lots of stuff that doesn't work with overlays does work with it. > I'm afraid shiftfs suffers from the same problems that the old naive overlayfs inode implementation suffered from. This problem is demonstrated with LTP tests inotify08 and inotify09. shiftfs_new_inode() is called on every lookup, so an inotify watch may be set on an inode object, then the dentry is evicted from cache, and subsequent events on the new dentry are not reported on the watched inode. You will need to implement hashed inodes to solve it. It can be done as overlay does - hashing by real inode pointer. This is just one of those subtle things about stacked fs, and there may be others at present and more in the future - if we don't have a shared code base for the two stacked fs, I wager you are going to end up "cherry picking" fixes often. IMO, an important question to ask is, since both shiftfs and overlayfs are strongly coupled with container use cases, are there users that are interested in both layering AND shifting? on the same "mark"? 
Another argument is that shiftfs itself takes the maximum allowed 2 levels of s_stack_depth for its 2 mounts, so it is actually not possible with the current VFS limitation to combine shiftfs with overlayfs. This could be solved relatively easily by adding "-o mark" support to overlayfs and allowing shiftfs to also be mounted over "marked" overlayfs inside a container. Anyway, I'm just playing devil's advocate to the idea of two stacked fs implementations, so presenting this point of view. I am fully aware that there are also plenty of disadvantages to coupling two unrelated functionalities together. Cheers, Amir. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 17:11 ` Amir Goldstein @ 2018-06-18 19:53 ` Phil Estes 2018-06-21 20:16 ` Seth Forshee ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: Phil Estes @ 2018-06-18 19:53 UTC (permalink / raw) To: Amir Goldstein Cc: James Bottomley, linux-fsdevel, Seth Forshee, Linux Containers, Tyler Hicks, Christian Brauner Amir Goldstein wrote on 6/18/18 1:11 PM: > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > [...] >>>>> - Does not break inotify >>>> I don't expect it does, but I haven't checked. >>> I haven't checked either; I'm planning to do so soon. This is a >>> concern that was expressed to me by others, I think because inotify >>> doesn't work with overlayfs. >> I think shiftfs does work simply because it doesn't really do overlays, >> so lots of stuff that doesn't work with overlays does work with it. >> > I'm afraid shiftfs suffers from the same problems that the old naiive > overlayfs inode implementation suffered from. > > This problem is demonstrated with LTP tests inotify08 inotify09. > shiftfs_new_inode() is called on every lookup, so inotify watch > may be set on an inode object, then dentry is evicted from cache > and then all events on new dentry are not reported on the watched > inode. You will need to implement hashed inodes to solve it. > Can be done as overlay does - hashing by real inode pointer. > > This is just one of those subtle things about stacked fs and there may > be other in present and more in future - if we don't have a shared code > base for the two stacked fs, I wager you are going to end up "cherry > picking" fixes often. > > IMO, an important question to ask is, since both shiftfs and overlayfs > are strongly coupled with container use cases, are there users that > are interested in both layering AND shifting? on the same "mark"? 
> If the answer is yes, then this may be an argument in favor of > integrating at least some of shittfs functionality into overlayfs. > > Another argument is that shiftfs itself takes the maximum allowed > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > possible with current VFS limitation to combine shiftfs with overlayfs. I'm not sure if this was my problem or not, but I did try the v2 patchset with overlay (by marking and mounting a filesystem tree with shiftfs, and then using it as the component of an overlay mount) and could not get it to work. I was trying to prepare a demo, but with limited time gave up and wasn't able to find time to debug with possible help from James. I think for shiftfs to be useful in certain container contexts, combination with overlay is definitely a desirable and/or required feature. Of course if it is limited to overlay (e.g. implemented within overlayfs) that would be limiting for container use cases for other contexts (zfs, btrfs, etc.). > > This could be solved relatively easily by adding "-o mark" support > to overlayfs and allowing to mount shiftfs also over "marked" overlayfs > inside container. > > Anyway, I'm just playing devil's advocate to the idea of two stacked fs > implementation, so presenting this point of view. I am fully aware that > there are also plenty of disadvantages to couple two unrelated > functionalities together. > > Cheers, > Amir. > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 17:11 ` Amir Goldstein 2018-06-18 19:53 ` Phil Estes @ 2018-06-21 20:16 ` Seth Forshee 2018-06-24 11:32 ` Amir Goldstein 2018-06-25 11:19 ` Christian Brauner 2018-06-27 7:48 ` James Bottomley 3 siblings, 1 reply; 33+ messages in thread From: Seth Forshee @ 2018-06-21 20:16 UTC (permalink / raw) To: Amir Goldstein Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner On Mon, Jun 18, 2018 at 08:11:52PM +0300, Amir Goldstein wrote: > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > [...] > >> > > - Does not break inotify > >> > > >> > I don't expect it does, but I haven't checked. > >> > >> I haven't checked either; I'm planning to do so soon. This is a > >> concern that was expressed to me by others, I think because inotify > >> doesn't work with overlayfs. > > > > I think shiftfs does work simply because it doesn't really do overlays, > > so lots of stuff that doesn't work with overlays does work with it. > > > > I'm afraid shiftfs suffers from the same problems that the old naiive > overlayfs inode implementation suffered from. > > This problem is demonstrated with LTP tests inotify08 inotify09. > shiftfs_new_inode() is called on every lookup, so inotify watch > may be set on an inode object, then dentry is evicted from cache > and then all events on new dentry are not reported on the watched > inode. You will need to implement hashed inodes to solve it. > Can be done as overlay does - hashing by real inode pointer. Thanks for the pointer. I modified the LTP inotify08 test to use shiftfs instead of overlayfs, and I can confirm that it fails to see the events. > This is just one of those subtle things about stacked fs and there may > be other in present and more in future - if we don't have a shared code > base for the two stacked fs, I wager you are going to end up "cherry > picking" fixes often. 
> > IMO, an important question to ask is, since both shiftfs and overlayfs > are strongly coupled with container use cases, are there users that > are interested in both layering AND shifting? on the same "mark"? > If the answer is yes, then this may be an argument in favor of > integrating at least some of shittfs functionality into overlayfs. > > Another argument is that shiftfs itself takes the maximum allowed > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > possible with current VFS limitation to combine shiftfs with overlayfs. That's unfortunate -- it will prevent nested use in addition to use with overlayfs. I have been wondering if shiftfs could be made to detect when it was being layered on top of a shiftfs mount, and do something to make it behave like it was a single shift on top of the original subtree. I may look into this. Of course it doesn't help with overlayfs. > This could be solved relatively easily by adding "-o mark" support > to overlayfs and allowing to mount shiftfs also over "marked" overlayfs > inside container. > > Anyway, I'm just playing devil's advocate to the idea of two stacked fs > implementation, so presenting this point of view. I am fully aware that > there are also plenty of disadvantages to couple two unrelated > functionalities together. > > Cheers, > Amir. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-21 20:16 ` Seth Forshee @ 2018-06-24 11:32 ` Amir Goldstein 0 siblings, 0 replies; 33+ messages in thread From: Amir Goldstein @ 2018-06-24 11:32 UTC (permalink / raw) To: Seth Forshee Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner On Thu, Jun 21, 2018 at 11:16 PM, Seth Forshee <seth.forshee@canonical.com> wrote: [...] >> >> Another argument is that shiftfs itself takes the maximum allowed >> 2 levels of s_stack_depth for its 2 mounts, so it is actually not >> possible with current VFS limitation to combine shiftfs with overlayfs. > > That's unfortunate -- it will prevent nested use in addition to use with > overlayfs. > > I have been wondering if shiftfs could be made to detect when it was > being layered on top of a shiftfs mount, and do something to make it > behave like it was a single shift on top of the original subtree. I may > look into this. Of course it doesn't help with overlayfs. > Here's a crazy idea for you - after first mount with -o mark, any shiftfs_lookup() will return -EIO, i.e. the mount starts out inaccessible to any userns. Then, inside a (maybe nested) userns, you allow the userns admin to remount -o shift, which resets s_user_ns (allowed only once). Now, any shiftfs operation from the owner userns is allowed and any operation not from the owner userns is denied. It may be just a crazy idea and I have no idea how hard it would be to implement, but the upside is that it removes the problem of being able to access the marked and shifted fs at the same time, because they simply don't exist at the same time - one transforms into the other in what is hopefully an atomic and secure manner. Note that at no time should there exist shiftfs dentries or inodes in cache whose sb->s_user_ns is anything other than the shifted userns (except maybe for the root dentry/inode). Thanks, Amir. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 17:11 ` Amir Goldstein 2018-06-18 19:53 ` Phil Estes 2018-06-21 20:16 ` Seth Forshee @ 2018-06-25 11:19 ` Christian Brauner 2018-06-27 7:48 ` James Bottomley 3 siblings, 0 replies; 33+ messages in thread From: Christian Brauner @ 2018-06-25 11:19 UTC (permalink / raw) To: Amir Goldstein Cc: James Bottomley, linux-fsdevel, Seth Forshee, Linux Containers, Tyler Hicks, Christian Brauner On Mon, Jun 18, 2018 at 08:11:52PM +0300, Amir Goldstein wrote: > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > [...] > >> > > - Does not break inotify > >> > > >> > I don't expect it does, but I haven't checked. > >> > >> I haven't checked either; I'm planning to do so soon. This is a > >> concern that was expressed to me by others, I think because inotify > >> doesn't work with overlayfs. > > > > I think shiftfs does work simply because it doesn't really do overlays, > > so lots of stuff that doesn't work with overlays does work with it. > > > > I'm afraid shiftfs suffers from the same problems that the old naiive > overlayfs inode implementation suffered from. > > This problem is demonstrated with LTP tests inotify08 inotify09. > shiftfs_new_inode() is called on every lookup, so inotify watch > may be set on an inode object, then dentry is evicted from cache > and then all events on new dentry are not reported on the watched > inode. You will need to implement hashed inodes to solve it. > Can be done as overlay does - hashing by real inode pointer. > > This is just one of those subtle things about stacked fs and there may > be other in present and more in future - if we don't have a shared code > base for the two stacked fs, I wager you are going to end up "cherry > picking" fixes often. 
> > IMO, an important question to ask is, since both shiftfs and overlayfs > are strongly coupled with container use cases, are there users that > are interested in both layering AND shifting? on the same "mark"? > If the answer is yes, then this may be an argument in favor of > integrating at least some of shiftfs functionality into overlayfs. I think there are use-cases for combining shiftfs and overlayfs, especially since the two have decent conceptual overlap. They both provide a mechanism to isolate or at least protect a filesystem tree. The mechanisms differ, but I see no obvious reason why they couldn't be combined. That is one of the reasons why we considered it an option to make shiftfs an extension of overlayfs. But this has been just a thought experiment so far. One of the benefits I see in combining shiftfs and overlayfs is that the underlay could be made read-only and, when combined with shifting, writing a file (e.g. as root) would only hit the upperdir. Although for shifting we really also want an option to write through to the underlay. If done right, we also get the nice feature of merging multiple underlays together, which seems a good feature to have for containers. Additionally, it would allow us to not have to push another filesystem upstream. (That of course would also not be necessary if this were a pure vfs feature, which Seth already mentioned in a prior mail.) > > Another argument is that shiftfs itself takes the maximum allowed > 2 levels of s_stack_depth for its 2 mounts, so it is actually not > possible with current VFS limitation to combine shiftfs with overlayfs. > > This could be solved relatively easily by adding "-o mark" support > to overlayfs and allowing to mount shiftfs also over "marked" overlayfs > inside container. > > Anyway, I'm just playing devil's advocate to the idea of two stacked fs > implementation, so presenting this point of view. 
I am fully aware that > there are also plenty of disadvantages to couple two unrelated > functionalities together. > > Cheers, > Amir. > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-18 17:11 ` Amir Goldstein ` (2 preceding siblings ...) 2018-06-25 11:19 ` Christian Brauner @ 2018-06-27 7:48 ` James Bottomley 2018-06-27 10:17 ` Amir Goldstein 2018-07-03 16:54 ` Serge E. Hallyn 3 siblings, 2 replies; 33+ messages in thread From: James Bottomley @ 2018-06-27 7:48 UTC (permalink / raw) To: Amir Goldstein Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > [...] > > > > > - Does not break inotify > > > > > > > > I don't expect it does, but I haven't checked. > > > > > > I haven't checked either; I'm planning to do so soon. This is a > > > concern that was expressed to me by others, I think because > > > inotify doesn't work with overlayfs. > > > > I think shiftfs does work simply because it doesn't really do > > overlays, so lots of stuff that doesn't work with overlays does > > work with it. > > > > I'm afraid shiftfs suffers from the same problems that the old naiive > overlayfs inode implementation suffered from. > > This problem is demonstrated with LTP tests inotify08 inotify09. > shiftfs_new_inode() is called on every lookup, so inotify watch > may be set on an inode object, then dentry is evicted from cache > and then all events on new dentry are not reported on the watched > inode. You will need to implement hashed inodes to solve it. > Can be done as overlay does - hashing by real inode pointer. > > This is just one of those subtle things about stacked fs and there > may be other in present and more in future - if we don't have a > shared code base for the two stacked fs, I wager you are going to end > up "cherry picking" fixes often. 
> > IMO, an important question to ask is, since both shiftfs and > overlayfs are strongly coupled with container use cases, are there > users that are interested in both layering AND shifting? on the same > "mark"? If the answer is yes, then this may be an argument in favor > of integrating at least some of shiftfs functionality into overlayfs. My container use case is interested in shifting but not layering. Even the docker use case would only mix the two with the overlay graph driver. There seem to be quite a few clouds using non-overlayfs graph drivers (the dm one being the most popular). > Another argument is that shiftfs itself takes the maximum allowed > 2 levels of s_stack_depth for its 2 mounts, so it is actually not > possible with current VFS limitation to combine shiftfs with > overlayfs. That's an artificial, not an inherent, restriction that was introduced to keep the call stack small. It can be increased or even eliminated (although then we'd risk actually running off the end of the kernel stack). > This could be solved relatively easily by adding "-o mark" support > to overlayfs and allowing to mount shiftfs also over "marked" > overlayfs inside container. Can we please decide whether the temporary mark, as implemented in the current patch set, or a more permanent security.<something> xattr-type mark is preferred for this? It's an important question that's been asked but never resolved. > Anyway, I'm just playing devil's advocate to the idea of two stacked > fs implementation, so presenting this point of view. I am fully aware > that there are also plenty of disadvantages to couple two unrelated > functionalities together. The biggest one seems to be that the points at which overlayfs and shiftfs do credential shifting are subtly different. That's not to say they can't be unified, but there's some work to do to prove it's possible. James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-27 7:48 ` James Bottomley @ 2018-06-27 10:17 ` Amir Goldstein 2018-07-03 16:54 ` Serge E. Hallyn 1 sibling, 0 replies; 33+ messages in thread From: Amir Goldstein @ 2018-06-27 10:17 UTC (permalink / raw) To: James Bottomley Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers, Christian Brauner, Mark Salyzyn, Miklos Szeredi, Vivek Goyal [cc some overlayfs folks] On Wed, Jun 27, 2018 at 10:48 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: >> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley >> <James.Bottomley@hansenpartnership.com> wrote: >> [...] >> > > > > - Does not break inotify >> > > > >> > > > I don't expect it does, but I haven't checked. >> > > >> > > I haven't checked either; I'm planning to do so soon. This is a >> > > concern that was expressed to me by others, I think because >> > > inotify doesn't work with overlayfs. >> > >> > I think shiftfs does work simply because it doesn't really do >> > overlays, so lots of stuff that doesn't work with overlays does >> > work with it. >> > >> >> I'm afraid shiftfs suffers from the same problems that the old naiive >> overlayfs inode implementation suffered from. >> >> This problem is demonstrated with LTP tests inotify08 inotify09. >> shiftfs_new_inode() is called on every lookup, so inotify watch >> may be set on an inode object, then dentry is evicted from cache >> and then all events on new dentry are not reported on the watched >> inode. You will need to implement hashed inodes to solve it. >> Can be done as overlay does - hashing by real inode pointer. >> >> This is just one of those subtle things about stacked fs and there >> may be other in present and more in future - if we don't have a >> shared code base for the two stacked fs, I wager you are going to end >> up "cherry picking" fixes often. 
>> >> IMO, an important question to ask is, since both shiftfs and >> overlayfs are strongly coupled with container use cases, are there >> users that are interested in both layering AND shifting? on the same >> "mark"? If the answer is yes, then this may be an argument in favor >> of integrating at least some of shiftfs functionality into overlayfs. > > My container use case is interested in shifting but not layering. Even > the docker use case would only mix the two with the overlay graph > driver. There seem to be quite a few clouds using non-overlayfs graph > drivers (the dm one being the most popular). To be clear, I did not mean that shifting would be possible only for overlayfs users, I meant that the code base could be shared. One option is what I did with 'snapshot' fs. It creates a new filesystem_type, reuses most of the overlayfs internal structures and many of the overlayfs operations, and implements some of its own: https://github.com/amir73il/linux/commit/49081195e1c085b0f02d73f619de5ec0f1113f08 One point of similarity with shiftfs is that snapshot fs has no lowerdir, only upperdir, e.g.: mount -t snapshot none none -oupperdir=/foo And it only took a single-line change in overlayfs code (in ovl_posix_acl_xattr_set) to support an upper-only configuration. > >> Another argument is that shiftfs itself takes the maximum allowed >> 2 levels of s_stack_depth for its 2 mounts, so it is actually not >> possible with current VFS limitation to combine shiftfs with >> overlayfs. > > That's an artificial, not an inherent, restriction that was introduced > to keep the call stack small. It can be increased or even eliminated > (although then we'd risk a real run off the end of the kernel stack > problem). > Yeh, only need to get that change past Al ;-) In an earlier reply I proposed a solution to save 1 stacking layer: mount -oremount,shift to change credential shifting from 'mark' (=deny all access) to 'shift' (=shift to re-mounter user_ns). 
>> This could be solved relatively easily by adding "-o mark" support >> to overlayfs and allowing to mount shiftfs also over "marked" >> overlayfs inside container. > > Can we please decide whether the temporary mark, as implemented in the > current patch set, or a more permanent security.<something> xattr-type > mark is preferred for this? It's an important question that's been > asked but never resolved. > My 2 cents: it is generally considered a bad idea to let users mess around with overlay upper/work dirs directly, but the kernel doesn't do anything to prevent that. It is completely up to the machine admin or container runtime to set up security policies to prevent that. >> Anyway, I'm just playing devil's advocate to the idea of two stacked >> fs implementation, so presenting this point of view. I am fully aware >> that there are also plenty of disadvantages to couple two unrelated >> functionalities together. > > The biggest one seems to be that the points at which overlayfs and > shiftfs do credential shifting are subtly different. That's not to say > they can't be unified, but there's some work to do to prove it's > possible. > That is true. I'd like to point your attention to this patch from Android developers that intends to modify the existing credential shifting behavior of overlayfs: https://marc.info/?l=linux-unionfs&m=153003238029749&w=2 If we accept a second credential shifting paradigm, it's an opening to accepting a third one... Thanks, Amir. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-27 7:48 ` James Bottomley 2018-06-27 10:17 ` Amir Goldstein @ 2018-07-03 16:54 ` Serge E. Hallyn 2018-07-03 17:08 ` Stéphane Graber 1 sibling, 1 reply; 33+ messages in thread From: Serge E. Hallyn @ 2018-07-03 16:54 UTC (permalink / raw) To: James Bottomley Cc: Amir Goldstein, linux-fsdevel, Seth Forshee, Linux Containers, Tyler Hicks, Christian Brauner Quoting James Bottomley (James.Bottomley@HansenPartnership.com): > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > [...] > > > > > > - Does not break inotify > > > > > > > > > > I don't expect it does, but I haven't checked. > > > > > > > > I haven't checked either; I'm planning to do so soon. This is a > > > > concern that was expressed to me by others, I think because > > > > inotify doesn't work with overlayfs. > > > > > > I think shiftfs does work simply because it doesn't really do > > > overlays, so lots of stuff that doesn't work with overlays does > > > work with it. > > > > > > > I'm afraid shiftfs suffers from the same problems that the old naive > > overlayfs inode implementation suffered from. > > > > This problem is demonstrated with LTP tests inotify08 inotify09. > > shiftfs_new_inode() is called on every lookup, so inotify watch > > may be set on an inode object, then dentry is evicted from cache > > and then all events on new dentry are not reported on the watched > > inode. You will need to implement hashed inodes to solve it. > > Can be done as overlay does - hashing by real inode pointer. > > > > This is just one of those subtle things about stacked fs and there > > may be other in present and more in future - if we don't have a > > shared code base for the two stacked fs, I wager you are going to end > > up "cherry picking" fixes often. 
> > > > IMO, an important question to ask is, since both shiftfs and > > overlayfs are strongly coupled with container use cases, are there > > users that are interested in both layering AND shifting? on the same > > "mark"? If the answer is yes, then this may be an argument in favor > > of integrating at least some of shittfs functionality into overlayfs. > > My container use case is interested in shifting but not layering. Even > the docker use case would only mix the two with the overlay graph > driver. There seem to be quite a few clouds using non overlayfs graph > drivers (the dm one being the most popular). > > > Another argument is that shiftfs itself takes the maximum allowed > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > > possible with current VFS limitation to combine shiftfs with > > overlayfs. > > That's an artificial, not an inherent, restriction that was introduced > to keep the call stack small. It can be increased or even eliminated > (although then we'd risk a real run off the end of the kernel stack > problem). > > > This could be solved relatively easily by adding "-o mark" support > > to overlayfs and allowing to mount shiftfs also over "marked" > > overlayfs inside container. > > Can we please decided whether the temporary mark, as implemented in the > current patch set or a more permanent security.<something> xattr type > mark is preferred for this? It's an important question that's been > asked, but we have no resolution on. I think something permanent is mandatory. Otherwise users may be able to induce a reboot into a state where the temp mark isn't made. A security.<something> xattr has the problem that an older kernel may not know about it. Two possibilities which have been mentioned before: 1. just demand that the *source* idmap doesn't start at 0. Ideally it would be something like 100k uids starting at 100k. The kernel would refuse to do a shiftfs mount if the source idmap includes uid 0. 
I suppose the "let-them-shoot-themselves-in-the-foot" approach would be to just strongly recommend using such a source uid mapping, but not enforce it. 2. Enforce that the base directory have perms 700 for shiftfs-mount to be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with 700 perms. Root then can shiftfs-mount it to /container1 uid-shifted to 100000:0:100000. Yes, root could stupidly change the perms later, but failing that uid 1000 can never exploit a setuid-root script under /var/lib/shiftfs/base-rootfs -serge ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-07-03 16:54 ` Serge E. Hallyn @ 2018-07-03 17:08 ` Stéphane Graber 2018-07-03 22:05 ` Serge E. Hallyn 0 siblings, 1 reply; 33+ messages in thread From: Stéphane Graber @ 2018-07-03 17:08 UTC (permalink / raw) To: Serge E. Hallyn Cc: James Bottomley, Tyler Hicks, Linux Containers, Seth Forshee, Christian Brauner, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 5243 bytes --] On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote: > Quoting James Bottomley (James.Bottomley@HansenPartnership.com): > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > [...] > > > > > > > - Does not break inotify > > > > > > > > > > > > I don't expect it does, but I haven't checked. > > > > > > > > > > I haven't checked either; I'm planning to do so soon. This is a > > > > > concern that was expressed to me by others, I think because > > > > > inotify doesn't work with overlayfs. > > > > > > > > I think shiftfs does work simply because it doesn't really do > > > > overlays, so lots of stuff that doesn't work with overlays does > > > > work with it. > > > > > > > > > > I'm afraid shiftfs suffers from the same problems that the old naiive > > > overlayfs inode implementation suffered from. > > > > > > This problem is demonstrated with LTP tests inotify08 inotify09. > > > shiftfs_new_inode() is called on every lookup, so inotify watch > > > may be set on an inode object, then dentry is evicted from cache > > > and then all events on new dentry are not reported on the watched > > > inode. You will need to implement hashed inodes to solve it. > > > Can be done as overlay does - hashing by real inode pointer. 
> > > > > > This is just one of those subtle things about stacked fs and there > > > may be other in present and more in future - if we don't have a > > > shared code base for the two stacked fs, I wager you are going to end > > > up "cherry picking" fixes often. > > > > > > IMO, an important question to ask is, since both shiftfs and > > > overlayfs are strongly coupled with container use cases, are there > > > users that are interested in both layering AND shifting? on the same > > > "mark"? If the answer is yes, then this may be an argument in favor > > > of integrating at least some of shittfs functionality into overlayfs. > > > > My container use case is interested in shifting but not layering. Even > > the docker use case would only mix the two with the overlay graph > > driver. There seem to be quite a few clouds using non overlayfs graph > > drivers (the dm one being the most popular). > > > > > Another argument is that shiftfs itself takes the maximum allowed > > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > > > possible with current VFS limitation to combine shiftfs with > > > overlayfs. > > > > That's an artificial, not an inherent, restriction that was introduced > > to keep the call stack small. It can be increased or even eliminated > > (although then we'd risk a real run off the end of the kernel stack > > problem). > > > > > This could be solved relatively easily by adding "-o mark" support > > > to overlayfs and allowing to mount shiftfs also over "marked" > > > overlayfs inside container. > > > > Can we please decided whether the temporary mark, as implemented in the > > current patch set or a more permanent security.<something> xattr type > > mark is preferred for this? It's an important question that's been > > asked, but we have no resolution on. > > I think something permanent is mandatory. Otherwise users may be able > to induce a reboot into a state where the temp mark isn't made. 
A > security.<something> xattr has the problem that an older kernel may not > know about it. > > Two possibilities which have been mentioned before: > > 1. just demand that the *source* idmap doesn't start at 0. Ideally it > would be something like 100k uids starting at 100k. The kernel would > refuse to do a shiftfs mount if the source idmap includes uid 0. > > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be > to just strongly recommend using such a source uid mapping, but not > enforce it. > > 2. Enforce that the base directory have perms 700 for shiftfs-mount to > be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with > 700 perms. Root then can shiftfs-mount it to /container1 uid-shifted to > 100000:0:100000. Yes, root could stupidly change the perms later, but > failing that uid 1000 can never exploit a setuid-root script under > /var/lib/shiftfs/base-rootfs > > -serge I'm personally in favor of this second approach, and it's what I was hoping to use for LXD in the future. It's trivial for us to apply that to the parent directory of the container (so that the rootfs itself can have a different mode), and it's something we're already doing today for privileged containers anyway (as those have the exact same issue with setuid binaries). Having the data on the host shifted to a map which is never used by any user on the system would make mistakes less likely, but it would also require you to know ahead of time how many uids/gids you'll need for all future containers that will use that filesystem. 100k for example would effectively prevent the use of many network authentication systems inside the container, as those tend to map to the 200k+ uid range. -- Stéphane Graber Ubuntu developer http://www.ubuntu.com [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-07-03 17:08 ` Stéphane Graber @ 2018-07-03 22:05 ` Serge E. Hallyn 0 siblings, 0 replies; 33+ messages in thread From: Serge E. Hallyn @ 2018-07-03 22:05 UTC (permalink / raw) To: Stéphane Graber Cc: Serge E. Hallyn, James Bottomley, Tyler Hicks, Linux Containers, Seth Forshee, Christian Brauner, linux-fsdevel Quoting Stéphane Graber (stgraber@ubuntu.com): > On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote: > > Quoting James Bottomley (James.Bottomley@HansenPartnership.com): > > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: > > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > [...] > > > > > > > > - Does not break inotify > > > > > > > > > > > > > > I don't expect it does, but I haven't checked. > > > > > > > > > > > > I haven't checked either; I'm planning to do so soon. This is a > > > > > > concern that was expressed to me by others, I think because > > > > > > inotify doesn't work with overlayfs. > > > > > > > > > > I think shiftfs does work simply because it doesn't really do > > > > > overlays, so lots of stuff that doesn't work with overlays does > > > > > work with it. > > > > > > > > > > > > > I'm afraid shiftfs suffers from the same problems that the old naive > > > > overlayfs inode implementation suffered from. > > > > > > > > This problem is demonstrated with LTP tests inotify08 inotify09. > > > > shiftfs_new_inode() is called on every lookup, so inotify watch > > > > may be set on an inode object, then dentry is evicted from cache > > > > and then all events on new dentry are not reported on the watched > > > > inode. You will need to implement hashed inodes to solve it. > > > > Can be done as overlay does - hashing by real inode pointer. 
> > > > > > > > This is just one of those subtle things about stacked fs and there > > > > may be other in present and more in future - if we don't have a > > > > shared code base for the two stacked fs, I wager you are going to end > > > > up "cherry picking" fixes often. > > > > > > > > IMO, an important question to ask is, since both shiftfs and > > > > overlayfs are strongly coupled with container use cases, are there > > > > users that are interested in both layering AND shifting? on the same > > > > "mark"? If the answer is yes, then this may be an argument in favor > > > > of integrating at least some of shittfs functionality into overlayfs. > > > > > > My container use case is interested in shifting but not layering. Even > > > the docker use case would only mix the two with the overlay graph > > > driver. There seem to be quite a few clouds using non overlayfs graph > > > drivers (the dm one being the most popular). > > > > > > > Another argument is that shiftfs itself takes the maximum allowed > > > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > > > > possible with current VFS limitation to combine shiftfs with > > > > overlayfs. > > > > > > That's an artificial, not an inherent, restriction that was introduced > > > to keep the call stack small. It can be increased or even eliminated > > > (although then we'd risk a real run off the end of the kernel stack > > > problem). > > > > > > > This could be solved relatively easily by adding "-o mark" support > > > > to overlayfs and allowing to mount shiftfs also over "marked" > > > > overlayfs inside container. > > > > > > Can we please decided whether the temporary mark, as implemented in the > > > current patch set or a more permanent security.<something> xattr type > > > mark is preferred for this? It's an important question that's been > > > asked, but we have no resolution on. > > > > I think something permanent is mandatory. 
Otherwise users may be able > > to induce a reboot into a state where the temp mark isn't made. A > > security.<something> xattr has the problem that an older kernel may not > > know about it. > > > > Two possibilities which have been mentioned before: > > > > 1. just demand that the *source* idmap doesn't start at 0. Ideally it > > would be something like 100k uids starting at 100k. The kernel would > > refuse to do a shiftfs mount if the source idmap includes uid 0. > > > > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be > > to just strongly recommend using such a source uid mapping, but not > > enforce it. > > > > 2. Enforce that the base directory have perms 700 for shiftfs-mount to > > be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with > > 700 perms. Root then can shiftfs-mount it to /container1 uid-shifted to > > 100000:0:100000. Yes, root could stupidly change the perms later, but > > failing that uid 1000 can never exploit a setuid-root script under > > /var/lib/shiftfs/base-rootfs. > > > > -serge > > I'm personally in favor of this second approach and it's what I was hoping to use > for LXD in the future. It's trivial for us to apply that to the parent > directory of the container (so that the rootfs itself can have a > different mode) and is something we're already doing today for > privileged containers anyway (as those have the exact same issue with > setuid binaries). > > Having the data on the host shifted to a map which is never used by any > user on the system would make mistakes less likely to happen, but it > would also require you to know ahead of time how many uids/gids you'll > need for all future containers that will use that filesystem. 100k for > example would effectively prevent the use of many network authentication > systems inside the container as those tend to map to the 200k+ uid > range. Well, maybe not if it's overlay-based. 
In that case we could do something where the base fs might only have 10k uids mapped, but the overlay has 300k. But that gets a bit funky :) ^ permalink raw reply [flat|nested] 33+ messages in thread
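Serge's option 1 above amounts to a simple admission check at mount time: refuse the shiftfs mount if the source idmap covers uid 0. The sketch below is purely illustrative — the helper name and the `(first_uid, count)` range representation are hypothetical, not taken from any shiftfs patch:

```python
# Sketch of the admission check from Serge's option 1: refuse a shiftfs
# mount whose *source* idmap includes uid 0. An idmap is modeled here as
# a list of (first_uid, count) ranges.
def source_map_allows_shift(ranges):
    """Return True only if no range in the source idmap covers uid 0."""
    return all(not (first <= 0 < first + count) for first, count in ranges)

# 100k uids starting at 100k, as Serge suggests: allowed
print(source_map_allows_shift([(100_000, 100_000)]))   # True
# a conventional full map that includes root: refused
print(source_map_allows_shift([(0, 65_536)]))          # False
```

Serge's softer "let-them-shoot-themselves-in-the-foot" variant would run the same check but merely warn rather than fail the mount.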
* Re: shiftfs status and future development 2018-06-14 18:44 shiftfs status and future development Seth Forshee 2018-06-15 13:56 ` Serge E. Hallyn @ 2018-06-15 14:54 ` Aleksa Sarai 2018-06-15 15:05 ` Seth Forshee 2018-06-15 15:28 ` James Bottomley 2 siblings, 1 reply; 33+ messages in thread From: Aleksa Sarai @ 2018-06-15 14:54 UTC (permalink / raw) To: Seth Forshee Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks, Christian Brauner [-- Attachment #1: Type: text/plain, Size: 1311 bytes --] On 2018-06-14, Seth Forshee <seth.forshee@canonical.com> wrote: > I wanted to inquire about the current status of shiftfs and the plans > for it moving forward. We'd like to have this functionality available > for use in lxd, and I'm interesetd in helping with development (or > picking up development if it's stalled). > > To start, is anyone still working on shiftfs or similar functionality? I > haven't found it in any git tree on kernel.org, and as far as mailing > list activity the last submission I can find is [1]. Is there anything > newer than this? James Bottomley demoed the current status of shiftfs at the last Linux Plumbers' Conference. Personally, it looked like some of the motivations behind why we needed a shiftfs (and what it should look like) were lost over time, and the result is that the patchset mutated and has effectively stalled development for over a year (as far as I know it hasn't been posted again in over a year, and nobody is carrying it in their tree). I agree with Serge that if we want to restart its development someone should write down what the requirements and goals are to avoid having a patchset which mutates over many review cycles. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 14:54 ` Aleksa Sarai @ 2018-06-15 15:05 ` Seth Forshee 0 siblings, 0 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-15 15:05 UTC (permalink / raw) To: Aleksa Sarai Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Sat, Jun 16, 2018 at 12:54:38AM +1000, Aleksa Sarai wrote: > On 2018-06-14, Seth Forshee <seth.forshee@canonical.com> wrote: > > I wanted to inquire about the current status of shiftfs and the plans > > for it moving forward. We'd like to have this functionality available > > for use in lxd, and I'm interesetd in helping with development (or > > picking up development if it's stalled). > > > > To start, is anyone still working on shiftfs or similar functionality? I > > haven't found it in any git tree on kernel.org, and as far as mailing > > list activity the last submission I can find is [1]. Is there anything > > newer than this? > > James Bottomley demoed the current status of shiftfs at the last Linux > Plumbers' Conference. Personally, it looked like some of the motivations > behind why we needed a shiftfs (and what it should look like) were lost > over time, and the result is that the patchset mutated and has > effectively stalled development for over a year (as far as I know it > hasn't been posted again in over a year, and nobody is carrying it in > their tree). That's my impression too, which I was attempting to confirm with my mail :-) > I agree with Serge that if we want to restart its development someone > should write down what the requirements and goals are to avoid having a > patchset which mutates over many review cycles. I agree, though some evolution during review is inevitable. I just responded to Serge with some of our requirements; if you have some to add I'd love to see them. Thanks, Seth ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-14 18:44 shiftfs status and future development Seth Forshee 2018-06-15 13:56 ` Serge E. Hallyn 2018-06-15 14:54 ` Aleksa Sarai @ 2018-06-15 15:28 ` James Bottomley 2018-06-15 15:46 ` Seth Forshee 2 siblings, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-15 15:28 UTC (permalink / raw) To: Seth Forshee, containers, linux-fsdevel; +Cc: Tyler Hicks, Christian Brauner On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote: > I wanted to inquire about the current status of shiftfs and the plans > for it moving forward. We'd like to have this functionality available > for use in lxd, and I'm interesetd in helping with development (or > picking up development if it's stalled). Well, I'm still using it and technically still developing it; it's just that it's self-contained and there haven't been any interface changes necessitating an update that I've noticed, so the last published patch still applies (well, unless my git rebasing quietly changed something and I didn't notice). > To start, is anyone still working on shiftfs or similar > functionality? I haven't found it in any git tree on kernel.org, and > as far as mailing list activity the last submission I can find is > [1]. Is there anything newer than this? I'm working on it, but it mostly works for my use case. > Based on past mailing list discussions, it seems like there was still > debate as to whether this feature should be an overlay filesystem or > something supported at the vfs level. Was this ever resolved? I think that's the main problem: I don't really have anything actionable to work on for upstreaming. I suspect growing more consumers would help with this. James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 15:28 ` James Bottomley @ 2018-06-15 15:46 ` Seth Forshee 2018-06-15 16:16 ` Christian Brauner 2018-06-15 16:35 ` James Bottomley 0 siblings, 2 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-15 15:46 UTC (permalink / raw) To: James Bottomley; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Fri, Jun 15, 2018 at 08:28:54AM -0700, James Bottomley wrote: > On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote: > > I wanted to inquire about the current status of shiftfs and the plans > > for it moving forward. We'd like to have this functionality available > > for use in lxd, and I'm interesetd in helping with development (or > > picking up development if it's stalled). > > Well, I'm still using it and technically still developing it; it's just > that its self contained and there haven't been any interface changes > necessitating an update that I've noticed, so the last published patch > still applies (well, unless my git rebasing quietly changed something > and I didn't notice). I did have to make some changes when I tried it out with 4.17, see below. > > To start, is anyone still working on shiftfs or similar > > functionality? I haven't found it in any git tree on kernel.org, and > > as far as mailing list activity the last submission I can find is > > [1]. Is there anything newer than this? > > I'm working on it, but it mostly works for my use case. > > > Based on past mailing list discussions, it seems like there was still > > debate as to whether this feature should be an overlay filesystem or > > something supported at the vfs level. Was this ever resolved? > > I think that's the main problem: I don't really have anything > actionable to work on for upstreaming. I suspect growing more > consumers would help with this. Kind of a chicken/egg problem, as it's difficult to grow consumers when it's not upstream. I'm starting to work with it though. 
However it seems to me that the question of whether or not this should be an overlay filesystem is pretty fundamental. There was a fairly long thread between you and Christoph Hellwig on the topic that didn't really seem to come to a resolution. Thanks, Seth >From 70352c549e9c023cd0c0732329a44f5bbcac0ace Mon Sep 17 00:00:00 2001 From: Seth Forshee <seth.forshee@canonical.com> Date: Thu, 24 May 2018 09:40:13 -0500 Subject: [PATCH] shiftfs: Forward port to 4.17-rc6 Signed-off-by: Seth Forshee <seth.forshee@canonical.com> --- fs/shiftfs.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index ea8ac57b3ce1..fa0797bd7880 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry *dentry) static struct dentry *shiftfs_d_real(struct dentry *dentry, const struct inode *inode, - unsigned int flags) + unsigned int open_flags, unsigned int flags) { struct dentry *real = dentry->d_fsdata; if (unlikely(real->d_flags & DCACHE_OP_REAL)) - return real->d_op->d_real(real, real->d_inode, flags); + return real->d_op->d_real(real, real->d_inode, open_flags, flags); return real; } @@ -545,19 +545,20 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr) return 0; } -static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry, - struct kstat *stat) +static int shiftfs_getattr(const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) { + struct dentry *dentry = path->dentry; + struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info; struct inode *inode = dentry->d_inode; struct dentry *real = inode->i_private; struct inode *reali = real->d_inode; const struct inode_operations *iop = reali->i_op; + struct path realpath = { .mnt = ssi->mnt, .dentry = real }; int err = 0; - mnt = dentry->d_sb->s_fs_info; - if (iop->getattr) - err = iop->getattr(mnt, real, stat); + err = iop->getattr(&realpath, stat, request_mask, flags); 
else generic_fillattr(reali, stat); @@ -625,7 +626,7 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode, * may be aliases plus a few others. */ if (reali) - use_inode_hash = ACCESS_ONCE(reali->i_nlink) > 1 && + use_inode_hash = READ_ONCE(reali->i_nlink) > 1 && !S_ISDIR(reali->i_mode); if (use_inode_hash) { -- 2.17.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 15:46 ` Seth Forshee @ 2018-06-15 16:16 ` Christian Brauner 2018-06-15 16:35 ` James Bottomley 1 sibling, 0 replies; 33+ messages in thread From: Christian Brauner @ 2018-06-15 16:16 UTC (permalink / raw) To: Seth Forshee; +Cc: James Bottomley, linux-fsdevel, Tyler Hicks, containers, hch On Fri, Jun 15, 2018 at 10:46:41AM -0500, Seth Forshee wrote: > On Fri, Jun 15, 2018 at 08:28:54AM -0700, James Bottomley wrote: > > On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote: > > > I wanted to inquire about the current status of shiftfs and the plans > > > for it moving forward. We'd like to have this functionality available > > > for use in lxd, and I'm interesetd in helping with development (or > > > picking up development if it's stalled). > > > > Well, I'm still using it and technically still developing it; it's just > > that its self contained and there haven't been any interface changes > > necessitating an update that I've noticed, so the last published patch > > still applies (well, unless my git rebasing quietly changed something > > and I didn't notice). > > I did have to make some changes when I tried it out with 4.17, see > below. > > > > To start, is anyone still working on shiftfs or similar > > > functionality? I haven't found it in any git tree on kernel.org, and > > > as far as mailing list activity the last submission I can find is > > > [1]. Is there anything newer than this? > > > > I'm working on it, but it mostly works for my use case. > > > > > Based on past mailing list discussions, it seems like there was still > > > debate as to whether this feature should be an overlay filesystem or > > > something supported at the vfs level. Was this ever resolved? > > > > I think that's the main problem: I don't really have anything > > actionable to work on for upstreaming. I suspect growing more > > consumers would help with this. 
> > Kind of a chicken/egg problem, as it's difficult to grow consumers when Exactly, but if we have it it'll basically enable a whole bunch of use cases all at once. This is an essential feature to make user namespaces easier to use without compromising on security. Right now, the easiest way to be able to write to a mapped fstree is to punch a hole for the {g,u}id in question into the idmap, but that is obviously not ideal. It won't matter that much if the uid in question is 1000, but if your uid is 0 it very much does. > it's not upstream. I'm starting to work with it though. > > However it seems to me that the question of whether or not this should > be an overlay filesystem is pretty fundamental. There was a fairly long Yes, I think this is something we need to settle first and we have already been going back and forth on this. A rough agreement on what the preferred solution is would be good. There'll surely also be a conference where we can present this in more detail and also have a face-to-face discussion. > thread between you and Christoph Hellwig on the topic that didn't really I'm CCing Christoph. He had some good insights in prior threads. Christian > seem to come to a resolution. 
> > Thanks, > Seth > > > From 70352c549e9c023cd0c0732329a44f5bbcac0ace Mon Sep 17 00:00:00 2001 > From: Seth Forshee <seth.forshee@canonical.com> > Date: Thu, 24 May 2018 09:40:13 -0500 > Subject: [PATCH] shiftfs: Forward port to 4.17-rc6 > > Signed-off-by: Seth Forshee <seth.forshee@canonical.com> > --- > fs/shiftfs.c | 17 +++++++++-------- > 1 file changed, 9 insertions(+), 8 deletions(-) > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > index ea8ac57b3ce1..fa0797bd7880 100644 > --- a/fs/shiftfs.c > +++ b/fs/shiftfs.c > @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry *dentry) > > static struct dentry *shiftfs_d_real(struct dentry *dentry, > const struct inode *inode, > - unsigned int flags) > + unsigned int open_flags, unsigned int flags) > { > struct dentry *real = dentry->d_fsdata; > > if (unlikely(real->d_flags & DCACHE_OP_REAL)) > - return real->d_op->d_real(real, real->d_inode, flags); > + return real->d_op->d_real(real, real->d_inode, open_flags, flags); > > return real; > } > @@ -545,19 +545,20 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr) > return 0; > } > > -static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry, > - struct kstat *stat) > +static int shiftfs_getattr(const struct path *path, struct kstat *stat, > + u32 request_mask, unsigned int flags) > { > + struct dentry *dentry = path->dentry; > + struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info; > struct inode *inode = dentry->d_inode; > struct dentry *real = inode->i_private; > struct inode *reali = real->d_inode; > const struct inode_operations *iop = reali->i_op; > + struct path realpath = { .mnt = ssi->mnt, .dentry = real }; > int err = 0; > > - mnt = dentry->d_sb->s_fs_info; > - > if (iop->getattr) > - err = iop->getattr(mnt, real, stat); > + err = iop->getattr(&realpath, stat, request_mask, flags); > else > generic_fillattr(reali, stat); > > @@ -625,7 +626,7 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, 
umode_t mode, > * may be aliases plus a few others. > */ > if (reali) > - use_inode_hash = ACCESS_ONCE(reali->i_nlink) > 1 && > + use_inode_hash = READ_ONCE(reali->i_nlink) > 1 && > !S_ISDIR(reali->i_mode); > > if (use_inode_hash) { > -- > 2.17.0 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 33+ messages in thread
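The idmap hole-punching Christian describes earlier in this message can be sketched concretely: take the container's normal shifted map and carve a single {g,u}id out of it so that one id maps through 1:1 between host and container (roughly what LXD's raw.idmap setting arranges). The helper below is a hypothetical illustration only, using newuidmap-style (container_first, host_first, count) triples:

```python
# Illustrative sketch of "punching a hole" into a container idmap so one
# uid maps 1:1 on both sides. Entries are (container_first, host_first,
# count) triples, as written to /proc/<pid>/uid_map by newuidmap.
def punch_hole(entries, uid):
    """Split a container->host id map so `uid` maps identically on both sides."""
    out = []
    for cfirst, hfirst, count in entries:
        if cfirst <= uid < cfirst + count:
            before = uid - cfirst
            if before:
                out.append((cfirst, hfirst, before))
            out.append((uid, uid, 1))  # the punched hole: identity mapping
            after = count - before - 1
            if after:
                out.append((uid + 1, hfirst + before + 1, after))
        else:
            out.append((cfirst, hfirst, count))
    return out

# default shifted map (container 0..65535 -> host 100000..165535),
# with a hole punched so container uid 1000 is host uid 1000
print(punch_hole([(0, 100_000, 65_536)], 1000))
# [(0, 100000, 1000), (1000, 1000, 1), (1001, 101001, 64535)]
```

The downside Christian points out is visible in the result: the punched id is now shared between host and container, which is tolerable for uid 1000 but unacceptable for uid 0.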
* Re: shiftfs status and future development 2018-06-15 15:46 ` Seth Forshee 2018-06-15 16:16 ` Christian Brauner @ 2018-06-15 16:35 ` James Bottomley 2018-06-15 20:17 ` Seth Forshee 1 sibling, 1 reply; 33+ messages in thread From: James Bottomley @ 2018-06-15 16:35 UTC (permalink / raw) To: Seth Forshee; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Fri, 2018-06-15 at 10:46 -0500, Seth Forshee wrote: [...] > Subject: [PATCH] shiftfs: Forward port to 4.17-rc6 > > Signed-off-by: Seth Forshee <seth.forshee@canonical.com> > --- > fs/shiftfs.c | 17 +++++++++-------- > 1 file changed, 9 insertions(+), 8 deletions(-) > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > index ea8ac57b3ce1..fa0797bd7880 100644 > --- a/fs/shiftfs.c > +++ b/fs/shiftfs.c > @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry > *dentry) > > static struct dentry *shiftfs_d_real(struct dentry *dentry, > const struct inode *inode, > - unsigned int flags) > + unsigned int open_flag Well, that's an oopsie: I already have this change in my code base. I think I probably did it at the last minute for a conference presentation (the log indicates it was done in the 4.14 timeframe) and then forgot to repost the patches, which would explain why it's still nicely compiling for me ... I can repost so we've got code to hang the discussion on. James ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: shiftfs status and future development 2018-06-15 16:35 ` James Bottomley @ 2018-06-15 20:17 ` Seth Forshee 0 siblings, 0 replies; 33+ messages in thread From: Seth Forshee @ 2018-06-15 20:17 UTC (permalink / raw) To: James Bottomley; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner On Fri, Jun 15, 2018 at 09:35:49AM -0700, James Bottomley wrote: > On Fri, 2018-06-15 at 10:46 -0500, Seth Forshee wrote: > [...] > > Subject: [PATCH] shiftfs: Forward port to 4.17-rc6 > > > > Signed-off-by: Seth Forshee <seth.forshee@canonical.com> > > --- > > fs/shiftfs.c | 17 +++++++++-------- > > 1 file changed, 9 insertions(+), 8 deletions(-) > > > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > > index ea8ac57b3ce1..fa0797bd7880 100644 > > --- a/fs/shiftfs.c > > +++ b/fs/shiftfs.c > > @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry > > *dentry) > > > > static struct dentry *shiftfs_d_real(struct dentry *dentry, > > const struct inode *inode, > > - unsigned int flags) > > + unsigned int open_flag > > Well, that's an oopsie: I already have this change in my code base. I > think I probably did it at the last minute for a conference > presentation (the log indicates it was done in the 4.14 timeframe) and > then forgot to repost the patches, which would explain why it's still > nicely compiling for me ... > > I can repost so we've got code to hang the discussion on. That would be great, thanks! Seth ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2018-07-03 22:05 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-06-14 18:44 shiftfs status and future development Seth Forshee 2018-06-15 13:56 ` Serge E. Hallyn 2018-06-15 14:59 ` Seth Forshee 2018-06-15 15:25 ` Matthew Wilcox 2018-06-15 15:56 ` Aleksa Sarai 2018-06-15 16:09 ` James Bottomley 2018-06-15 17:04 ` Aleksa Sarai 2018-06-15 17:22 ` James Bottomley 2018-06-15 20:47 ` Seth Forshee 2018-06-15 21:09 ` James Bottomley 2018-06-15 21:35 ` Seth Forshee 2018-06-16 3:03 ` James Bottomley 2018-06-18 13:40 ` Seth Forshee 2018-06-18 13:49 ` Amir Goldstein 2018-06-18 14:56 ` James Bottomley 2018-06-18 16:03 ` Seth Forshee 2018-06-18 17:11 ` Amir Goldstein 2018-06-18 19:53 ` Phil Estes 2018-06-21 20:16 ` Seth Forshee 2018-06-24 11:32 ` Amir Goldstein 2018-06-25 11:19 ` Christian Brauner 2018-06-27 7:48 ` James Bottomley 2018-06-27 10:17 ` Amir Goldstein 2018-07-03 16:54 ` Serge E. Hallyn 2018-07-03 17:08 ` Stéphane Graber 2018-07-03 22:05 ` Serge E. Hallyn 2018-06-15 14:54 ` Aleksa Sarai 2018-06-15 15:05 ` Seth Forshee 2018-06-15 15:28 ` James Bottomley 2018-06-15 15:46 ` Seth Forshee 2018-06-15 16:16 ` Christian Brauner 2018-06-15 16:35 ` James Bottomley 2018-06-15 20:17 ` Seth Forshee