linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* shiftfs status and future development
@ 2018-06-14 18:44 Seth Forshee
  2018-06-15 13:56 ` Serge E. Hallyn
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-14 18:44 UTC (permalink / raw)
  To: James Bottomley, containers, linux-fsdevel
  Cc: Stéphane Graber, Christian Brauner, Tyler Hicks

I wanted to inquire about the current status of shiftfs and the plans
for it moving forward. We'd like to have this functionality available
for use in lxd, and I'm interesetd in helping with development (or
picking up development if it's stalled).

To start, is anyone still working on shiftfs or similar functionality? I
haven't found it in any git tree on kernel.org, and as far as mailing
list activity the last submission I can find is [1]. Is there anything
newer than this?

Based on past mailing list discussions, it seems like there was still
debate as to whether this feature should be an overlay filesystem or
something supported at the vfs level. Was this ever resolved?

Thanks,
Seth

[1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-14 18:44 shiftfs status and future development Seth Forshee
@ 2018-06-15 13:56 ` Serge E. Hallyn
  2018-06-15 14:59   ` Seth Forshee
  2018-06-15 14:54 ` Aleksa Sarai
  2018-06-15 15:28 ` James Bottomley
  2 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2018-06-15 13:56 UTC (permalink / raw)
  To: Seth Forshee
  Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks,
	Christian Brauner

Quoting Seth Forshee (seth.forshee@canonical.com):
> I wanted to inquire about the current status of shiftfs and the plans
> for it moving forward. We'd like to have this functionality available
> for use in lxd, and I'm interesetd in helping with development (or
> picking up development if it's stalled).
> 
> To start, is anyone still working on shiftfs or similar functionality? I
> haven't found it in any git tree on kernel.org, and as far as mailing
> list activity the last submission I can find is [1]. Is there anything
> newer than this?
> 
> Based on past mailing list discussions, it seems like there was still
> debate as to whether this feature should be an overlay filesystem or
> something supported at the vfs level. Was this ever resolved?
> 
> Thanks,
> Seth
> 
> [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com

Hey Seth,

I haven't heard anything in a long time.  But if this is going to pick
back up, can we come up with a detailed set of goals and requirements?
I don't recall whether the last version still worked like this, but I'm
still not comfortable with the idea of a system where after a reboot,
container-created root-owned files are owned by host root until a path
is specially marked.  Enforcing that the "source" directory is itself
uid-shifted would greatly ease my mind.

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-14 18:44 shiftfs status and future development Seth Forshee
  2018-06-15 13:56 ` Serge E. Hallyn
@ 2018-06-15 14:54 ` Aleksa Sarai
  2018-06-15 15:05   ` Seth Forshee
  2018-06-15 15:28 ` James Bottomley
  2 siblings, 1 reply; 33+ messages in thread
From: Aleksa Sarai @ 2018-06-15 14:54 UTC (permalink / raw)
  To: Seth Forshee
  Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks,
	Christian Brauner

[-- Attachment #1: Type: text/plain, Size: 1311 bytes --]

On 2018-06-14, Seth Forshee <seth.forshee@canonical.com> wrote:
> I wanted to inquire about the current status of shiftfs and the plans
> for it moving forward. We'd like to have this functionality available
> for use in lxd, and I'm interesetd in helping with development (or
> picking up development if it's stalled).
> 
> To start, is anyone still working on shiftfs or similar functionality? I
> haven't found it in any git tree on kernel.org, and as far as mailing
> list activity the last submission I can find is [1]. Is there anything
> newer than this?

James Bottomley demoed the current status of shiftfs at the last Linux
Plumbers' Conference. Personally, it looked like some of the motivations
behind why we needed a shiftfs (and what it should look like) were lost
over time, and the result is that the patchset mutated and has
effectively stalled development for over a year (as far as I know it
hasn't been posted again in over a year, and nobody is carrying it in
their tree).

I agree with Serge that if we want to restart its development someone
should write down what the requirements and goals are to avoid having a
patchset which mutates over many review cycles.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 13:56 ` Serge E. Hallyn
@ 2018-06-15 14:59   ` Seth Forshee
  2018-06-15 15:25     ` Matthew Wilcox
  2018-06-16  3:03     ` James Bottomley
  0 siblings, 2 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 14:59 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks,
	Christian Brauner

On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.forshee@canonical.com):
> > I wanted to inquire about the current status of shiftfs and the plans
> > for it moving forward. We'd like to have this functionality available
> > for use in lxd, and I'm interesetd in helping with development (or
> > picking up development if it's stalled).
> > 
> > To start, is anyone still working on shiftfs or similar functionality? I
> > haven't found it in any git tree on kernel.org, and as far as mailing
> > list activity the last submission I can find is [1]. Is there anything
> > newer than this?
> > 
> > Based on past mailing list discussions, it seems like there was still
> > debate as to whether this feature should be an overlay filesystem or
> > something supported at the vfs level. Was this ever resolved?
> > 
> > Thanks,
> > Seth
> > 
> > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com
> 
> Hey Seth,
> 
> I haven't heard anything in a long time.  But if this is going to pick
> back up, can we come up with a detailed set of goals and requirements?

I was planning to follow up later with some discussion of requirements.
Here are some of ours:

 - Supports any id maps possible for a user namespace

 - Does not break inotify

 - Passes accurate disk usage and source information from the "underlay"

 - Works with a variety of filesystems (ext4, xfx, btrfs, etc.)

 - Works with nested containers

I'm also interested in collecting any requirements others might have.

> I don't recall whether the last version still worked like this, but I'm
> still not comfortable with the idea of a system where after a reboot,
> container-created root-owned files are owned by host root until a path
> is specially marked.  Enforcing that the "source" directory is itself
> uid-shifted would greatly ease my mind.

I understand the concern and share the discomfort to some degree, but
I'm not convinced that requiring the source subtree be shifted is the
right approach.

First, let's address the marking question. As you stated, an approach
that leaves the subree unmarked for a period of time is problematic, and
imo this is a fatal flaw with marking as a protection for e.g. execing
some suid root file written by a container. Writing some such mark to
the filesystem would make it persistent, but it could also limit the
support to a limited set of filesystems.

However, I do think it's necessary for a user with sufficient
capabilities to "bless" a subtree for mounting in a less privileged
context, so this is a feature of marking that I would like to keep. I
think the new mount apis in David Howells' filesystem context patches [1]
might give us a nicer way to do this. For example, root in init_user_ns
could set up a mount fd which specifies the source subtree for the id
shift. At that time the kernel could check for ns_capable(sb->s_user_ns,
CAP_SYS_ADMIN) for the filesystem containing the source subtree. Then
the fd could be passed to a container in a user namespace, who could use
it to attach the mount to its filesystem tree.  The same concept could
be extended to nested containers, as long as the user setting the source
subtree has CAP_SYS_ADMIN towards sb->s_user_ns for the subtree.

Now back to reuiring the srouce subtree be id shifted. I understand the
motivation for wanting this, but I'm not sure I'm in favor of it. To
start, there are other ways to ensure that id shifted mounts don't lead
to problems, such as putting the subtree under a directory accessible
only by root or putting it in a nosuid or noexec mount. For some
implementations those sorts of protections are going to make sense.

Having this requirement may also add significant time to mounting, as I
assume it would involve iterating through all filesystem objects.

Additionally, that requirement is likely to significantly complicate the
implementation. The simplest implementation would just translate the
k[ug]ids in the inodes to a target user ns. A slightly more complicated
approach might translate them based on a source and destination user ns.
If it's implemented based on passing in an arbitrary id map at mount
time it will be more complex and duplicate functionality that user
namespaces already give us.

Thanks,
Seth

[1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.stgit@warthog.procyon.org.uk

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 14:54 ` Aleksa Sarai
@ 2018-06-15 15:05   ` Seth Forshee
  0 siblings, 0 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 15:05 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: James Bottomley, containers, linux-fsdevel, Tyler Hicks,
	Christian Brauner

On Sat, Jun 16, 2018 at 12:54:38AM +1000, Aleksa Sarai wrote:
> On 2018-06-14, Seth Forshee <seth.forshee@canonical.com> wrote:
> > I wanted to inquire about the current status of shiftfs and the plans
> > for it moving forward. We'd like to have this functionality available
> > for use in lxd, and I'm interesetd in helping with development (or
> > picking up development if it's stalled).
> > 
> > To start, is anyone still working on shiftfs or similar functionality? I
> > haven't found it in any git tree on kernel.org, and as far as mailing
> > list activity the last submission I can find is [1]. Is there anything
> > newer than this?
> 
> James Bottomley demoed the current status of shiftfs at the last Linux
> Plumbers' Conference. Personally, it looked like some of the motivations
> behind why we needed a shiftfs (and what it should look like) were lost
> over time, and the result is that the patchset mutated and has
> effectively stalled development for over a year (as far as I know it
> hasn't been posted again in over a year, and nobody is carrying it in
> their tree).

That's my impression too, which I was attempting to confirm with my mail
:-)

> I agree with Serge that if we want to restart its development someone
> should write down what the requirements and goals are to avoid having a
> patchset which mutates over many review cycles.

I agree, though some evolution during review is inevitable. I just
responed to Serge with some of our requirements, if you have some to add
I'd love to see them.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 14:59   ` Seth Forshee
@ 2018-06-15 15:25     ` Matthew Wilcox
  2018-06-15 15:56       ` Aleksa Sarai
  2018-06-15 16:09       ` James Bottomley
  2018-06-16  3:03     ` James Bottomley
  1 sibling, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2018-06-15 15:25 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Serge E. Hallyn, James Bottomley, containers, linux-fsdevel,
	Tyler Hicks, Christian Brauner

On Fri, Jun 15, 2018 at 09:59:17AM -0500, Seth Forshee wrote:
>  - Supports any id maps possible for a user namespace

Have we already ruled out storing the container's UID/GID/perms in an
extended attribute, and having all the files owned by the owner of the
container from the perspective of the unshifted fs.  Then shiftfs reads
the xattr and presents the files with the container's idea of what the
UID is?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-14 18:44 shiftfs status and future development Seth Forshee
  2018-06-15 13:56 ` Serge E. Hallyn
  2018-06-15 14:54 ` Aleksa Sarai
@ 2018-06-15 15:28 ` James Bottomley
  2018-06-15 15:46   ` Seth Forshee
  2 siblings, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-15 15:28 UTC (permalink / raw)
  To: Seth Forshee, containers, linux-fsdevel; +Cc: Tyler Hicks, Christian Brauner

On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote:
> I wanted to inquire about the current status of shiftfs and the plans
> for it moving forward. We'd like to have this functionality available
> for use in lxd, and I'm interesetd in helping with development (or
> picking up development if it's stalled).

Well, I'm still using it and technically still developing it; it's just
that its self contained and there haven't been any interface changes
necessitating an update that I've noticed, so the last published patch
still applies (well, unless my git rebasing quietly changed something
and I didn't notice).

> To start, is anyone still working on shiftfs or similar
> functionality? I haven't found it in any git tree on kernel.org, and
> as far as mailing list activity the last submission I can find is
> [1]. Is there anything newer than this?

I'm working on it, but it mostly works for my use case.

> Based on past mailing list discussions, it seems like there was still
> debate as to whether this feature should be an overlay filesystem or
> something supported at the vfs level. Was this ever resolved?

I think that's the main problem: I don't really have anything
actionable to work on for upstreaming.  I suspect growing more
consumers would help with this.

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 15:28 ` James Bottomley
@ 2018-06-15 15:46   ` Seth Forshee
  2018-06-15 16:16     ` Christian Brauner
  2018-06-15 16:35     ` James Bottomley
  0 siblings, 2 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 15:46 UTC (permalink / raw)
  To: James Bottomley; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner

On Fri, Jun 15, 2018 at 08:28:54AM -0700, James Bottomley wrote:
> On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote:
> > I wanted to inquire about the current status of shiftfs and the plans
> > for it moving forward. We'd like to have this functionality available
> > for use in lxd, and I'm interesetd in helping with development (or
> > picking up development if it's stalled).
> 
> Well, I'm still using it and technically still developing it; it's just
> that its self contained and there haven't been any interface changes
> necessitating an update that I've noticed, so the last published patch
> still applies (well, unless my git rebasing quietly changed something
> and I didn't notice).

I did have to make some changes when I tried it out with 4.17, see
below.

> > To start, is anyone still working on shiftfs or similar
> > functionality? I haven't found it in any git tree on kernel.org, and
> > as far as mailing list activity the last submission I can find is
> > [1]. Is there anything newer than this?
> 
> I'm working on it, but it mostly works for my use case.
> 
> > Based on past mailing list discussions, it seems like there was still
> > debate as to whether this feature should be an overlay filesystem or
> > something supported at the vfs level. Was this ever resolved?
> 
> I think that's the main problem: I don't really have anything
> actionable to work on for upstreaming.  I suspect growing more
> consumers would help with this.

Kind of a chicken/egg problem, as it's difficult to grow consumers when
it's not upstream. I'm starting to work with it though.

However it seems to me that the question of whether or not this should
be an overlay filesystem is pretty fundamental. There was a fairly long
thread between you and Christoph Hellwig on the topic that didn't really
seem to come to a resolution.

Thanks,
Seth


>From 70352c549e9c023cd0c0732329a44f5bbcac0ace Mon Sep 17 00:00:00 2001
From: Seth Forshee <seth.forshee@canonical.com>
Date: Thu, 24 May 2018 09:40:13 -0500
Subject: [PATCH] shiftfs: Forward port to 4.17-rc6

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
---
 fs/shiftfs.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index ea8ac57b3ce1..fa0797bd7880 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry *dentry)
 
 static struct dentry *shiftfs_d_real(struct dentry *dentry,
 				     const struct inode *inode,
-				     unsigned int flags)
+				     unsigned int open_flags, unsigned int flags)
 {
 	struct dentry *real = dentry->d_fsdata;
 
 	if (unlikely(real->d_flags & DCACHE_OP_REAL))
-		return real->d_op->d_real(real, real->d_inode, flags);
+		return real->d_op->d_real(real, real->d_inode, open_flags, flags);
 
 	return real;
 }
@@ -545,19 +545,20 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
 	return 0;
 }
 
-static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
-			   struct kstat *stat)
+static int shiftfs_getattr(const struct path *path, struct kstat *stat,
+			   u32 request_mask, unsigned int flags)
 {
+	struct dentry *dentry = path->dentry;
+	struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info;
 	struct inode *inode = dentry->d_inode;
 	struct dentry *real = inode->i_private;
 	struct inode *reali = real->d_inode;
 	const struct inode_operations *iop = reali->i_op;
+	struct path realpath = { .mnt = ssi->mnt, .dentry = real };
 	int err = 0;
 
-	mnt = dentry->d_sb->s_fs_info;
-
 	if (iop->getattr)
-		err = iop->getattr(mnt, real, stat);
+		err = iop->getattr(&realpath, stat, request_mask, flags);
 	else
 		generic_fillattr(reali, stat);
 
@@ -625,7 +626,7 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
 	 * may be aliases plus a few others.
 	 */
 	if (reali)
-		use_inode_hash = ACCESS_ONCE(reali->i_nlink) > 1 &&
+		use_inode_hash = READ_ONCE(reali->i_nlink) > 1 &&
 			!S_ISDIR(reali->i_mode);
 
 	if (use_inode_hash) {
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 15:25     ` Matthew Wilcox
@ 2018-06-15 15:56       ` Aleksa Sarai
  2018-06-15 16:09       ` James Bottomley
  1 sibling, 0 replies; 33+ messages in thread
From: Aleksa Sarai @ 2018-06-15 15:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Seth Forshee, containers, James Bottomley, Tyler Hicks,
	Christian Brauner, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

On 2018-06-15, Matthew Wilcox <willy@infradead.org> wrote:
> >  - Supports any id maps possible for a user namespace
> 
> Have we already ruled out storing the container's UID/GID/perms in an
> extended attribute, and having all the files owned by the owner of the
> container from the perspective of the unshifted fs.  Then shiftfs reads
> the xattr and presents the files with the container's idea of what the
> UID is?

I think, while simple, this idea has the problem that you couldn't
really have a single directory be shifted more than once without copying
it (or using an overlayfs which is then shiftfs'd). So for the usecase
of giving each container on a system a unique allocation of host uids
and gids (while using the same image storage) you would run into some
issues.

It does remind me of something similar we do as part of the "rootless
containers" project -- we have "user.rootlesscontainers" which contains a
protobuf payload with the "owner" information. Though in rootless
containers we are using this xattr for something quite different: faking
chown(2) and similar operations to make it look as though an
unprivileged user namespace contains more than one user.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 15:25     ` Matthew Wilcox
  2018-06-15 15:56       ` Aleksa Sarai
@ 2018-06-15 16:09       ` James Bottomley
  2018-06-15 17:04         ` Aleksa Sarai
  1 sibling, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-15 16:09 UTC (permalink / raw)
  To: Matthew Wilcox, Seth Forshee
  Cc: containers, Tyler Hicks, Christian Brauner, linux-fsdevel

On Fri, 2018-06-15 at 08:25 -0700, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 09:59:17AM -0500, Seth Forshee wrote:
> >  - Supports any id maps possible for a user namespace
> 
> Have we already ruled out storing the container's UID/GID/perms in an
> extended attribute, and having all the files owned by the owner of
> the container from the perspective of the unshifted fs.  Then shiftfs
> reads the xattr and presents the files with the container's idea of
> what the UID is?

I've got an experimental patch set that does the *mark* as an xattr. 
However the shift is still done through s_userns, which allows for
multiple shifts.

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 15:46   ` Seth Forshee
@ 2018-06-15 16:16     ` Christian Brauner
  2018-06-15 16:35     ` James Bottomley
  1 sibling, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2018-06-15 16:16 UTC (permalink / raw)
  To: Seth Forshee; +Cc: James Bottomley, linux-fsdevel, Tyler Hicks, containers, hch

On Fri, Jun 15, 2018 at 10:46:41AM -0500, Seth Forshee wrote:
> On Fri, Jun 15, 2018 at 08:28:54AM -0700, James Bottomley wrote:
> > On Thu, 2018-06-14 at 13:44 -0500, Seth Forshee wrote:
> > > I wanted to inquire about the current status of shiftfs and the plans
> > > for it moving forward. We'd like to have this functionality available
> > > for use in lxd, and I'm interesetd in helping with development (or
> > > picking up development if it's stalled).
> > 
> > Well, I'm still using it and technically still developing it; it's just
> > that its self contained and there haven't been any interface changes
> > necessitating an update that I've noticed, so the last published patch
> > still applies (well, unless my git rebasing quietly changed something
> > and I didn't notice).
> 
> I did have to make some changes when I tried it out with 4.17, see
> below.
> 
> > > To start, is anyone still working on shiftfs or similar
> > > functionality? I haven't found it in any git tree on kernel.org, and
> > > as far as mailing list activity the last submission I can find is
> > > [1]. Is there anything newer than this?
> > 
> > I'm working on it, but it mostly works for my use case.
> > 
> > > Based on past mailing list discussions, it seems like there was still
> > > debate as to whether this feature should be an overlay filesystem or
> > > something supported at the vfs level. Was this ever resolved?
> > 
> > I think that's the main problem: I don't really have anything
> > actionable to work on for upstreaming.  I suspect growing more
> > consumers would help with this.
> 
> Kind of a chicken/egg problem, as it's difficult to grow consumers when

Exactly, but if we have it it'll basically enable a whole bunch of
usecases all at once. This is an essential feature to make user
namespaces easier to use without compromising on security. Right now, to
be able to write to a mapped fstree the easiest way is to punch a hole
for the {g,u}id in question into the idmap but that is obviously not
ideal. It won't matter that much if the uid in question is 1000 but if
your uid is 0 it very much does.

> it's not upstream. I'm starting to work with it though.
> 
> However it seems to me that the question of whether or not this should
> be an overlay filesystem is pretty fundamental. There was a fairly long

Yes, I think this is something we need to settle first and we have
already been going back and forth on this. An rough agreement on what
the preferred solution is would be good. There'll surely also be a
conference where we can present this in more detail and also have a
face-to-face discussion.

> thread between you and Christoph Hellwig on the topic that didn't really

I'M CCing Christoph. He had some good insights in prior threads.

Christian

> seem to come to a resolution.
> 
> Thanks,
> Seth
> 
> 
> From 70352c549e9c023cd0c0732329a44f5bbcac0ace Mon Sep 17 00:00:00 2001
> From: Seth Forshee <seth.forshee@canonical.com>
> Date: Thu, 24 May 2018 09:40:13 -0500
> Subject: [PATCH] shiftfs: Forward port to 4.17-rc6
> 
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> ---
>  fs/shiftfs.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index ea8ac57b3ce1..fa0797bd7880 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry *dentry)
>  
>  static struct dentry *shiftfs_d_real(struct dentry *dentry,
>  				     const struct inode *inode,
> -				     unsigned int flags)
> +				     unsigned int open_flags, unsigned int flags)
>  {
>  	struct dentry *real = dentry->d_fsdata;
>  
>  	if (unlikely(real->d_flags & DCACHE_OP_REAL))
> -		return real->d_op->d_real(real, real->d_inode, flags);
> +		return real->d_op->d_real(real, real->d_inode, open_flags, flags);
>  
>  	return real;
>  }
> @@ -545,19 +545,20 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr)
>  	return 0;
>  }
>  
> -static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
> -			   struct kstat *stat)
> +static int shiftfs_getattr(const struct path *path, struct kstat *stat,
> +			   u32 request_mask, unsigned int flags)
>  {
> +	struct dentry *dentry = path->dentry;
> +	struct shiftfs_super_info *ssi = dentry->d_sb->s_fs_info;
>  	struct inode *inode = dentry->d_inode;
>  	struct dentry *real = inode->i_private;
>  	struct inode *reali = real->d_inode;
>  	const struct inode_operations *iop = reali->i_op;
> +	struct path realpath = { .mnt = ssi->mnt, .dentry = real };
>  	int err = 0;
>  
> -	mnt = dentry->d_sb->s_fs_info;
> -
>  	if (iop->getattr)
> -		err = iop->getattr(mnt, real, stat);
> +		err = iop->getattr(&realpath, stat, request_mask, flags);
>  	else
>  		generic_fillattr(reali, stat);
>  
> @@ -625,7 +626,7 @@ static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
>  	 * may be aliases plus a few others.
>  	 */
>  	if (reali)
> -		use_inode_hash = ACCESS_ONCE(reali->i_nlink) > 1 &&
> +		use_inode_hash = READ_ONCE(reali->i_nlink) > 1 &&
>  			!S_ISDIR(reali->i_mode);
>  
>  	if (use_inode_hash) {
> -- 
> 2.17.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 15:46   ` Seth Forshee
  2018-06-15 16:16     ` Christian Brauner
@ 2018-06-15 16:35     ` James Bottomley
  2018-06-15 20:17       ` Seth Forshee
  1 sibling, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-15 16:35 UTC (permalink / raw)
  To: Seth Forshee; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner

On Fri, 2018-06-15 at 10:46 -0500, Seth Forshee wrote:
[...]
> Subject: [PATCH] shiftfs: Forward port to 4.17-rc6
> 
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> ---
>  fs/shiftfs.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index ea8ac57b3ce1..fa0797bd7880 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry
> *dentry)
>  
>  static struct dentry *shiftfs_d_real(struct dentry *dentry,
>  				     const struct inode *inode,
> -				     unsigned int flags)
> +				     unsigned int open_flag

Well, that's an oopsie: I already have this change in my code base.  I
think I probably did it at the last minute for a conference
presentation (the log indicates it was done in the 4.14 timeframe) and
then forgot to repost the patches, which would explain why it's still
nicely compiling for me ...

I can repost so we've got code to hang the discussion on.

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 16:09       ` James Bottomley
@ 2018-06-15 17:04         ` Aleksa Sarai
  2018-06-15 17:22           ` James Bottomley
  0 siblings, 1 reply; 33+ messages in thread
From: Aleksa Sarai @ 2018-06-15 17:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: Matthew Wilcox, Seth Forshee, Tyler Hicks, linux-fsdevel,
	containers, Christian Brauner

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > >  - Supports any id maps possible for a user namespace
> > 
> > Have we already ruled out storing the container's UID/GID/perms in an
> > extended attribute, and having all the files owned by the owner of
> > the container from the perspective of the unshifted fs.  Then shiftfs
> > reads the xattr and presents the files with the container's idea of
> > what the UID is?
> 
> I've got an experimental patch set that does the *mark* as an xattr. 

I forgot to ask you about this when we all met face-to-face -- can you
go over what the purpose of marking the mounts before being able to
shifts is? When I saw your demo at LPC I was quite confused about what
it was doing (I think you mentioned it was a security feature, but I
must admit I didn't follow the explanation).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 17:04         ` Aleksa Sarai
@ 2018-06-15 17:22           ` James Bottomley
  2018-06-15 20:47             ` Seth Forshee
  0 siblings, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-15 17:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: containers, Matthew Wilcox, Seth Forshee, Christian Brauner,
	Tyler Hicks, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2798 bytes --]

On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote:
> On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com
> > wrote:
> > > >  - Supports any id maps possible for a user namespace
> > > 
> > > Have we already ruled out storing the container's UID/GID/perms
> > > in an extended attribute, and having all the files owned by the
> > > owner of the container from the perspective of the unshifted
> > > fs.  Then shiftfs reads the xattr and presents the files with the
> > > container's idea of what the UID is?
> > 
> > I've got an experimental patch set that does the *mark* as an
> > xattr. 
> 
> I forgot to ask you about this when we all met face-to-face -- can
> you go over what the purpose of marking the mounts before being able
> to shifts is? When I saw your demo at LPC I was quite confused about
> what it was doing (I think you mentioned it was a security feature,
> but I must admit I didn't follow the explanation).

OK, so the basic security problem is that an unprivileged tenant cannot
be allowed arbitrary access to both the shifted and underlying
unshifted locations because they can do writes to the shifted mount
that appear at real uid/gid 0 in the underlying unshifted location,
setting up all sorts of unpleasant threats of which suid execution is
just the most obvious one.

My mount marking solution, which the v2 (and forthcoming v3) has is the
idea that the admin buries the real underlying location deep in a path
inaccessible (to the tenant) part of the filesystem and then exposes a
marked mount point to the tenant by doing

mount -t shiftfs -o mark <underlying location> <tenant visible>

Then in the <tenant visible> location we can block the potential
exploits.  When the tenant is building an unprivileged container, it
can do

mount -t shiftfs <tenant visible> <container location>

And the <container location> will now have the shifting in place.

This scheme is ephemeral (the marked mount has to be recreated on every
boot) and rather complex, so the alternative is to add a permanent mark
to <underlying location> so that regular tenant access can be secured
(or even prohibited) but the tenant can still do

mount -t shiftfs <underlying location> <container location>

To get the shifting properties in the container.  In this version of
the scheme, the shift mountable directory is marked with a security
xattr that is permanent (survives reboot) but requires that the
filesystem support xattrs, of course.

The down side of the xattr scheme is that the securing against the
tenant part becomes an xattr enforced thing rather than a shiftfs
enforced thing, so it has to be an additional patch to the kernel
itself rather than being inside a self contained module.

James

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 16:35     ` James Bottomley
@ 2018-06-15 20:17       ` Seth Forshee
  0 siblings, 0 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 20:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: containers, linux-fsdevel, Tyler Hicks, Christian Brauner

On Fri, Jun 15, 2018 at 09:35:49AM -0700, James Bottomley wrote:
> On Fri, 2018-06-15 at 10:46 -0500, Seth Forshee wrote:
> [...]
> > Subject: [PATCH] shiftfs: Forward port to 4.17-rc6
> > 
> > Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> > ---
> >  fs/shiftfs.c | 17 +++++++++--------
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index ea8ac57b3ce1..fa0797bd7880 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -109,12 +109,12 @@ static void shiftfs_d_release(struct dentry
> > *dentry)
> >  
> >  static struct dentry *shiftfs_d_real(struct dentry *dentry,
> >  				     const struct inode *inode,
> > -				     unsigned int flags)
> > +				     unsigned int open_flag
> 
> Well, that's an oopsie: I already have this change in my code base.  I
> think I probably did it at the last minute for a conference
> presentation (the log indicates it was done in the 4.14 timeframe) and
> then forgot to repost the patches, which would explain why it's still
> nicely compiling for me ...
> 
> I can repost so we've got code to hang the discussion on.

That would be great, thanks!

Seth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 17:22           ` James Bottomley
@ 2018-06-15 20:47             ` Seth Forshee
  2018-06-15 21:09               ` James Bottomley
  0 siblings, 1 reply; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 20:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner,
	Tyler Hicks, linux-fsdevel

On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote:
> On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote:
> > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership.com
> > > wrote:
> > > > >  - Supports any id maps possible for a user namespace
> > > > 
> > > > Have we already ruled out storing the container's UID/GID/perms
> > > > in an extended attribute, and having all the files owned by the
> > > > owner of the container from the perspective of the unshifted
> > > > fs.  Then shiftfs reads the xattr and presents the files with the
> > > > container's idea of what the UID is?
> > > 
> > > I've got an experimental patch set that does the *mark* as an
> > > xattr. 
> > 
> > I forgot to ask you about this when we all met face-to-face -- can
> > you go over what the purpose of marking the mounts before being able
> > to shifts is? When I saw your demo at LPC I was quite confused about
> > what it was doing (I think you mentioned it was a security feature,
> > but I must admit I didn't follow the explanation).
> 
> OK, so the basic security problem is that an unprivileged tenant cannot
> be allowed arbitrary access to both the shifted and underlying
> unshifted locations because they can do writes to the shifted mount
> that appear at real uid/gid 0 in the underlying unshifted location,
> setting up all sorts of unpleasant threats of which suid execution is
> just the most obvious one.
> 
> My mount marking solution, which the v2 (and forthcoming v3) has is the
> idea that the admin buries the real underlying location deep in a path
> inaccessible (to the tenant) part of the filesystem and then exposes a
> marked mount point to the tenant by doing
> 
> mount -t shiftfs -o mark <underlying location> <tenant visible>
> 
> Then in the <tenant visible> location we can block the potential
> exploits.  When the tenant is building an unprivileged container, it
> can do
> 
> mount -t shiftfs <tenant visible> <container location>
> 
> And the <container location> will now have the shifting in place.

More generally, we can't allow an unprivileged user ns to mount any
subtree with an id shift unless the context that controls that subtree
(i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it would be a
simple matter for any user to create a user ns and make an id shifted
mount of /. The marking in shiftfs is one way of solving this problem.

I don't know if you saw my comments about marking earlier in the thread.
Tl;dr, I think that the new mount apis in the filesystem context patches
could allow an alternative to marking. I think we should be able to
arrange it so that the "host" context sets up a mount fd for shiftfs
mounting a sepecific subtree then passes that fd into the container. The
container can then use the fd to attach the mount to its filesystem
tree. This will provide all the benefits of marking without that awkward
intermediate mount point.

Of course those patches haven't been merged yet, but based on the
discussion I've seen their prospects look good.

> This scheme is ephemeral (the marked mount has to be recreated on every
> boot) and rather complex, so the alternative is to add a permanent mark
> to <underlying location> so that regular tenant access can be secured
> (or even prohibited) but the tenant can still do
> 
> mount -t shiftfs <underlying location> <container location>

This of course would not be possible in my proposed mount fd scheme.

> To get the shifting properties in the container.  In this version of
> the scheme, the shift mountable directory is marked with a security
> xattr that is permanent (survives reboot) but requires that the
> filesystem support xattrs, of course.
> 
> The down side of the xattr scheme is that the securing against the
> tenant part becomes an xattr enforced thing rather than a shiftfs
> enforced thing, so it has to be an additional patch to the kernel
> itself rather than being inside a self contained module.

Would this work for nested containers? I guess it should be fine to
allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so probably
so.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 20:47             ` Seth Forshee
@ 2018-06-15 21:09               ` James Bottomley
  2018-06-15 21:35                 ` Seth Forshee
  0 siblings, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-15 21:09 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner,
	Tyler Hicks, linux-fsdevel

On Fri, 2018-06-15 at 15:47 -0500, Seth Forshee wrote:
> On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote:
> > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote:
> > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership
> > > .com
> > > > wrote:
> > > > > >  - Supports any id maps possible for a user namespace
> > > > > 
> > > > > Have we already ruled out storing the container's
> > > > > UID/GID/perms in an extended attribute, and having all the
> > > > > files owned by the owner of the container from the
> > > > > perspective of the unshifted fs.  Then shiftfs reads the
> > > > > xattr and presents the files with the container's idea of
> > > > > what the UID is?
> > > > 
> > > > I've got an experimental patch set that does the *mark* as an
> > > > xattr. 
> > > 
> > > I forgot to ask you about this when we all met face-to-face --
> > > can you go over what the purpose of marking the mounts before
> > > being able to shifts is? When I saw your demo at LPC I was quite
> > > confused about what it was doing (I think you mentioned it was a
> > > security feature, but I must admit I didn't follow the
> > > explanation).
> > 
> > OK, so the basic security problem is that an unprivileged tenant
> > cannot be allowed arbitrary access to both the shifted and
> > underlying unshifted locations because they can do writes to the
> > shifted mount that appear at real uid/gid 0 in the underlying
> > unshifted location, setting up all sorts of unpleasant threats of
> > which suid execution is just the most obvious one.
> > 
> > My mount marking solution, which the v2 (and forthcoming v3) has is
> > the idea that the admin buries the real underlying location deep in
> > a path inaccessible (to the tenant) part of the filesystem and then
> > exposes a marked mount point to the tenant by doing
> > 
> > mount -t shiftfs -o mark <underlying location> <tenant visible>
> > 
> > Then in the <tenant visible> location we can block the potential
> > exploits.  When the tenant is building an unprivileged container,
> > it can do
> > 
> > mount -t shiftfs <tenant visible> <container location>
> > 
> > And the <container location> will now have the shifting in place.
> 
> More generally, we can't allow an unprivileged user ns to mount any
> subtree with an id shift unless the context that controls that
> subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it
> would be a simple matter for any user to create a user ns and make an
> id shifted mount of /. The marking in shiftfs is one way of solving
> this problem.
> 
> I don't know if you saw my comments about marking earlier in the
> thread. Tl;dr, I think that the new mount apis in the filesystem
> context patches could allow an alternative to marking.

Yes, I read it an I have a draft reply except that I need to
investigate the new mount API.  My suspicion is that it won't work over
standard mount(8) so now we'll need a tool to set it up.

The other point is that it's a temporary mark (again) so it has to be
renewed over every reboot.  The impression I picked up was this was the
least liked part of the current marking scheme (well, possibly first
equally disliked with the complexity) so the xattr (or some other)
scheme might be better received.

>  I think we should be able to arrange it so that the "host" context
> sets up a mount fd for shiftfs mounting a sepecific subtree then
> passes that fd into the container. The container can then use the fd
> to attach the mount to its filesystem tree. This will provide all the
> benefits of marking without that awkward intermediate mount point.

Doesn't this mean that whatever sits on the host end has to be
privileged enough to open the fd in the first place, meaning either a
suid/CAP_SYS_<something> binary or root itself?

> Of course those patches haven't been merged yet, but based on the
> discussion I've seen their prospects look good.
> 
> > This scheme is ephemeral (the marked mount has to be recreated on
> > every boot) and rather complex, so the alternative is to add a
> > permanent mark to <underlying location> so that regular tenant
> > access can be secured (or even prohibited) but the tenant can still
> > do
> > 
> > mount -t shiftfs <underlying location> <container location>
> 
> This of course would not be possible in my proposed mount fd scheme.
> 
> > To get the shifting properties in the container.  In this version
> > of the scheme, the shift mountable directory is marked with a
> > security xattr that is permanent (survives reboot) but requires
> > that the filesystem support xattrs, of course.
> > 
> > The down side of the xattr scheme is that the securing against the
> > tenant part becomes an xattr enforced thing rather than a shiftfs
> > enforced thing, so it has to be an additional patch to the kernel
> > itself rather than being inside a self contained module.
> 
> Would this work for nested containers? I guess it should be fine to
> allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so
> probably so.

Yes, the s_user_ns is simply a copy of the user_ns in operation when
the mount was created, so if user_ns is nested then s_user_ns will be a
reverse nesting.  This should work independently of the marking scheme.

I'm just in the middle of reposting.  The cc list is getting unwieldy;
is it OK if I the fsdevel, kernel, security and containers lists and
drop everyone except for a couple of containers people who aren't on
kernel lists?

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 21:09               ` James Bottomley
@ 2018-06-15 21:35                 ` Seth Forshee
  0 siblings, 0 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-15 21:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Aleksa Sarai, containers, Matthew Wilcox, Christian Brauner,
	Tyler Hicks, linux-fsdevel

On Fri, Jun 15, 2018 at 02:09:52PM -0700, James Bottomley wrote:
> On Fri, 2018-06-15 at 15:47 -0500, Seth Forshee wrote:
> > On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote:
> > > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote:
> > > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership
> > > > .com
> > > > > wrote:
> > > > > > >  - Supports any id maps possible for a user namespace
> > > > > > 
> > > > > > Have we already ruled out storing the container's
> > > > > > UID/GID/perms in an extended attribute, and having all the
> > > > > > files owned by the owner of the container from the
> > > > > > perspective of the unshifted fs.  Then shiftfs reads the
> > > > > > xattr and presents the files with the container's idea of
> > > > > > what the UID is?
> > > > > 
> > > > > I've got an experimental patch set that does the *mark* as an
> > > > > xattr. 
> > > > 
> > > > I forgot to ask you about this when we all met face-to-face --
> > > > can you go over what the purpose of marking the mounts before
> > > > being able to shifts is? When I saw your demo at LPC I was quite
> > > > confused about what it was doing (I think you mentioned it was a
> > > > security feature, but I must admit I didn't follow the
> > > > explanation).
> > > 
> > > OK, so the basic security problem is that an unprivileged tenant
> > > cannot be allowed arbitrary access to both the shifted and
> > > underlying unshifted locations because they can do writes to the
> > > shifted mount that appear at real uid/gid 0 in the underlying
> > > unshifted location, setting up all sorts of unpleasant threats of
> > > which suid execution is just the most obvious one.
> > > 
> > > My mount marking solution, which the v2 (and forthcoming v3) has is
> > > the idea that the admin buries the real underlying location deep in
> > > a path inaccessible (to the tenant) part of the filesystem and then
> > > exposes a marked mount point to the tenant by doing
> > > 
> > > mount -t shiftfs -o mark <underlying location> <tenant visible>
> > > 
> > > Then in the <tenant visible> location we can block the potential
> > > exploits.  When the tenant is building an unprivileged container,
> > > it can do
> > > 
> > > mount -t shiftfs <tenant visible> <container location>
> > > 
> > > And the <container location> will now have the shifting in place.
> > 
> > More generally, we can't allow an unprivileged user ns to mount any
> > subtree with an id shift unless the context that controls that
> > subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it
> > would be a simple matter for any user to create a user ns and make an
> > id shifted mount of /. The marking in shiftfs is one way of solving
> > this problem.
> > 
> > I don't know if you saw my comments about marking earlier in the
> > thread. Tl;dr, I think that the new mount apis in the filesystem
> > context patches could allow an alternative to marking.
> 
> Yes, I read it an I have a draft reply except that I need to
> investigate the new mount API.  My suspicion is that it won't work over
> standard mount(8) so now we'll need a tool to set it up.
> 
> The other point is that it's a temporary mark (again) so it has to be
> renewed over every reboot.  The impression I picked up was this was the
> least liked part of the current marking scheme (well, possibly first
> equally disliked with the complexity) so the xattr (or some other)
> scheme might be better received.

>From my perspective the primary benefit of a permanent mark is that the
kernel could implement some protections for marked subtrees, and
presumably we could do away with the intermediate mount. That sounds
much nicer than the current marking scheme, imo.

I do understand that for others less admin-side setup work is also a
benefit.

> >  I think we should be able to arrange it so that the "host" context
> > sets up a mount fd for shiftfs mounting a sepecific subtree then
> > passes that fd into the container. The container can then use the fd
> > to attach the mount to its filesystem tree. This will provide all the
> > benefits of marking without that awkward intermediate mount point.
> 
> Doesn't this mean that whatever sits on the host end has to be
> privileged enough to open the fd in the first place, meaning either a
> suid/CAP_SYS_<something> binary or root itself?

Yes to the "something privieleged enough" question. It should not be a
suid/setcap executable; an unprivileged user must not be able to pick
what paths are eligible for id shifting. That has to be an admin
operation.

> > Of course those patches haven't been merged yet, but based on the
> > discussion I've seen their prospects look good.
> > 
> > > This scheme is ephemeral (the marked mount has to be recreated on
> > > every boot) and rather complex, so the alternative is to add a
> > > permanent mark to <underlying location> so that regular tenant
> > > access can be secured (or even prohibited) but the tenant can still
> > > do
> > > 
> > > mount -t shiftfs <underlying location> <container location>
> > 
> > This of course would not be possible in my proposed mount fd scheme.
> > 
> > > To get the shifting properties in the container.  In this version
> > > of the scheme, the shift mountable directory is marked with a
> > > security xattr that is permanent (survives reboot) but requires
> > > that the filesystem support xattrs, of course.
> > > 
> > > The down side of the xattr scheme is that the securing against the
> > > tenant part becomes an xattr enforced thing rather than a shiftfs
> > > enforced thing, so it has to be an additional patch to the kernel
> > > itself rather than being inside a self contained module.
> > 
> > Would this work for nested containers? I guess it should be fine to
> > allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so
> > probably so.
> 
> Yes, the s_user_ns is simply a copy of the user_ns in operation when
> the mount was created, so if user_ns is nested then s_user_ns will be a
> reverse nesting.  This should work independently of the marking scheme.
> 
> I'm just in the middle of reposting.  The cc list is getting unwieldy;
> is it OK if I the fsdevel, kernel, security and containers lists and
> drop everyone except for a couple of containers people who aren't on
> kernel lists?

You don't have to include me on the Cc.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-15 14:59   ` Seth Forshee
  2018-06-15 15:25     ` Matthew Wilcox
@ 2018-06-16  3:03     ` James Bottomley
  2018-06-18 13:40       ` Seth Forshee
  1 sibling, 1 reply; 33+ messages in thread
From: James Bottomley @ 2018-06-16  3:03 UTC (permalink / raw)
  To: Seth Forshee, Serge E. Hallyn
  Cc: linux-fsdevel, containers, Tyler Hicks, Christian Brauner

On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > Quoting Seth Forshee (seth.forshee@canonical.com):
> > > I wanted to inquire about the current status of shiftfs and the
> > > plans for it moving forward. We'd like to have this functionality
> > > available for use in lxd, and I'm interesetd in helping with
> > > development (or picking up development if it's stalled).
> > > 
> > > To start, is anyone still working on shiftfs or similar
> > > functionality? I haven't found it in any git tree on kernel.org,
> > > and as far as mailing list activity the last submission I can
> > > find is [1]. Is there anything newer than this?
> > > 
> > > Based on past mailing list discussions, it seems like there was
> > > still debate as to whether this feature should be an overlay
> > > filesystem or something supported at the vfs level. Was this ever
> > > resolved?
> > > 
> > > Thanks,
> > > Seth
> > > 
> > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn
> > > ership.com
> > 
> > Hey Seth,
> > 
> > I haven't heard anything in a long time.  But if this is going to
> > pick back up, can we come up with a detailed set of goals and
> > requirements?

That would actually help.

> I was planning to follow up later with some discussion of
> requirements.
> Here are some of ours:
> 
>  - Supports any id maps possible for a user namespace

Could you clarify: right at the moment, it basically reverses the
namespace ID mapping when it does on to the filesystem using the
superblock user namespace, so, in theory you can have an arbitrary
mapping simply by changing the s_userns.  The problem here is that you
don't have a lot of tools for manipulating the s_userns.

>  - Does not break inotify

I don't expect it does, but I haven't checked.

>  - Passes accurate disk usage and source information from the
> "underlay"

mounts of this type don't currently show up in df

>  - Works with a variety of filesystems (ext4, xfx, btrfs, etc.)

yes

>  - Works with nested containers

yes

> I'm also interested in collecting any requirements others might have.
> 
> > I don't recall whether the last version still worked like this, but
> > I'm still not comfortable with the idea of a system where after a
> > reboot, container-created root-owned files are owned by host root
> > until a path is specially marked.  Enforcing that the "source"
> > directory is itself uid-shifted would greatly ease my mind.

And I believe we're discussing everything below in a different
subthread.

James


> I understand the concern and share the discomfort to some degree, but
> I'm not convinced that requiring the source subtree be shifted is the
> right approach.
> 
> First, let's address the marking question. As you stated, an approach
> that leaves the subree unmarked for a period of time is problematic,
> and imo this is a fatal flaw with marking as a protection for e.g.
> execing some suid root file written by a container. Writing some such
> mark to the filesystem would make it persistent, but it could also
> limit the support to a limited set of filesystems.
> 
> However, I do think it's necessary for a user with sufficient
> capabilities to "bless" a subtree for mounting in a less privileged
> context, so this is a feature of marking that I would like to keep. I
> think the new mount apis in David Howells' filesystem context patches
> [1] might give us a nicer way to do this. For example, root in
> init_user_ns could set up a mount fd which specifies the source
> subtree for the id shift. At that time the kernel could check for
> ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem
> containing the source subtree. Then the fd could be passed to a
> container in a user namespace, who could use it to attach the mount
> to its filesystem tree.  The same concept could be extended to nested
> containers, as long as the user setting the source subtree has
> CAP_SYS_ADMIN towards sb->s_user_ns for the subtree.
> 
> Now back to reuiring the srouce subtree be id shifted. I understand
> the motivation for wanting this, but I'm not sure I'm in favor of it.
> To start, there are other ways to ensure that id shifted mounts don't
> lead to problems, such as putting the subtree under a directory
> accessible only by root or putting it in a nosuid or noexec mount.
> For some implementations those sorts of protections are going to make
> sense.
> 
> Having this requirement may also add significant time to mounting, as
> I assume it would involve iterating through all filesystem objects.
> 
> Additionally, that requirement is likely to significantly complicate
> the implementation. The simplest implementation would just translate
> the k[ug]ids in the inodes to a target user ns. A slightly more
> complicated approach might translate them based on a source and
> destination user ns. If it's implemented based on passing in an
> arbitrary id map at mount time it will be more complex and duplicate
> functionality that user namespaces already give us.
> 
> Thanks,
> Seth
> 
> [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.st
> git@warthog.procyon.org.uk
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-16  3:03     ` James Bottomley
@ 2018-06-18 13:40       ` Seth Forshee
  2018-06-18 13:49         ` Amir Goldstein
  2018-06-18 14:56         ` James Bottomley
  0 siblings, 2 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-18 13:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks,
	Christian Brauner

On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > Quoting Seth Forshee (seth.forshee@canonical.com):
> > > > I wanted to inquire about the current status of shiftfs and the
> > > > plans for it moving forward. We'd like to have this functionality
> > > > available for use in lxd, and I'm interesetd in helping with
> > > > development (or picking up development if it's stalled).
> > > > 
> > > > To start, is anyone still working on shiftfs or similar
> > > > functionality? I haven't found it in any git tree on kernel.org,
> > > > and as far as mailing list activity the last submission I can
> > > > find is [1]. Is there anything newer than this?
> > > > 
> > > > Based on past mailing list discussions, it seems like there was
> > > > still debate as to whether this feature should be an overlay
> > > > filesystem or something supported at the vfs level. Was this ever
> > > > resolved?
> > > > 
> > > > Thanks,
> > > > Seth
> > > > 
> > > > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn
> > > > ership.com
> > > 
> > > Hey Seth,
> > > 
> > > I haven't heard anything in a long time.  But if this is going to
> > > pick back up, can we come up with a detailed set of goals and
> > > requirements?
> 
> That would actually help.
> 
> > I was planning to follow up later with some discussion of
> > requirements.
> > Here are some of ours:
> > 
> >  - Supports any id maps possible for a user namespace
> 
> Could you clarify: right at the moment, it basically reverses the
> namespace ID mapping when it does on to the filesystem using the
> superblock user namespace, so, in theory you can have an arbitrary
> mapping simply by changing the s_userns.  The problem here is that you
> don't have a lot of tools for manipulating the s_userns.

For our purposes the way you're shifting with s_user_ns works fine. I
know that Serge would prefer a more arbitrary shift so that an
arbitrary, unprivileged range in the source fs could be use (e.g. use
ids 100000 - 101000 in the source instead of 0 - 1000), and my thoughts
on that are quoted below.

> >  - Does not break inotify
> 
> I don't expect it does, but I haven't checked.

I haven't checked either; I'm planning to do so soon. This is a concern
that was expressed to me by others, I think because inotify doesn't work
with overlayfs.

> >  - Passes accurate disk usage and source information from the
> > "underlay"
> 
> mounts of this type don't currently show up in df
> 
> >  - Works with a variety of filesystems (ext4, xfx, btrfs, etc.)
> 
> yes
> 
> >  - Works with nested containers
> 
> yes

I'd say not so much:

        /* to mark a mount point, must be real root */
        if (ssi->mark && !capable(CAP_SYS_ADMIN))
                goto out;

So within a container I cannot mark a range to be shiftfs-mountable
within a container I create. I'd argue that as long as a user has
CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it should
be safe to allow this as it implies privleges wrt all ids found in the
source mount. This will likely lead to stacked shiftfs mounts, not sure
yet whether or not this works in the current code.

> > I'm also interested in collecting any requirements others might have.
> > 
> > > I don't recall whether the last version still worked like this, but
> > > I'm still not comfortable with the idea of a system where after a
> > > reboot, container-created root-owned files are owned by host root
> > > until a path is specially marked.  Enforcing that the "source"
> > > directory is itself uid-shifted would greatly ease my mind.
> 
> And I believe we're discussing everything below in a different
> subthread.
> 
> James
> 
> 
> > I understand the concern and share the discomfort to some degree, but
> > I'm not convinced that requiring the source subtree be shifted is the
> > right approach.
> > 
> > First, let's address the marking question. As you stated, an approach
> > that leaves the subree unmarked for a period of time is problematic,
> > and imo this is a fatal flaw with marking as a protection for e.g.
> > execing some suid root file written by a container. Writing some such
> > mark to the filesystem would make it persistent, but it could also
> > limit the support to a limited set of filesystems.
> > 
> > However, I do think it's necessary for a user with sufficient
> > capabilities to "bless" a subtree for mounting in a less privileged
> > context, so this is a feature of marking that I would like to keep. I
> > think the new mount apis in David Howells' filesystem context patches
> > [1] might give us a nicer way to do this. For example, root in
> > init_user_ns could set up a mount fd which specifies the source
> > subtree for the id shift. At that time the kernel could check for
> > ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem
> > containing the source subtree. Then the fd could be passed to a
> > container in a user namespace, who could use it to attach the mount
> > to its filesystem tree.  The same concept could be extended to nested
> > containers, as long as the user setting the source subtree has
> > CAP_SYS_ADMIN towards sb->s_user_ns for the subtree.
> > 
> > Now back to reuiring the srouce subtree be id shifted. I understand
> > the motivation for wanting this, but I'm not sure I'm in favor of it.
> > To start, there are other ways to ensure that id shifted mounts don't
> > lead to problems, such as putting the subtree under a directory
> > accessible only by root or putting it in a nosuid or noexec mount.
> > For some implementations those sorts of protections are going to make
> > sense.
> > 
> > Having this requirement may also add significant time to mounting, as
> > I assume it would involve iterating through all filesystem objects.
> > 
> > Additionally, that requirement is likely to significantly complicate
> > the implementation. The simplest implementation would just translate
> > the k[ug]ids in the inodes to a target user ns. A slightly more
> > complicated approach might translate them based on a source and
> > destination user ns. If it's implemented based on passing in an
> > arbitrary id map at mount time it will be more complex and duplicate
> > functionality that user namespaces already give us.
> > 
> > Thanks,
> > Seth
> > 
> > [1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.st
> > git@warthog.procyon.org.uk
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 13:40       ` Seth Forshee
@ 2018-06-18 13:49         ` Amir Goldstein
  2018-06-18 14:56         ` James Bottomley
  1 sibling, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2018-06-18 13:49 UTC (permalink / raw)
  To: Seth Forshee
  Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner

On Mon, Jun 18, 2018 at 4:40 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:

>> >  - Does not break inotify
>>
>> I don't expect it does, but I haven't checked.
>
> I haven't checked either; I'm planning to do so soon. This is a concern
> that was expressed to me by others, I think because inotify doesn't work
> with overlayfs.
>

Should be working these days except for corner cases
(copy up of hardlink).
There are two regression tests in LTP:
https://github.com/linux-test-project/ltp/pull/246

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 13:40       ` Seth Forshee
  2018-06-18 13:49         ` Amir Goldstein
@ 2018-06-18 14:56         ` James Bottomley
  2018-06-18 16:03           ` Seth Forshee
  2018-06-18 17:11           ` Amir Goldstein
  1 sibling, 2 replies; 33+ messages in thread
From: James Bottomley @ 2018-06-18 14:56 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks,
	Christian Brauner

On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote:
> On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > > Quoting Seth Forshee (seth.forshee@canonical.com):
> > > > > I wanted to inquire about the current status of shiftfs and
> > > > > the plans for it moving forward. We'd like to have this
> > > > > functionality available for use in lxd, and I'm interesetd in
> > > > > helping with development (or picking up development if it's
> > > > > stalled).
> > > > > 
> > > > > To start, is anyone still working on shiftfs or similar
> > > > > functionality? I haven't found it in any git tree on
> > > > > kernel.org, and as far as mailing list activity the last
> > > > > submission I can find is [1]. Is there anything newer than
> > > > > this?
> > > > > 
> > > > > Based on past mailing list discussions, it seems like there
> > > > > was still debate as to whether this feature should be an
> > > > > overlay filesystem or something supported at the vfs level.
> > > > > Was this ever resolved?
> > > > > 
> > > > > Thanks,
> > > > > Seth
> > > > > 
> > > > > [1]
> > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn
> > > > > ership.com
> > > > 
> > > > Hey Seth,
> > > > 
> > > > I haven't heard anything in a long time.  But if this is going
> > > > to pick back up, can we come up with a detailed set of goals
> > > > and requirements?
> > 
> > That would actually help.
> > 
> > > I was planning to follow up later with some discussion of
> > > requirements. Here are some of ours:
> > > 
> > >  - Supports any id maps possible for a user namespace
> > 
> > Could you clarify: right at the moment, it basically reverses the
> > namespace ID mapping when it does on to the filesystem using the
> > superblock user namespace, so, in theory you can have an arbitrary
> > mapping simply by changing the s_userns.  The problem here is that
> > you don't have a lot of tools for manipulating the s_userns.
> 
> For our purposes the way you're shifting with s_user_ns works fine. I
> know that Serge would prefer a more arbitrary shift so that an
> arbitrary, unprivileged range in the source fs could be use (e.g. use
> ids 100000 - 101000 in the source instead of 0 - 1000), and my
> thoughts on that are quoted below.

The original (v1) shiftfs did simply take a range of ids to shift as an
argument.  However, that one could only be set up by root and Eric
expressed a desire that it use the s_user_ns.

> > >  - Does not break inotify
> > 
> > I don't expect it does, but I haven't checked.
> 
> I haven't checked either; I'm planning to do so soon. This is a
> concern that was expressed to me by others, I think because inotify
> doesn't work with overlayfs.

I think shiftfs does work simply because it doesn't really do overlays,
so lots of stuff that doesn't work with overlays does work with it.

> > >  - Passes accurate disk usage and source information from the
> > > "underlay"
> > 
> > mounts of this type don't currently show up in df
> > 
> > >  - Works with a variety of filesystems (ext4, xfx, btrfs, etc.)
> > 
> > yes
> > 
> > >  - Works with nested containers
> > 
> > yes
> 
> I'd say not so much:
> 
>         /* to mark a mount point, must be real root */
>         if (ssi->mark && !capable(CAP_SYS_ADMIN))
>                 goto out;
> 
> So within a container I cannot mark a range to be shiftfs-mountable
> within a container I create. I'd argue that as long as a user has
> CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it
> should be safe to allow this as it implies privleges wrt all ids
> found in the source mount. This will likely lead to stacked shiftfs
> mounts, not sure yet whether or not this works in the current code.

Um, I think we have different definitions of "works with nested
containers".  Recall that for a nested container the s_user_ns is also
nested, so we shift all the way back to the uid in the root.  That
means if the check for marking is not capable(CAP_SYS_ADMIN) then an
unprivileged user would be able to gain root write access by setting up
a nested shift.  If your definition of nested means we only shift back
one level of user_ns nesting then this could become ns_capable(), so I
think we need to add "what is the desired nesting behaviour?" to the
questions to be answered by the requirements.

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 14:56         ` James Bottomley
@ 2018-06-18 16:03           ` Seth Forshee
  2018-06-18 17:11           ` Amir Goldstein
  1 sibling, 0 replies; 33+ messages in thread
From: Seth Forshee @ 2018-06-18 16:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge E. Hallyn, linux-fsdevel, containers, Tyler Hicks,
	Christian Brauner

On Mon, Jun 18, 2018 at 07:56:59AM -0700, James Bottomley wrote:
> On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote:
> > On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> > > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > > > Quoting Seth Forshee (seth.forshee@canonical.com):
> > > > > > I wanted to inquire about the current status of shiftfs and
> > > > > > the plans for it moving forward. We'd like to have this
> > > > > > functionality available for use in lxd, and I'm interesetd in
> > > > > > helping with development (or picking up development if it's
> > > > > > stalled).
> > > > > > 
> > > > > > To start, is anyone still working on shiftfs or similar
> > > > > > functionality? I haven't found it in any git tree on
> > > > > > kernel.org, and as far as mailing list activity the last
> > > > > > submission I can find is [1]. Is there anything newer than
> > > > > > this?
> > > > > > 
> > > > > > Based on past mailing list discussions, it seems like there
> > > > > > was still debate as to whether this feature should be an
> > > > > > overlay filesystem or something supported at the vfs level.
> > > > > > Was this ever resolved?
> > > > > > 
> > > > > > Thanks,
> > > > > > Seth
> > > > > > 
> > > > > > [1]
> > > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartn
> > > > > > ership.com
> > > > > 
> > > > > Hey Seth,
> > > > > 
> > > > > I haven't heard anything in a long time.  But if this is going
> > > > > to pick back up, can we come up with a detailed set of goals
> > > > > and requirements?
> > > 
> > > That would actually help.
> > > 
> > > > I was planning to follow up later with some discussion of
> > > > requirements. Here are some of ours:
> > > > 
> > > >  - Supports any id maps possible for a user namespace
> > > 
> > > Could you clarify: right at the moment, it basically reverses the
> > > namespace ID mapping when it does on to the filesystem using the
> > > superblock user namespace, so, in theory you can have an arbitrary
> > > mapping simply by changing the s_userns.  The problem here is that
> > > you don't have a lot of tools for manipulating the s_userns.
> > 
> > For our purposes the way you're shifting with s_user_ns works fine. I
> > know that Serge would prefer a more arbitrary shift so that an
> > arbitrary, unprivileged range in the source fs could be use (e.g. use
> > ids 100000 - 101000 in the source instead of 0 - 1000), and my
> > thoughts on that are quoted below.
> 
> The original (v1) shiftfs did simply take a range of ids to shift as an
> argument.  However, that one could only be set up by root and Eric
> expressed a desire that it use the s_user_ns.

I like using s_user_ns too, just pointing out that it does likely
preclude using a shifted source.

> > > >  - Does not break inotify
> > > 
> > > I don't expect it does, but I haven't checked.
> > 
> > I haven't checked either; I'm planning to do so soon. This is a
> > concern that was expressed to me by others, I think because inotify
> > doesn't work with overlayfs.
> 
> I think shiftfs does work simply because it doesn't really do overlays,
> so lots of stuff that doesn't work with overlays does work with it.
> 
> > > >  - Passes accurate disk usage and source information from the
> > > > "underlay"
> > > 
> > > mounts of this type don't currently show up in df
> > > 
> > > >  - Works with a variety of filesystems (ext4, xfx, btrfs, etc.)
> > > 
> > > yes
> > > 
> > > >  - Works with nested containers
> > > 
> > > yes
> > 
> > I'd say not so much:
> > 
> >         /* to mark a mount point, must be real root */
> >         if (ssi->mark && !capable(CAP_SYS_ADMIN))
> >                 goto out;
> > 
> > So within a container I cannot mark a range to be shiftfs-mountable
> > within a container I create. I'd argue that as long as a user has
> > CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it
> > should be safe to allow this as it implies privleges wrt all ids
> > found in the source mount. This will likely lead to stacked shiftfs
> > mounts, not sure yet whether or not this works in the current code.
> 
> Um, I think we have different definitions of "works with nested
> containers".

Ultimately what I mean is that it should work the same way in a
container as in the host, given that the container has the necessary
capabilities towards the source subtree that it wants to id shift. So my
container should be able to mark a subtree over which I have
ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) and then create a container
that can shiftfs-mount that subtree.

> Recall that for a nested container the s_user_ns is also
> nested, so we shift all the way back to the uid in the root.  That
> means if the check for marking is not capable(CAP_SYS_ADMIN) then an
> unprivileged user would be able to gain root write access by setting up
> a nested shift.

This is true. However, real root already delegated the ability to write
as root within a subtree to the container by marking that subtree.
Since that container can already write as root to that subtree, what
problem is created by letting it create a nested container that can also
write to that subtree as root? Either the host already set things up so
that writes as root to that subtree are not an issue, or it didn't and
you already have a problem.

For non-shiftfs filesystems, inodes from the filesystem cannot have any
ownership by ids not in s_user_ns, so the nested container could not
write any ids not already under the control of the first-level
container.

> If your definition of nested means we only shift back
> one level of user_ns nesting then this could become ns_capable(), so I
> think we need to add "what is the desired nesting behaviour?" to the
> questions to be answered by the requirements.

No, I think in the shiftfs-over-shiftfs use case it does shift all the
way back to the host. My argument is that this isn't actually a problem.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 14:56         ` James Bottomley
  2018-06-18 16:03           ` Seth Forshee
@ 2018-06-18 17:11           ` Amir Goldstein
  2018-06-18 19:53             ` Phil Estes
                               ` (3 more replies)
  1 sibling, 4 replies; 33+ messages in thread
From: Amir Goldstein @ 2018-06-18 17:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner

On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
[...]
>> > >  - Does not break inotify
>> >
>> > I don't expect it does, but I haven't checked.
>>
>> I haven't checked either; I'm planning to do so soon. This is a
>> concern that was expressed to me by others, I think because inotify
>> doesn't work with overlayfs.
>
> I think shiftfs does work simply because it doesn't really do overlays,
> so lots of stuff that doesn't work with overlays does work with it.
>

I'm afraid shiftfs suffers from the same problems that the old naiive
overlayfs inode implementation suffered from.

This problem is demonstrated with LTP tests inotify08 inotify09.
shiftfs_new_inode() is called on every lookup, so inotify watch
may be set on an inode object, then dentry is evicted from cache
and then all events on new dentry are not reported on the watched
inode. You will need to implement hashed inodes to solve it.
Can be done as overlay does - hashing by real inode pointer.

This is just one of those subtle things about stacked fs and there may
be other in present and more in future - if we don't have a shared code
base for the two stacked fs, I wager you are going to end up "cherry
picking" fixes often.

IMO, an important question to ask is, since both shiftfs and overlayfs
are strongly coupled with container use cases, are there users that
are interested in both layering AND shifting? on the same "mark"?
If the answer is yes, then this may be an argument in favor of
integrating at least some of shittfs functionality into overlayfs.

Another argument is that shiftfs itself takes the maximum allowed
2 levels of s_stack_depth for it's 2 mounts, so it is actually not
possible with current VFS limitation to combine shiftfs with overlayfs.

This could be solved relatively easily by adding "-o mark" support
to overlayfs and allowing to mount shiftfs also over "marked" overlayfs
inside container.

Anyway, I'm just playing devil's advocate to the idea of two stacked fs
implementation, so presenting this point of view. I am fully aware that
there are also plenty of disadvantages to couple two unrelated
functionalities together.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 17:11           ` Amir Goldstein
@ 2018-06-18 19:53             ` Phil Estes
  2018-06-21 20:16             ` Seth Forshee
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Phil Estes @ 2018-06-18 19:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: James Bottomley, linux-fsdevel, Seth Forshee, Linux Containers,
	Tyler Hicks, Christian Brauner

Amir Goldstein wrote on 6/18/18 1:11 PM:
> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> [...]
>>>>>   - Does not break inotify
>>>> I don't expect it does, but I haven't checked.
>>> I haven't checked either; I'm planning to do so soon. This is a
>>> concern that was expressed to me by others, I think because inotify
>>> doesn't work with overlayfs.
>> I think shiftfs does work simply because it doesn't really do overlays,
>> so lots of stuff that doesn't work with overlays does work with it.
>>
> I'm afraid shiftfs suffers from the same problems that the old naiive
> overlayfs inode implementation suffered from.
>
> This problem is demonstrated with LTP tests inotify08 inotify09.
> shiftfs_new_inode() is called on every lookup, so inotify watch
> may be set on an inode object, then dentry is evicted from cache
> and then all events on new dentry are not reported on the watched
> inode. You will need to implement hashed inodes to solve it.
> Can be done as overlay does - hashing by real inode pointer.
>
> This is just one of those subtle things about stacked fs and there may
> be other in present and more in future - if we don't have a shared code
> base for the two stacked fs, I wager you are going to end up "cherry
> picking" fixes often.
>
> IMO, an important question to ask is, since both shiftfs and overlayfs
> are strongly coupled with container use cases, are there users that
> are interested in both layering AND shifting? on the same "mark"?
> If the answer is yes, then this may be an argument in favor of
> integrating at least some of shittfs functionality into overlayfs.
>
> Another argument is that shiftfs itself takes the maximum allowed
> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> possible with current VFS limitation to combine shiftfs with overlayfs.

I'm not sure if this was my problem or not, but I did try the v2 
patchset with overlay (by marking and mounting a filesystem tree with 
shiftfs, and then using it as the component of an overlay mount) and 
could not get it to work. I was trying to prepare a demo, but with 
limited time gave up and wasn't able to find time to debug with possible 
help from James.

I think for shiftfs to be useful in certain container contexts, 
combination with overlay is definitely a desirable and/or required 
feature. Of course if it is limited to overlay (e.g. implemented within 
overlayfs) that would be limiting for container use cases for other 
contexts (zfs, btrfs, etc.).
>
> This could be solved relatively easily by adding "-o mark" support
> to overlayfs and allowing to mount shiftfs also over "marked" overlayfs
> inside container.
>
> Anyway, I'm just playing devil's advocate to the idea of two stacked fs
> implementation, so presenting this point of view. I am fully aware that
> there are also plenty of disadvantages to couple two unrelated
> functionalities together.
>
> Cheers,
> Amir.
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 17:11           ` Amir Goldstein
  2018-06-18 19:53             ` Phil Estes
@ 2018-06-21 20:16             ` Seth Forshee
  2018-06-24 11:32               ` Amir Goldstein
  2018-06-25 11:19             ` Christian Brauner
  2018-06-27  7:48             ` James Bottomley
  3 siblings, 1 reply; 33+ messages in thread
From: Seth Forshee @ 2018-06-21 20:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner

On Mon, Jun 18, 2018 at 08:11:52PM +0300, Amir Goldstein wrote:
> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> [...]
> >> > >  - Does not break inotify
> >> >
> >> > I don't expect it does, but I haven't checked.
> >>
> >> I haven't checked either; I'm planning to do so soon. This is a
> >> concern that was expressed to me by others, I think because inotify
> >> doesn't work with overlayfs.
> >
> > I think shiftfs does work simply because it doesn't really do overlays,
> > so lots of stuff that doesn't work with overlays does work with it.
> >
> 
> I'm afraid shiftfs suffers from the same problems that the old naiive
> overlayfs inode implementation suffered from.
> 
> This problem is demonstrated with LTP tests inotify08 inotify09.
> shiftfs_new_inode() is called on every lookup, so inotify watch
> may be set on an inode object, then dentry is evicted from cache
> and then all events on new dentry are not reported on the watched
> inode. You will need to implement hashed inodes to solve it.
> Can be done as overlay does - hashing by real inode pointer.

Thanks for the pointer. I modified the LTP inotify08 test to use shiftfs
instead of overlayfs, and I can confirm that it fails to see the events.

> This is just one of those subtle things about stacked fs and there may
> be other in present and more in future - if we don't have a shared code
> base for the two stacked fs, I wager you are going to end up "cherry
> picking" fixes often.
> 
> IMO, an important question to ask is, since both shiftfs and overlayfs
> are strongly coupled with container use cases, are there users that
> are interested in both layering AND shifting? on the same "mark"?
> If the answer is yes, then this may be an argument in favor of
> integrating at least some of shittfs functionality into overlayfs.
> 
> Another argument is that shiftfs itself takes the maximum allowed
> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> possible with current VFS limitation to combine shiftfs with overlayfs.

That's unfortunate -- it will prevent nested use in addition to use with
overlayfs.

I have been wondering if shiftfs could be made to detect when it was
being layered on top of a shiftfs mount, and do something to make it
behave like it was a single shift on top of the original subtree. I may
look into this. Of course it doesn't help with overlayfs.

> This could be solved relatively easily by adding "-o mark" support
> to overlayfs and allowing to mount shiftfs also over "marked" overlayfs
> inside container.
> 
> Anyway, I'm just playing devil's advocate to the idea of two stacked fs
> implementation, so presenting this point of view. I am fully aware that
> there are also plenty of disadvantages to couple two unrelated
> functionalities together.
> 
> Cheers,
> Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-21 20:16             ` Seth Forshee
@ 2018-06-24 11:32               ` Amir Goldstein
  0 siblings, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2018-06-24 11:32 UTC (permalink / raw)
  To: Seth Forshee
  Cc: James Bottomley, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner

On Thu, Jun 21, 2018 at 11:16 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
[...]
>>
>> Another argument is that shiftfs itself takes the maximum allowed
>> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
>> possible with current VFS limitation to combine shiftfs with overlayfs.
>
> That's unfortunate -- it will prevent nested use in addition to use with
> overlayfs.
>
> I have been wondering if shiftfs could be made to detect when it was
> being layered on top of a shiftfs mount, and do something to make it
> behave like it was a single shift on top of the original subtree. I may
> look into this. Of course it doesn't help with overlayfs.
>

Here's a crazy idea for you - after first mount with -o mark, any
shiftfs_lookup() will return -EIO, i.e. you start the mount inaccesible
to any userns.
Then, inside (maybe nested) userns, you allow userns admin
to remount -o shift, which resets s_user_ns (allowed only once).
Now, any shiftfs operation from the owner userns is allowed and
any operation not from owner userns is denied.

It may be just a crazy idea and I have no idea how hard it would
be to implement, but the upside is that it removes to problem of
being able to access the marked and shifted fs at the same time
because they simply don't exist at the same time - one transforms
into the other in what is hopefully an atomic and secure manner.
Note that at no time, should there exist shiftfs dentries nor inodes
in cache whose sb->s_user_ns is anything other than the shifted
userns (except maybe for the root dentry/inode).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 17:11           ` Amir Goldstein
  2018-06-18 19:53             ` Phil Estes
  2018-06-21 20:16             ` Seth Forshee
@ 2018-06-25 11:19             ` Christian Brauner
  2018-06-27  7:48             ` James Bottomley
  3 siblings, 0 replies; 33+ messages in thread
From: Christian Brauner @ 2018-06-25 11:19 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: James Bottomley, linux-fsdevel, Seth Forshee, Linux Containers,
	Tyler Hicks, Christian Brauner

On Mon, Jun 18, 2018 at 08:11:52PM +0300, Amir Goldstein wrote:
> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> [...]
> >> > >  - Does not break inotify
> >> >
> >> > I don't expect it does, but I haven't checked.
> >>
> >> I haven't checked either; I'm planning to do so soon. This is a
> >> concern that was expressed to me by others, I think because inotify
> >> doesn't work with overlayfs.
> >
> > I think shiftfs does work simply because it doesn't really do overlays,
> > so lots of stuff that doesn't work with overlays does work with it.
> >
> 
> I'm afraid shiftfs suffers from the same problems that the old naiive
> overlayfs inode implementation suffered from.
> 
> This problem is demonstrated with LTP tests inotify08 inotify09.
> shiftfs_new_inode() is called on every lookup, so inotify watch
> may be set on an inode object, then dentry is evicted from cache
> and then all events on new dentry are not reported on the watched
> inode. You will need to implement hashed inodes to solve it.
> Can be done as overlay does - hashing by real inode pointer.
> 
> This is just one of those subtle things about stacked fs and there may
> be other in present and more in future - if we don't have a shared code
> base for the two stacked fs, I wager you are going to end up "cherry
> picking" fixes often.
> 
> IMO, an important question to ask is, since both shiftfs and overlayfs
> are strongly coupled with container use cases, are there users that
> are interested in both layering AND shifting? on the same "mark"?
> If the answer is yes, then this may be an argument in favor of
> integrating at least some of shittfs functionality into overlayfs.

I think that there are use-cases for shiftfs and overlayfs. Especially
since shifts and overlayfs have decent conceptual overlap. They both
provide a mechanism to isolate or at least protect a filesystem tree.
The mechanisms differ but I see no obvious reasons why they couldn't be
combined. That is one of the reasons why we considered it an option to
make shiftfs an extension of overlayfs. But this has been just a thought
experiment so far.
One of the benefits of combining shiftfs and overlayfs I see could be
that the underlay could be made read-only and when combined with shifts
writing a file (e.g. as root) would only hit the upperdir. Although, for
shifts we really also want to allow an option to write through to the
underlay. If done right we also get the nice features of merging
multiple underlays together which seems a good feature to have for
containers. Additionally, it would also allow us to not have to push
another filesystem upstream. (That of course also would not be necessary
if this would be a pure vfs feature which Seth already mentioned in a
prior mail.)

> 
> Another argument is that shiftfs itself takes the maximum allowed
> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> possible with current VFS limitation to combine shiftfs with overlayfs.
> 
> This could be solved relatively easily by adding "-o mark" support
> to overlayfs and allowing to mount shiftfs also over "marked" overlayfs
> inside container.
> 
> Anyway, I'm just playing devil's advocate to the idea of two stacked fs
> implementation, so presenting this point of view. I am fully aware that
> there are also plenty of disadvantages to couple two unrelated
> functionalities together.
> 
> Cheers,
> Amir.
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-18 17:11           ` Amir Goldstein
                               ` (2 preceding siblings ...)
  2018-06-25 11:19             ` Christian Brauner
@ 2018-06-27  7:48             ` James Bottomley
  2018-06-27 10:17               ` Amir Goldstein
  2018-07-03 16:54               ` Serge E. Hallyn
  3 siblings, 2 replies; 33+ messages in thread
From: James Bottomley @ 2018-06-27  7:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner

On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> [...]
> > > > >  - Does not break inotify
> > > > 
> > > > I don't expect it does, but I haven't checked.
> > > 
> > > I haven't checked either; I'm planning to do so soon. This is a
> > > concern that was expressed to me by others, I think because
> > > inotify doesn't work with overlayfs.
> > 
> > I think shiftfs does work simply because it doesn't really do
> > overlays, so lots of stuff that doesn't work with overlays does
> > work with it.
> > 
> 
> I'm afraid shiftfs suffers from the same problems that the old naiive
> overlayfs inode implementation suffered from.
> 
> This problem is demonstrated with LTP tests inotify08 inotify09.
> shiftfs_new_inode() is called on every lookup, so inotify watch
> may be set on an inode object, then dentry is evicted from cache
> and then all events on new dentry are not reported on the watched
> inode. You will need to implement hashed inodes to solve it.
> Can be done as overlay does - hashing by real inode pointer.
> 
> This is just one of those subtle things about stacked fs and there
> may be other in present and more in future - if we don't have a
> shared code base for the two stacked fs, I wager you are going to end
> up "cherry picking" fixes often.
> 
> IMO, an important question to ask is, since both shiftfs and
> overlayfs are strongly coupled with container use cases, are there
> users that are interested in both layering AND shifting? on the same
> "mark"? If the answer is yes, then this may be an argument in favor
> of integrating at least some of shittfs functionality into overlayfs.

My container use case is interested in shifting but not layering.  Even
the docker use case would only mix the two with the overlay graph
driver.  There seem to be quite a few clouds using non overlayfs graph
drivers (the dm one being the most popular).

> Another argument is that shiftfs itself takes the maximum allowed
> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> possible with current VFS limitation to combine shiftfs with
> overlayfs.

That's an artificial, not an inherent, restriction that was introduced
to keep the call stack small.  It can be increased or even eliminated
(although then we'd risk a real run off the end of the kernel stack
problem).

> This could be solved relatively easily by adding "-o mark" support
> to overlayfs and allowing to mount shiftfs also over "marked"
> overlayfs inside container.

Can we please decided whether the temporary mark, as implemented in the
current patch set or a more permanent security.<something> xattr type
mark is preferred for this?  It's an important question that's been
asked, but we have no resolution on.

> Anyway, I'm just playing devil's advocate to the idea of two stacked
> fs implementation, so presenting this point of view. I am fully aware
> that there are also plenty of disadvantages to couple two unrelated
> functionalities together.

The biggest one seems to be that the points at which overlayfs and
shiftfs do credential shifting are subtly different.  That's not to say
they can't be unified, but there's some work to do to prove it's
possible.

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-27  7:48             ` James Bottomley
@ 2018-06-27 10:17               ` Amir Goldstein
  2018-07-03 16:54               ` Serge E. Hallyn
  1 sibling, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2018-06-27 10:17 UTC (permalink / raw)
  To: James Bottomley
  Cc: Seth Forshee, linux-fsdevel, Tyler Hicks, Linux Containers,
	Christian Brauner, Mark Salyzyn, Miklos Szeredi, Vivek Goyal

[cc some overlayfs folks]

On Wed, Jun 27, 2018 at 10:48 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
>> On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
>> <James.Bottomley@hansenpartnership.com> wrote:
>> [...]
>> > > > >  - Does not break inotify
>> > > >
>> > > > I don't expect it does, but I haven't checked.
>> > >
>> > > I haven't checked either; I'm planning to do so soon. This is a
>> > > concern that was expressed to me by others, I think because
>> > > inotify doesn't work with overlayfs.
>> >
>> > I think shiftfs does work simply because it doesn't really do
>> > overlays, so lots of stuff that doesn't work with overlays does
>> > work with it.
>> >
>>
>> I'm afraid shiftfs suffers from the same problems that the old naiive
>> overlayfs inode implementation suffered from.
>>
>> This problem is demonstrated with LTP tests inotify08 inotify09.
>> shiftfs_new_inode() is called on every lookup, so inotify watch
>> may be set on an inode object, then dentry is evicted from cache
>> and then all events on new dentry are not reported on the watched
>> inode. You will need to implement hashed inodes to solve it.
>> Can be done as overlay does - hashing by real inode pointer.
>>
>> This is just one of those subtle things about stacked fs and there
>> may be other in present and more in future - if we don't have a
>> shared code base for the two stacked fs, I wager you are going to end
>> up "cherry picking" fixes often.
>>
>> IMO, an important question to ask is, since both shiftfs and
>> overlayfs are strongly coupled with container use cases, are there
>> users that are interested in both layering AND shifting? on the same
>> "mark"? If the answer is yes, then this may be an argument in favor
>> of integrating at least some of shittfs functionality into overlayfs.
>
> My container use case is interested in shifting but not layering.  Even
> the docker use case would only mix the two with the overlay graph
> driver.  There seem to be quite a few clouds using non overlayfs graph
> drivers (the dm one being the most popular).

To be clear, I did not mean that shifting will be possible only for
overlayfs users, I meant that code base could be shared.

One option is what I did with 'snapshot' fs.
It creates a new filesystem_type, reuses most of the overlayfs
internal structures and many of the overlayfs operations and
implements some of its own:
https://github.com/amir73il/linux/commit/49081195e1c085b0f02d73f619de5ec0f1113f08

One point of similarity with shiftfs, is that snapshot fs
has no lowerdir, only upperdir, e.g.:
   mount -t snapshot none none -oupperdir=/foo

And it only took a single line of change in overlayfs code
(in ovl_posix_acl_xattr_set) to support an upper-only configuration.

>
>> Another argument is that shiftfs itself takes the maximum allowed
>> 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
>> possible with current VFS limitation to combine shiftfs with
>> overlayfs.
>
> That's an artificial, not an inherent, restriction that was introduced
> to keep the call stack small.  It can be increased or even eliminated
> (although then we'd risk a real run off the end of the kernel stack
> problem).
>

Yeh, only need to get that change passed Al ;-)

In an earlier reply I proposed a solution to save 1 stacking layer:
   mount -oremount,shift

To change credentials shifting from 'mark' (=deny all access)
to 'shift' (=shift to re-mounter user_ns).

>> This could be solved relatively easily by adding "-o mark" support
>> to overlayfs and allowing to mount shiftfs also over "marked"
>> overlayfs inside container.
>
> Can we please decided whether the temporary mark, as implemented in the
> current patch set or a more permanent security.<something> xattr type
> mark is preferred for this?  It's an important question that's been
> asked, but we have no resolution on.
>

My 2 cents: it is generally considered a bad idea to let users
mess around with overlay upper/work dirs directly, but kernel doesn't do
anything to prevent that. It is completely up to the machine
admin or container runtime to setup security policies to prevent that.

>> Anyway, I'm just playing devil's advocate to the idea of two stacked
>> fs implementation, so presenting this point of view. I am fully aware
>> that there are also plenty of disadvantages to couple two unrelated
>> functionalities together.
>
> The biggest one seems to be that the points at which overlayfs and
> shiftfs do credential shifting are subtly different.  That's not to say
> they can't be unified, but there's some work to do to prove it's
> possible.
>

That is true. I'd like to point your attention to this patch from
Android developers that intends to modify the existing credential
shitfting behavior of overlayfs:
https://marc.info/?l=linux-unionfs&m=153003238029749&w=2

If we accept a second credential shifting paradigm, its an opening
to accept the third one...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-06-27  7:48             ` James Bottomley
  2018-06-27 10:17               ` Amir Goldstein
@ 2018-07-03 16:54               ` Serge E. Hallyn
  2018-07-03 17:08                 ` Stéphane Graber
  1 sibling, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2018-07-03 16:54 UTC (permalink / raw)
  To: James Bottomley
  Cc: Amir Goldstein, linux-fsdevel, Seth Forshee, Linux Containers,
	Tyler Hicks, Christian Brauner

Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > [...]
> > > > > > �- Does not break inotify
> > > > > 
> > > > > I don't expect it does, but I haven't checked.
> > > > 
> > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > concern that was expressed to me by others, I think because
> > > > inotify doesn't work with overlayfs.
> > > 
> > > I think shiftfs does work simply because it doesn't really do
> > > overlays, so lots of stuff that doesn't work with overlays does
> > > work with it.
> > > 
> > 
> > I'm afraid shiftfs suffers from the same problems that the old naiive
> > overlayfs inode implementation suffered from.
> > 
> > This problem is demonstrated with LTP tests inotify08 inotify09.
> > shiftfs_new_inode() is called on every lookup, so inotify watch
> > may be set on an inode object, then dentry is evicted from cache
> > and then all events on new dentry are not reported on the watched
> > inode. You will need to implement hashed inodes to solve it.
> > Can be done as overlay does - hashing by real inode pointer.
> > 
> > This is just one of those subtle things about stacked fs and there
> > may be other in present and more in future - if we don't have a
> > shared code base for the two stacked fs, I wager you are going to end
> > up "cherry picking" fixes often.
> > 
> > IMO, an important question to ask is, since both shiftfs and
> > overlayfs are strongly coupled with container use cases, are there
> > users that are interested in both layering AND shifting? on the same
> > "mark"? If the answer is yes, then this may be an argument in favor
> > of integrating at least some of shittfs functionality into overlayfs.
> 
> My container use case is interested in shifting but not layering.  Even
> the docker use case would only mix the two with the overlay graph
> driver.  There seem to be quite a few clouds using non overlayfs graph
> drivers (the dm one being the most popular).
> 
> > Another argument is that shiftfs itself takes the maximum allowed
> > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> > possible with current VFS limitation to combine shiftfs with
> > overlayfs.
> 
> That's an artificial, not an inherent, restriction that was introduced
> to keep the call stack small.  It can be increased or even eliminated
> (although then we'd risk a real run off the end of the kernel stack
> problem).
> 
> > This could be solved relatively easily by adding "-o mark" support
> > to overlayfs and allowing to mount shiftfs also over "marked"
> > overlayfs inside container.
> 
> Can we please decided whether the temporary mark, as implemented in the
> current patch set or a more permanent security.<something> xattr type
> mark is preferred for this?  It's an important question that's been
> asked, but we have no resolution on.

I think something permanent is mandatory.  Otherwise users may be able
to induce a reboot into a state where the temp mark isn't made.  A
security.<something> xattr has the problem that an older kernel may not
know about it.

Two possibilities which have been mentioned before:

1. just demand that the *source* idmap doesn't start at 0.  Ideally it
would be something like 100k uids starting at 100k.  The kernel would
refuse to do a shiftfs mount if the source idmap includes uid 0.

I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
to just strongly recommend using such a source uid mapping, but not
enforce it.

2. Enforce that the base directory have perms 700 for shiftfs-mount to
be allowed.  So /var/lib/shiftfs/base-rootfs might be root-owned with
700 perms.  Root then can shiftfs-mount it to /container1 uid-shifted to
100000:0:100000.  Yes, root could stupidly change the perms later, but
failing that uid 1000 can never exploit a setuid-root script under
/var/lib/shiftfs/base-rootfs

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-07-03 16:54               ` Serge E. Hallyn
@ 2018-07-03 17:08                 ` Stéphane Graber
  2018-07-03 22:05                   ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Stéphane Graber @ 2018-07-03 17:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: James Bottomley, Tyler Hicks, Linux Containers, Seth Forshee,
	Christian Brauner, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5243 bytes --]

On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote:
> Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > > [...]
> > > > > > >  - Does not break inotify
> > > > > > 
> > > > > > I don't expect it does, but I haven't checked.
> > > > > 
> > > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > > concern that was expressed to me by others, I think because
> > > > > inotify doesn't work with overlayfs.
> > > > 
> > > > I think shiftfs does work simply because it doesn't really do
> > > > overlays, so lots of stuff that doesn't work with overlays does
> > > > work with it.
> > > > 
> > > 
> > > I'm afraid shiftfs suffers from the same problems that the old naiive
> > > overlayfs inode implementation suffered from.
> > > 
> > > This problem is demonstrated with LTP tests inotify08 inotify09.
> > > shiftfs_new_inode() is called on every lookup, so inotify watch
> > > may be set on an inode object, then dentry is evicted from cache
> > > and then all events on new dentry are not reported on the watched
> > > inode. You will need to implement hashed inodes to solve it.
> > > Can be done as overlay does - hashing by real inode pointer.
> > > 
> > > This is just one of those subtle things about stacked fs and there
> > > may be other in present and more in future - if we don't have a
> > > shared code base for the two stacked fs, I wager you are going to end
> > > up "cherry picking" fixes often.
> > > 
> > > IMO, an important question to ask is, since both shiftfs and
> > > overlayfs are strongly coupled with container use cases, are there
> > > users that are interested in both layering AND shifting? on the same
> > > "mark"? If the answer is yes, then this may be an argument in favor
> > > of integrating at least some of shittfs functionality into overlayfs.
> > 
> > My container use case is interested in shifting but not layering.  Even
> > the docker use case would only mix the two with the overlay graph
> > driver.  There seem to be quite a few clouds using non overlayfs graph
> > drivers (the dm one being the most popular).
> > 
> > > Another argument is that shiftfs itself takes the maximum allowed
> > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> > > possible with current VFS limitation to combine shiftfs with
> > > overlayfs.
> > 
> > That's an artificial, not an inherent, restriction that was introduced
> > to keep the call stack small.  It can be increased or even eliminated
> > (although then we'd risk a real run off the end of the kernel stack
> > problem).
> > 
> > > This could be solved relatively easily by adding "-o mark" support
> > > to overlayfs and allowing to mount shiftfs also over "marked"
> > > overlayfs inside container.
> > 
> > Can we please decided whether the temporary mark, as implemented in the
> > current patch set or a more permanent security.<something> xattr type
> > mark is preferred for this?  It's an important question that's been
> > asked, but we have no resolution on.
> 
> I think something permanent is mandatory.  Otherwise users may be able
> to induce a reboot into a state where the temp mark isn't made.  A
> security.<something> xattr has the problem that an older kernel may not
> know about it.
> 
> Two possibilities which have been mentioned before:
> 
> 1. just demand that the *source* idmap doesn't start at 0.  Ideally it
> would be something like 100k uids starting at 100k.  The kernel would
> refuse to do a shiftfs mount if the source idmap includes uid 0.
> 
> I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
> to just strongly recommend using such a source uid mapping, but not
> enforce it.
> 
> 2. Enforce that the base directory have perms 700 for shiftfs-mount to
> be allowed.  So /var/lib/shiftfs/base-rootfs might be root-owned with
> 700 perms.  Root then can shiftfs-mount it to /container1 uid-shifted to
> 100000:0:100000.  Yes, root could stupidly change the perms later, but
> failing that uid 1000 can never exploit a setuid-root script under
> /var/lib/shiftfs/base-rootfs
> 
> -serge

I'm personally in favor of this second approach and is what I was hoping to use
for LXD in the future. It's trivial for us to apply that to the parent
directory of the container (so that the rootfs itself can have a
different mode) and is something we're already doing today for
privileged containers anyway (as those have the exact same issue with
setuid binaries).

Having the data on the host shifted to a map which is never used by any
user on the system would make mistakes less likely to happen but it
would also require you to know ahead of time how many uid/gid you'll
need for all future containers that will use that filesystem. 100k for
example would effectively prevent the use of many network authentication
systems inside the container as those tend to map to the 200k+ uid
range.

-- 
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: shiftfs status and future development
  2018-07-03 17:08                 ` Stéphane Graber
@ 2018-07-03 22:05                   ` Serge E. Hallyn
  0 siblings, 0 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2018-07-03 22:05 UTC (permalink / raw)
  To: Stéphane Graber
  Cc: Serge E. Hallyn, James Bottomley, Tyler Hicks, Linux Containers,
	Seth Forshee, Christian Brauner, linux-fsdevel

Quoting St�phane Graber (stgraber@ubuntu.com):
> On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote:
> > Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > > > <James.Bottomley@hansenpartnership.com> wrote:
> > > > [...]
> > > > > > > > �- Does not break inotify
> > > > > > > 
> > > > > > > I don't expect it does, but I haven't checked.
> > > > > > 
> > > > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > > > concern that was expressed to me by others, I think because
> > > > > > inotify doesn't work with overlayfs.
> > > > > 
> > > > > I think shiftfs does work simply because it doesn't really do
> > > > > overlays, so lots of stuff that doesn't work with overlays does
> > > > > work with it.
> > > > > 
> > > > 
> > > > I'm afraid shiftfs suffers from the same problems that the old naiive
> > > > overlayfs inode implementation suffered from.
> > > > 
> > > > This problem is demonstrated with LTP tests inotify08 inotify09.
> > > > shiftfs_new_inode() is called on every lookup, so inotify watch
> > > > may be set on an inode object, then dentry is evicted from cache
> > > > and then all events on new dentry are not reported on the watched
> > > > inode. You will need to implement hashed inodes to solve it.
> > > > Can be done as overlay does - hashing by real inode pointer.
> > > > 
> > > > This is just one of those subtle things about stacked fs and there
> > > > may be other in present and more in future - if we don't have a
> > > > shared code base for the two stacked fs, I wager you are going to end
> > > > up "cherry picking" fixes often.
> > > > 
> > > > IMO, an important question to ask is, since both shiftfs and
> > > > overlayfs are strongly coupled with container use cases, are there
> > > > users that are interested in both layering AND shifting? on the same
> > > > "mark"? If the answer is yes, then this may be an argument in favor
> > > > of integrating at least some of shittfs functionality into overlayfs.
> > > 
> > > My container use case is interested in shifting but not layering.  Even
> > > the docker use case would only mix the two with the overlay graph
> > > driver.  There seem to be quite a few clouds using non overlayfs graph
> > > drivers (the dm one being the most popular).
> > > 
> > > > Another argument is that shiftfs itself takes the maximum allowed
> > > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not
> > > > possible with current VFS limitation to combine shiftfs with
> > > > overlayfs.
> > > 
> > > That's an artificial, not an inherent, restriction that was introduced
> > > to keep the call stack small.  It can be increased or even eliminated
> > > (although then we'd risk a real run off the end of the kernel stack
> > > problem).
> > > 
> > > > This could be solved relatively easily by adding "-o mark" support
> > > > to overlayfs and allowing to mount shiftfs also over "marked"
> > > > overlayfs inside container.
> > > 
> > > Can we please decided whether the temporary mark, as implemented in the
> > > current patch set or a more permanent security.<something> xattr type
> > > mark is preferred for this?  It's an important question that's been
> > > asked, but we have no resolution on.
> > 
> > I think something permanent is mandatory.  Otherwise users may be able
> > to induce a reboot into a state where the temp mark isn't made.  A
> > security.<something> xattr has the problem that an older kernel may not
> > know about it.
> > 
> > Two possibilities which have been mentioned before:
> > 
> > 1. just demand that the *source* idmap doesn't start at 0.  Ideally it
> > would be something like 100k uids starting at 100k.  The kernel would
> > refuse to do a shiftfs mount if the source idmap includes uid 0.
> > 
> > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
> > to just strongly recommend using such a source uid mapping, but not
> > enforce it.
> > 
> > 2. Enforce that the base directory have perms 700 for shiftfs-mount to
> > be allowed.  So /var/lib/shiftfs/base-rootfs might be root-owned with
> > 700 perms.  Root then can shiftfs-mount it to /container1 uid-shifted to
> > 100000:0:100000.  Yes, root could stupidly change the perms later, but
> > failing that uid 1000 can never exploit a setuid-root script under
> > /var/lib/shiftfs/base-rootfs
> > 
> > -serge
> 
> I'm personally in favor of this second approach and is what I was hoping to use
> for LXD in the future. It's trivial for us to apply that to the parent
> directory of the container (so that the rootfs itself can have a
> different mode) and is something we're already doing today for
> privileged containers anyway (as those have the exact same issue with
> setuid binaries).
> 
> Having the data on the host shifted to a map which is never used by any
> user on the system would make mistakes less likely to happen but it
> would also require you to know ahead of time how many uid/gid you'll
> need for all future containers that will use that filesystem. 100k for
> example would effectively prevent the use of many network authentication
> systems inside the container as those tend to map to the 200k+ uid
> range.

Well, maybe not if it's overlay-based.  In that case we could do
something where the base fs might only have 10k uids mapped, but the
overlay has 300k.  But that gets a bit funky :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-07-03 22:05 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-14 18:44 shiftfs status and future development Seth Forshee
2018-06-15 13:56 ` Serge E. Hallyn
2018-06-15 14:59   ` Seth Forshee
2018-06-15 15:25     ` Matthew Wilcox
2018-06-15 15:56       ` Aleksa Sarai
2018-06-15 16:09       ` James Bottomley
2018-06-15 17:04         ` Aleksa Sarai
2018-06-15 17:22           ` James Bottomley
2018-06-15 20:47             ` Seth Forshee
2018-06-15 21:09               ` James Bottomley
2018-06-15 21:35                 ` Seth Forshee
2018-06-16  3:03     ` James Bottomley
2018-06-18 13:40       ` Seth Forshee
2018-06-18 13:49         ` Amir Goldstein
2018-06-18 14:56         ` James Bottomley
2018-06-18 16:03           ` Seth Forshee
2018-06-18 17:11           ` Amir Goldstein
2018-06-18 19:53             ` Phil Estes
2018-06-21 20:16             ` Seth Forshee
2018-06-24 11:32               ` Amir Goldstein
2018-06-25 11:19             ` Christian Brauner
2018-06-27  7:48             ` James Bottomley
2018-06-27 10:17               ` Amir Goldstein
2018-07-03 16:54               ` Serge E. Hallyn
2018-07-03 17:08                 ` Stéphane Graber
2018-07-03 22:05                   ` Serge E. Hallyn
2018-06-15 14:54 ` Aleksa Sarai
2018-06-15 15:05   ` Seth Forshee
2018-06-15 15:28 ` James Bottomley
2018-06-15 15:46   ` Seth Forshee
2018-06-15 16:16     ` Christian Brauner
2018-06-15 16:35     ` James Bottomley
2018-06-15 20:17       ` Seth Forshee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).