All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 1/3] vfs: Add inode_sgid_strip() api
@ 2022-03-28  9:56 Yang Xu
  2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Yang Xu @ 2022-03-28  9:56 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel; +Cc: viro, david, jlayton, Yang Xu

inode_sgid_strip() function is used to strip S_ISGID mode
when creat/open/mknod file.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/inode.c         | 12 ++++++++++++
 include/linux/fs.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 63324df6fa27..1f964e7f9698 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2405,3 +2405,15 @@ struct timespec64 current_time(struct inode *inode)
 	return timestamp_truncate(now, inode);
 }
 EXPORT_SYMBOL(current_time);
+
+void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
+		      umode_t *mode)
+{
+	if ((dir && dir->i_mode & S_ISGID) &&
+		(*mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
+		!S_ISDIR(*mode) &&
+		!in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
+		!capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
+		*mode &= ~S_ISGID;
+}
+EXPORT_SYMBOL(inode_sgid_strip);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..639c830ad797 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1921,6 +1921,9 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
 void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		      const struct inode *dir, umode_t mode);
 extern bool may_open_dev(const struct path *path);
+void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
+		      umode_t *mode);
+
 
 /*
  * This is the "filldir" function type, used by readdir() to let
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-28  9:56 [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Yang Xu
@ 2022-03-28  9:56 ` Yang Xu
  2022-03-29 11:08   ` Christian Brauner
  2022-03-29 11:12   ` Jeff Layton
  2022-03-29 10:45 ` [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Christian Brauner
  2022-03-30 16:34 ` Darrick J. Wong
  2 siblings, 2 replies; 12+ messages in thread
From: Yang Xu @ 2022-03-28  9:56 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel; +Cc: viro, david, jlayton, Yang Xu

Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
S_ISGID clear especially umask with S_IXGRP.

Vfs has all the info it needs - it doesn't need the filesystems to do everything
correctly with the mode and ensuring that they order things like posix acl setup
functions correctly with inode_init_owner() to strip the SGID bit.

Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.

Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
this api may change mode by using umask but S_ISGID clear isn't related to
SB_POSIXACL flag.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/inode.c | 4 ----
 fs/namei.c | 7 +++++--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 1f964e7f9698..a2dd71c2437e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
-			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
-			mode &= ~S_ISGID;
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
diff --git a/fs/namei.c b/fs/namei.c
index 3f1829b3ab5b..e68a99e0ac96 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
+		inode_sgid_strip(mnt_userns, dir->d_inode, &mode);
 		if (!IS_POSIXACL(dir->d_inode))
 			mode &= ~current_umask();
 		if (likely(got_write))
@@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
+	inode_sgid_strip(mnt_userns, dir, &mode);
+
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out1;
-
+	mnt_userns = mnt_user_ns(path.mnt);
+	inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
 		goto out2;
 
-	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/3] vfs: Add inode_sgid_strip() api
  2022-03-28  9:56 [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Yang Xu
  2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
@ 2022-03-29 10:45 ` Christian Brauner
  2022-03-29 22:15   ` Dave Chinner
  2022-03-30 16:34 ` Darrick J. Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2022-03-29 10:45 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, viro, david, jlayton

On Mon, Mar 28, 2022 at 05:56:27PM +0800, Yang Xu wrote:
> inode_sgid_strip() function is used to strip S_ISGID mode
> when creat/open/mknod file.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---

I would've personally gone for returning umode_t but this work for me
too.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
@ 2022-03-29 11:08   ` Christian Brauner
  2022-03-29 11:12   ` Jeff Layton
  1 sibling, 0 replies; 12+ messages in thread
From: Christian Brauner @ 2022-03-29 11:08 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, viro, david, jlayton

On Mon, Mar 28, 2022 at 05:56:28PM +0800, Yang Xu wrote:
> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
> S_ISGID clear especially umask with S_IXGRP.
> 
> Vfs has all the info it needs - it doesn't need the filesystems to do everything
> correctly with the mode and ensuring that they order things like posix acl setup
> functions correctly with inode_init_owner() to strip the SGID bit.
> 
> Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
> 
> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> this api may change mode by using umask but S_ISGID clear isn't related to
> SB_POSIXACL flag.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---

I think adding that helper and using it in the vfs already is a good
idea. But I wonder whether leaving this in inode_init_owner() might be
desirable as well. I don't know how likely it is but if any filesystem
is somehow internally creating a new inode without using vfs_*() helpers
and botches the job then inode_init_owner() would still correctly strip
the setgid bit currently for them.

If we think it's a rather low risk then we can simply move the
strippping completely out of inode_init_owner(). If we think that that's
too risky it might be worth adding a new inode_owner() helper that is
called from inode_init_owner() and that filesystem can be switched to
that we know are safe in that regard?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
  2022-03-29 11:08   ` Christian Brauner
@ 2022-03-29 11:12   ` Jeff Layton
  2022-03-29 22:10     ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2022-03-29 11:12 UTC (permalink / raw)
  To: Yang Xu, linux-fsdevel, ceph-devel; +Cc: viro, david

On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
> S_ISGID clear especially umask with S_IXGRP.
> 
> Vfs has all the info it needs - it doesn't need the filesystems to do everything
> correctly with the mode and ensuring that they order things like posix acl setup
> functions correctly with inode_init_owner() to strip the SGID bit.
> 
> Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
> 
> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> this api may change mode by using umask but S_ISGID clear isn't related to
> SB_POSIXACL flag.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
>  fs/inode.c | 4 ----
>  fs/namei.c | 7 +++++--
>  2 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 1f964e7f9698..a2dd71c2437e 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> -			mode &= ~S_ISGID;
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> diff --git a/fs/namei.c b/fs/namei.c
> index 3f1829b3ab5b..e68a99e0ac96 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (open_flag & O_CREAT) {
>  		if (open_flag & O_EXCL)
>  			open_flag &= ~O_TRUNC;
> +		inode_sgid_strip(mnt_userns, dir->d_inode, &mode);
>  		if (!IS_POSIXACL(dir->d_inode))
>  			mode &= ~current_umask();
>  		if (likely(got_write))
> @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>  	child = d_alloc(dentry, &slash_name);
>  	if (unlikely(!child))
>  		goto out_err;
> +	inode_sgid_strip(mnt_userns, dir, &mode);
> +
>  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>  	if (error)
>  		goto out_err;
> @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>  	error = PTR_ERR(dentry);
>  	if (IS_ERR(dentry))
>  		goto out1;
> -
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode);
>  	if (!IS_POSIXACL(path.dentry->d_inode))
>  		mode &= ~current_umask();
>  	error = security_path_mknod(&path, dentry, mode, dev);
>  	if (error)
>  		goto out2;
>  
> -	mnt_userns = mnt_user_ns(path.mnt);
>  	switch (mode & S_IFMT) {
>  		case 0: case S_IFREG:
>  			error = vfs_create(mnt_userns, path.dentry->d_inode,

I haven't gone over this in detail, but have you tested this with NFS at
all?

IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
if this may end up running afoul of that by forcing the client to try
and strip these bits.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-29 11:12   ` Jeff Layton
@ 2022-03-29 22:10     ` Dave Chinner
  2022-03-30 10:44       ` Christian Brauner
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2022-03-29 22:10 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yang Xu, linux-fsdevel, ceph-devel, viro

On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote:
> On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
> > Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> > to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> > firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
> > S_ISGID clear especially umask with S_IXGRP.
> > 
> > Vfs has all the info it needs - it doesn't need the filesystems to do everything
> > correctly with the mode and ensuring that they order things like posix acl setup
> > functions correctly with inode_init_owner() to strip the SGID bit.
> > 
> > Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
> > 
> > Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> > this api may change mode by using umask but S_ISGID clear isn't related to
> > SB_POSIXACL flag.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> > ---
> >  fs/inode.c | 4 ----
> >  fs/namei.c | 7 +++++--
> >  2 files changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 1f964e7f9698..a2dd71c2437e 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> >  		/* Directories are special, and always inherit S_ISGID */
> >  		if (S_ISDIR(mode))
> >  			mode |= S_ISGID;
> > -		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> > -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> > -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> > -			mode &= ~S_ISGID;
> >  	} else
> >  		inode_fsgid_set(inode, mnt_userns);
> >  	inode->i_mode = mode;
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3f1829b3ab5b..e68a99e0ac96 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >  	if (open_flag & O_CREAT) {
> >  		if (open_flag & O_EXCL)
> >  			open_flag &= ~O_TRUNC;
> > +		inode_sgid_strip(mnt_userns, dir->d_inode, &mode);
> >  		if (!IS_POSIXACL(dir->d_inode))
> >  			mode &= ~current_umask();
> >  		if (likely(got_write))
> > @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> >  	child = d_alloc(dentry, &slash_name);
> >  	if (unlikely(!child))
> >  		goto out_err;
> > +	inode_sgid_strip(mnt_userns, dir, &mode);
> > +
> >  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> >  	if (error)
> >  		goto out_err;
> > @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> >  	error = PTR_ERR(dentry);
> >  	if (IS_ERR(dentry))
> >  		goto out1;
> > -
> > +	mnt_userns = mnt_user_ns(path.mnt);
> > +	inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode);
> >  	if (!IS_POSIXACL(path.dentry->d_inode))
> >  		mode &= ~current_umask();
> >  	error = security_path_mknod(&path, dentry, mode, dev);
> >  	if (error)
> >  		goto out2;
> >  
> > -	mnt_userns = mnt_user_ns(path.mnt);
> >  	switch (mode & S_IFMT) {
> >  		case 0: case S_IFREG:
> >  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> 
> I haven't gone over this in detail, but have you tested this with NFS at
> all?
> 
> IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
> if this may end up running afoul of that by forcing the client to try
> and strip these bits.

All it means is that the mode passed to the NFS server for the
create already has the SGID bit stripped from it. It means the
client is no longer reliant on the server behaving correctly to
close this security hole.

That is, failing to strip the SGID bit appropriately in the local
context is a security issue. Hence local machine security requires
that the NFS client should try to strip the SGID to defend against
buggy/unfixed servers that fail to strip it appropriately and
thereby continute to expose the local machine to this SGID security
issue.

That's the problem here - the SGID stripping in inode_init_owner()
is not documented, wasn't reviewed, doesn't work correctly
across all filesystems and leaves nasty security landmines when the VFS
create mode and the stripped inode mode differ.

Various filesystems have workarounds, partial fixes or no fixes for
these issues and landmines. Hence we have a situation where we are
playing whack-a-mole to discover and slap band-aids over all the
places that inode_init_owner() based stripping does not work
correctly.

In XFS, this meant the problem was not orginally fixed by the
silent, unreviewed change to inode_init_owner() in 2018
because it didn't call inode_init_owner() at all. So 4 years after
the bug was "fixed" and the CVE released, we are still exposed to
the bug because *no filesystem people knew about it* and *nobody wrote a
regression test* to check that the probelm was fixed and stayed
fixed.

And now that XFS does call inode_init_owner(), we've subsequently
discovered that XFS still fail when default acls are enabled because
we create the ACL from the mode passed from the VFS, not the
stripped mode that results from inode_init_owner() being called.

See what I mean about landmines?

The fact is this: regardless of which filesystem is in use, failure
to strip the SGID correctly is considered a security failure that
needs to be fixed. The current VFS infrastructure requires the
filesystem to do everything right and not step on any landmines to
strip the SGID bit, when in fact it can easily be done at the VFS
and the filesystems then don't even need to be aware that the SGID
needs to be (or has been stripped) by the operation the user asked
to be done.

We need the architecture to be *secure by design*, not tacked onto
the side like it is now.  We need to stop trying to dance around
these landmines - it is *not working* and we are blowing our own
feet off repeatedly. This hurts a lot (especially in distro land)
so we need to take the responsibility for stripping SGID properly
away from the filesystems and put it where it belongs: in the VFS.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/3] vfs: Add inode_sgid_strip() api
  2022-03-29 10:45 ` [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Christian Brauner
@ 2022-03-29 22:15   ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2022-03-29 22:15 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Yang Xu, linux-fsdevel, ceph-devel, viro, jlayton

On Tue, Mar 29, 2022 at 12:45:16PM +0200, Christian Brauner wrote:
> On Mon, Mar 28, 2022 at 05:56:27PM +0800, Yang Xu wrote:
> > inode_sgid_strip() function is used to strip S_ISGID mode
> > when creat/open/mknod file.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> > ---
> 
> I would've personally gone for returning umode_t but this work for me
> too.

Agreed, that's a much nicer API for this function - it makes it
clear that it can modifying the mode that is passed in.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-29 22:10     ` Dave Chinner
@ 2022-03-30 10:44       ` Christian Brauner
  2022-03-30 16:44         ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2022-03-30 10:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jeff Layton, Yang Xu, linux-fsdevel, ceph-devel, viro

On Wed, Mar 30, 2022 at 09:10:59AM +1100, Dave Chinner wrote:
> On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote:
> > On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
> > > Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> > > to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> > > firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
> > > S_ISGID clear especially umask with S_IXGRP.
> > > 
> > > Vfs has all the info it needs - it doesn't need the filesystems to do everything
> > > correctly with the mode and ensuring that they order things like posix acl setup
> > > functions correctly with inode_init_owner() to strip the SGID bit.
> > > 
> > > Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
> > > 
> > > Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> > > this api may change mode by using umask but S_ISGID clear isn't related to
> > > SB_POSIXACL flag.
> > > 
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> > > ---
> > >  fs/inode.c | 4 ----
> > >  fs/namei.c | 7 +++++--
> > >  2 files changed, 5 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 1f964e7f9698..a2dd71c2437e 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> > >  		/* Directories are special, and always inherit S_ISGID */
> > >  		if (S_ISDIR(mode))
> > >  			mode |= S_ISGID;
> > > -		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> > > -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> > > -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> > > -			mode &= ~S_ISGID;
> > >  	} else
> > >  		inode_fsgid_set(inode, mnt_userns);
> > >  	inode->i_mode = mode;
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 3f1829b3ab5b..e68a99e0ac96 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > >  	if (open_flag & O_CREAT) {
> > >  		if (open_flag & O_EXCL)
> > >  			open_flag &= ~O_TRUNC;
> > > +		inode_sgid_strip(mnt_userns, dir->d_inode, &mode);
> > >  		if (!IS_POSIXACL(dir->d_inode))
> > >  			mode &= ~current_umask();
> > >  		if (likely(got_write))
> > > @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> > >  	child = d_alloc(dentry, &slash_name);
> > >  	if (unlikely(!child))
> > >  		goto out_err;
> > > +	inode_sgid_strip(mnt_userns, dir, &mode);
> > > +
> > >  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> > >  	if (error)
> > >  		goto out_err;
> > > @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> > >  	error = PTR_ERR(dentry);
> > >  	if (IS_ERR(dentry))
> > >  		goto out1;
> > > -
> > > +	mnt_userns = mnt_user_ns(path.mnt);
> > > +	inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode);
> > >  	if (!IS_POSIXACL(path.dentry->d_inode))
> > >  		mode &= ~current_umask();
> > >  	error = security_path_mknod(&path, dentry, mode, dev);
> > >  	if (error)
> > >  		goto out2;
> > >  
> > > -	mnt_userns = mnt_user_ns(path.mnt);
> > >  	switch (mode & S_IFMT) {
> > >  		case 0: case S_IFREG:
> > >  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> > 
> > I haven't gone over this in detail, but have you tested this with NFS at
> > all?
> > 
> > IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
> > if this may end up running afoul of that by forcing the client to try
> > and strip these bits.
> 
> All it means is that the mode passed to the NFS server for the
> create already has the SGID bit stripped from it. It means the
> client is no longer reliant on the server behaving correctly to
> close this security hole.
> 
> That is, failing to strip the SGID bit appropriately in the local
> context is a security issue. Hence local machine security requires
> that the NFS client should try to strip the SGID to defend against
> buggy/unfixed servers that fail to strip it appropriately and
> thereby continute to expose the local machine to this SGID security
> issue.
> 
> That's the problem here - the SGID stripping in inode_init_owner()
> is not documented, wasn't reviewed, doesn't work correctly
> across all filesystems and leaves nasty security landmines when the VFS
> create mode and the stripped inode mode differ.
> 
> Various filesystems have workarounds, partial fixes or no fixes for
> these issues and landmines. Hence we have a situation where we are
> playing whack-a-mole to discover and slap band-aids over all the
> places that inode_init_owner() based stripping does not work
> correctly.
> 
> In XFS, this meant the problem was not orginally fixed by the
> silent, unreviewed change to inode_init_owner() in 2018
> because it didn't call inode_init_owner() at all. So 4 years after
> the bug was "fixed" and the CVE released, we are still exposed to
> the bug because *no filesystem people knew about it* and *nobody wrote a
> regression test* to check that the probelm was fixed and stayed
> fixed.
> 
> And now that XFS does call inode_init_owner(), we've subsequently
> discovered that XFS still fail when default acls are enabled because
> we create the ACL from the mode passed from the VFS, not the
> stripped mode that results from inode_init_owner() being called.
> 
> See what I mean about landmines?
> 
> The fact is this: regardless of which filesystem is in use, failure
> to strip the SGID correctly is considered a security failure that
> needs to be fixed. The current VFS infrastructure requires the
> filesystem to do everything right and not step on any landmines to
> strip the SGID bit, when in fact it can easily be done at the VFS
> and the filesystems then don't even need to be aware that the SGID
> needs to be (or has been stripped) by the operation the user asked
> to be done.
> 
> We need the architecture to be *secure by design*, not tacked onto
> the side like it is now.  We need to stop trying to dance around
> these landmines - it is *not working* and we are blowing our own
> feet off repeatedly. This hurts a lot (especially in distro land)
> so we need to take the responsibility for stripping SGID properly
> away from the filesystems and put it where it belongs: in the VFS.

I agree. When I added tests for set*id stripping to xfstests for the
sake of getting complete vfs coverage of idmapped mounts in generic/633
I immediately found bugs. Once I made the testsuite useable by all
filesystems we started seeing more.

I think we should add and use the new proposed stripping helper in the
vfs - albeit with a slightly changed api and also use it in
inode_init_owner(). While it is a delicate change in the worst case we
end up removing additional privileges that's an acceptable regression
risk to take.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/3] vfs: Add inode_sgid_strip() api
  2022-03-28  9:56 [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Yang Xu
  2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
  2022-03-29 10:45 ` [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Christian Brauner
@ 2022-03-30 16:34 ` Darrick J. Wong
  2022-03-31  1:49   ` xuyang2018.jy
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2022-03-30 16:34 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, viro, david, jlayton

On Mon, Mar 28, 2022 at 05:56:27PM +0800, Yang Xu wrote:
> inode_sgid_strip() function is used to strip S_ISGID mode
> when creat/open/mknod file.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
>  fs/inode.c         | 12 ++++++++++++
>  include/linux/fs.h |  3 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 63324df6fa27..1f964e7f9698 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2405,3 +2405,15 @@ struct timespec64 current_time(struct inode *inode)
>  	return timestamp_truncate(now, inode);
>  }
>  EXPORT_SYMBOL(current_time);
> +
> +void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
> +		      umode_t *mode)
> +{
> +	if ((dir && dir->i_mode & S_ISGID) &&
> +		(*mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> +		!S_ISDIR(*mode) &&
> +		!in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> +		!capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> +		*mode &= ~S_ISGID;

A couple of style nits here:

The secondary if test clauses have the same indentation level as the
code that actually gets executed, which makes this harder to scan
visually.

	if ((dir && dir->i_mode & S_ISGID) &&
	    (*mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
	    !S_ISDIR(*mode) &&
	    !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
	    !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
		*mode &= ~S_ISGID;

Alternately, you could use inverse logic to bail out early:

void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
		      umode_t *mode)
{
	if (!dir || !(dir->i_mode & S_ISGID))
		return;
	if (S_ISDIR(*mode))
		return;
	if (in_group_p(...))
		return;
	if (capable_wrt_inode_uidgid(...))
		return;

	*mode &= ~S_ISGID;
}

Though I suppose that /is/ much longer.

The bigger thing here is that I'd like to see this patch hoist the ISGID
stripping code out of init_inode_owner so that it's easier to verify
that the new helper does exactly the same thing as the old code.  The
second patch would then add callsites around the VFS as necessary to
prevent this problem from happening again.

--D

> +}
> +EXPORT_SYMBOL(inode_sgid_strip);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e2d892b201b0..639c830ad797 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1921,6 +1921,9 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
>  void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		      const struct inode *dir, umode_t mode);
>  extern bool may_open_dev(const struct path *path);
> +void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
> +		      umode_t *mode);
> +
>  
>  /*
>   * This is the "filldir" function type, used by readdir() to let
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-30 10:44       ` Christian Brauner
@ 2022-03-30 16:44         ` Darrick J. Wong
  2022-03-31  9:30           ` xuyang2018.jy
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2022-03-30 16:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Jeff Layton, Yang Xu, linux-fsdevel, ceph-devel, viro

On Wed, Mar 30, 2022 at 12:44:19PM +0200, Christian Brauner wrote:
> On Wed, Mar 30, 2022 at 09:10:59AM +1100, Dave Chinner wrote:
> > On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote:
> > > On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
> > > > Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> > > > to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> > > > firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
> > > > S_ISGID clear especially umask with S_IXGRP.
> > > > 
> > > > Vfs has all the info it needs - it doesn't need the filesystems to do everything
> > > > correctly with the mode and ensuring that they order things like posix acl setup
> > > > functions correctly with inode_init_owner() to strip the SGID bit.
> > > > 
> > > > Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
> > > > 
> > > > Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> > > > this api may change mode by using umask but S_ISGID clear isn't related to
> > > > SB_POSIXACL flag.
> > > > 
> > > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > > Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> > > > ---
> > > >  fs/inode.c | 4 ----
> > > >  fs/namei.c | 7 +++++--
> > > >  2 files changed, 5 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index 1f964e7f9698..a2dd71c2437e 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> > > >  		/* Directories are special, and always inherit S_ISGID */
> > > >  		if (S_ISDIR(mode))
> > > >  			mode |= S_ISGID;
> > > > -		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> > > > -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> > > > -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> > > > -			mode &= ~S_ISGID;
> > > >  	} else
> > > >  		inode_fsgid_set(inode, mnt_userns);
> > > >  	inode->i_mode = mode;
> > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > index 3f1829b3ab5b..e68a99e0ac96 100644
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > > >  	if (open_flag & O_CREAT) {
> > > >  		if (open_flag & O_EXCL)
> > > >  			open_flag &= ~O_TRUNC;
> > > > +		inode_sgid_strip(mnt_userns, dir->d_inode, &mode);
> > > >  		if (!IS_POSIXACL(dir->d_inode))
> > > >  			mode &= ~current_umask();
> > > >  		if (likely(got_write))
> > > > @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> > > >  	child = d_alloc(dentry, &slash_name);
> > > >  	if (unlikely(!child))
> > > >  		goto out_err;
> > > > +	inode_sgid_strip(mnt_userns, dir, &mode);
> > > > +
> > > >  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> > > >  	if (error)
> > > >  		goto out_err;
> > > > @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> > > >  	error = PTR_ERR(dentry);
> > > >  	if (IS_ERR(dentry))
> > > >  		goto out1;
> > > > -
> > > > +	mnt_userns = mnt_user_ns(path.mnt);
> > > > +	inode_sgid_strip(mnt_userns, path.dentry->d_inode, &mode);
> > > >  	if (!IS_POSIXACL(path.dentry->d_inode))
> > > >  		mode &= ~current_umask();
> > > >  	error = security_path_mknod(&path, dentry, mode, dev);
> > > >  	if (error)
> > > >  		goto out2;
> > > >  
> > > > -	mnt_userns = mnt_user_ns(path.mnt);
> > > >  	switch (mode & S_IFMT) {
> > > >  		case 0: case S_IFREG:
> > > >  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> > > 
> > > I haven't gone over this in detail, but have you tested this with NFS at
> > > all?
> > > 
> > > IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
> > > if this may end up running afoul of that by forcing the client to try
> > > and strip these bits.
> > 
> > All it means is that the mode passed to the NFS server for the
> > create already has the SGID bit stripped from it. It means the
> > client is no longer reliant on the server behaving correctly to
> > close this security hole.
> > 
> > That is, failing to strip the SGID bit appropriately in the local
> > context is a security issue. Hence local machine security requires
> > that the NFS client should try to strip the SGID to defend against
> > buggy/unfixed servers that fail to strip it appropriately and
> > thereby continute to expose the local machine to this SGID security
> > issue.
> > 
> > That's the problem here - the SGID stripping in inode_init_owner()
> > is not documented, wasn't reviewed, doesn't work correctly
> > across all filesystems and leaves nasty security landmines when the VFS
> > create mode and the stripped inode mode differ.
> > 
> > Various filesystems have workarounds, partial fixes or no fixes for
> > these issues and landmines. Hence we have a situation where we are
> > playing whack-a-mole to discover and slap band-aids over all the
> > places that inode_init_owner() based stripping does not work
> > correctly.
> > 
> > In XFS, this meant the problem was not orginally fixed by the
> > silent, unreviewed change to inode_init_owner() in 2018
> > because it didn't call inode_init_owner() at all. So 4 years after
> > the bug was "fixed" and the CVE released, we are still exposed to
> > the bug because *no filesystem people knew about it* and *nobody wrote a
> > regression test* to check that the probelm was fixed and stayed
> > fixed.
> > 
> > And now that XFS does call inode_init_owner(), we've subsequently
> > discovered that XFS still fail when default acls are enabled because
> > we create the ACL from the mode passed from the VFS, not the
> > stripped mode that results from inode_init_owner() being called.
> > 
> > See what I mean about landmines?
> > 
> > The fact is this: regardless of which filesystem is in use, failure
> > to strip the SGID correctly is considered a security failure that
> > needs to be fixed. The current VFS infrastructure requires the
> > filesystem to do everything right and not step on any landmines to
> > strip the SGID bit, when in fact it can easily be done at the VFS
> > and the filesystems then don't even need to be aware that the SGID
> > needs to be (or has been stripped) by the operation the user asked
> > to be done.
> > 
> > We need the architecture to be *secure by design*, not tacked onto
> > the side like it is now.  We need to stop trying to dance around
> > these landmines - it is *not working* and we are blowing our own
> > feet off repeatedly. This hurts a lot (especially in distro land)
> > so we need to take the responsibility for stripping SGID properly
> > away from the filesystems and put it where it belongs: in the VFS.
> 
> I agree. When I added tests for set*id stripping to xfstests for the
> sake of getting complete vfs coverage of idmapped mounts in generic/633
> I immediately found bugs. Once I made the testsuite useable by all
> filesystems we started seeing more.
> 
> I think we should add and use the new proposed stripping helper in the
> vfs - albeit with a slightly changed api and also use it in
> inode_init_owner(). While it is a delicate change in the worst case we
> end up removing additional privileges that's an acceptable regression
> risk to take.

And if it's not too much trouble, can we add an fstest to encode our
current expectations about how setgid inheritance works?  I would really
like to reduce the need for historic setgid behavior spelunking. ;)

--D

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/3] vfs: Add inode_sgid_strip() api
  2022-03-30 16:34 ` Darrick J. Wong
@ 2022-03-31  1:49   ` xuyang2018.jy
  0 siblings, 0 replies; 12+ messages in thread
From: xuyang2018.jy @ 2022-03-31  1:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, ceph-devel, viro, david, jlayton

on 2022/3/31 0:34, Darrick J. Wong wrote:
> On Mon, Mar 28, 2022 at 05:56:27PM +0800, Yang Xu wrote:
>> inode_sgid_strip() function is used to strip S_ISGID mode
>> when creat/open/mknod file.
>>
>> Suggested-by: Dave Chinner<david@fromorbit.com>
>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>> ---
>>   fs/inode.c         | 12 ++++++++++++
>>   include/linux/fs.h |  3 +++
>>   2 files changed, 15 insertions(+)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 63324df6fa27..1f964e7f9698 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2405,3 +2405,15 @@ struct timespec64 current_time(struct inode *inode)
>>   	return timestamp_truncate(now, inode);
>>   }
>>   EXPORT_SYMBOL(current_time);
>> +
>> +void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
>> +		      umode_t *mode)
>> +{
>> +	if ((dir&&  dir->i_mode&  S_ISGID)&&
>> +		(*mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
>> +		!S_ISDIR(*mode)&&
>> +		!in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
>> +		!capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
>> +		*mode&= ~S_ISGID;
>
> A couple of style nits here:
>
> The secondary if test clauses have the same indentation level as the
> code that actually gets executed, which makes this harder to scan
> visually.
>
> 	if ((dir&&  dir->i_mode&  S_ISGID)&&
> 	(*mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
> 	!S_ISDIR(*mode)&&
> 	!in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
> 	!capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> 		*mode&= ~S_ISGID;
>
> Alternately, you could use inverse logic to bail out early:
>
> void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
> 		      umode_t *mode)
> {
> 	if (!dir || !(dir->i_mode&  S_ISGID))
> 		return;
> 	if (S_ISDIR(*mode))
> 		return;
> 	if (in_group_p(...))
> 		return;
> 	if (capable_wrt_inode_uidgid(...))
> 		return;
>
> 	*mode&= ~S_ISGID;
> }
>
> Though I suppose that /is/ much longer.
>
> The bigger thing here is that I'd like to see this patch hoist the ISGID
> stripping code out of init_inode_owner so that it's easier to verify
> that the new helper does exactly the same thing as the old code.  The
> second patch would then add callsites around the VFS as necessary to
> prevent this problem from happening again.

Sounds reasonable. Will do it in v2.

Best Regards
Yang Xu
>
> --D
>
>> +}
>> +EXPORT_SYMBOL(inode_sgid_strip);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index e2d892b201b0..639c830ad797 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1921,6 +1921,9 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
>>   void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		      const struct inode *dir, umode_t mode);
>>   extern bool may_open_dev(const struct path *path);
>> +void inode_sgid_strip(struct user_namespace *mnt_userns, struct inode *dir,
>> +		      umode_t *mode);
>> +
>>
>>   /*
>>    * This is the "filldir" function type, used by readdir() to let
>> --
>> 2.27.0
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem
  2022-03-30 16:44         ` Darrick J. Wong
@ 2022-03-31  9:30           ` xuyang2018.jy
  0 siblings, 0 replies; 12+ messages in thread
From: xuyang2018.jy @ 2022-03-31  9:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, Dave Chinner, Jeff Layton, linux-fsdevel,
	ceph-devel, viro

on 2022/3/31 0:44, Darrick J. Wong wrote:
> On Wed, Mar 30, 2022 at 12:44:19PM +0200, Christian Brauner wrote:
>> On Wed, Mar 30, 2022 at 09:10:59AM +1100, Dave Chinner wrote:
>>> On Tue, Mar 29, 2022 at 07:12:11AM -0400, Jeff Layton wrote:
>>>> On Mon, 2022-03-28 at 17:56 +0800, Yang Xu wrote:
>>>>> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
>>>>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
>>>>> firstly, then posxi acl setup, but xfs uses the contrary order. It will affect
>>>>> S_ISGID clear especially umask with S_IXGRP.
>>>>>
>>>>> Vfs has all the info it needs - it doesn't need the filesystems to do everything
>>>>> correctly with the mode and ensuring that they order things like posix acl setup
>>>>> functions correctly with inode_init_owner() to strip the SGID bit.
>>>>>
>>>>> Just strip the SGID bit at the VFS, and then the filesystems can't get it wrong.
>>>>>
>>>>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
>>>>> this api may change mode by using umask but S_ISGID clear isn't related to
>>>>> SB_POSIXACL flag.
>>>>>
>>>>> Suggested-by: Dave Chinner<david@fromorbit.com>
>>>>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>>>>> ---
>>>>>   fs/inode.c | 4 ----
>>>>>   fs/namei.c | 7 +++++--
>>>>>   2 files changed, 5 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/fs/inode.c b/fs/inode.c
>>>>> index 1f964e7f9698..a2dd71c2437e 100644
>>>>> --- a/fs/inode.c
>>>>> +++ b/fs/inode.c
>>>>> @@ -2246,10 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>>>>   		/* Directories are special, and always inherit S_ISGID */
>>>>>   		if (S_ISDIR(mode))
>>>>>   			mode |= S_ISGID;
>>>>> -		else if ((mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
>>>>> -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
>>>>> -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
>>>>> -			mode&= ~S_ISGID;
>>>>>   	} else
>>>>>   		inode_fsgid_set(inode, mnt_userns);
>>>>>   	inode->i_mode = mode;
>>>>> diff --git a/fs/namei.c b/fs/namei.c
>>>>> index 3f1829b3ab5b..e68a99e0ac96 100644
>>>>> --- a/fs/namei.c
>>>>> +++ b/fs/namei.c
>>>>> @@ -3287,6 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>>>>   	if (open_flag&  O_CREAT) {
>>>>>   		if (open_flag&  O_EXCL)
>>>>>   			open_flag&= ~O_TRUNC;
>>>>> +		inode_sgid_strip(mnt_userns, dir->d_inode,&mode);
>>>>>   		if (!IS_POSIXACL(dir->d_inode))
>>>>>   			mode&= ~current_umask();
>>>>>   		if (likely(got_write))
>>>>> @@ -3521,6 +3522,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>>>>   	child = d_alloc(dentry,&slash_name);
>>>>>   	if (unlikely(!child))
>>>>>   		goto out_err;
>>>>> +	inode_sgid_strip(mnt_userns, dir,&mode);
>>>>> +
>>>>>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>>>>   	if (error)
>>>>>   		goto out_err;
>>>>> @@ -3849,14 +3852,14 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>>>>   	error = PTR_ERR(dentry);
>>>>>   	if (IS_ERR(dentry))
>>>>>   		goto out1;
>>>>> -
>>>>> +	mnt_userns = mnt_user_ns(path.mnt);
>>>>> +	inode_sgid_strip(mnt_userns, path.dentry->d_inode,&mode);
>>>>>   	if (!IS_POSIXACL(path.dentry->d_inode))
>>>>>   		mode&= ~current_umask();
>>>>>   	error = security_path_mknod(&path, dentry, mode, dev);
>>>>>   	if (error)
>>>>>   		goto out2;
>>>>>
>>>>> -	mnt_userns = mnt_user_ns(path.mnt);
>>>>>   	switch (mode&  S_IFMT) {
>>>>>   		case 0: case S_IFREG:
>>>>>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
>>>>
>>>> I haven't gone over this in detail, but have you tested this with NFS at
>>>> all?
>>>>
>>>> IIRC, NFS has to leave setuid/gid stripping to the server, so I wonder
>>>> if this may end up running afoul of that by forcing the client to try
>>>> and strip these bits.
>>>
>>> All it means is that the mode passed to the NFS server for the
>>> create already has the SGID bit stripped from it. It means the
>>> client is no longer reliant on the server behaving correctly to
>>> close this security hole.
>>>
>>> That is, failing to strip the SGID bit appropriately in the local
>>> context is a security issue. Hence local machine security requires
>>> that the NFS client should try to strip the SGID to defend against
>>> buggy/unfixed servers that fail to strip it appropriately and
>>> thereby continute to expose the local machine to this SGID security
>>> issue.
>>>
>>> That's the problem here - the SGID stripping in inode_init_owner()
>>> is not documented, wasn't reviewed, doesn't work correctly
>>> across all filesystems and leaves nasty security landmines when the VFS
>>> create mode and the stripped inode mode differ.
>>>
>>> Various filesystems have workarounds, partial fixes or no fixes for
>>> these issues and landmines. Hence we have a situation where we are
>>> playing whack-a-mole to discover and slap band-aids over all the
>>> places that inode_init_owner() based stripping does not work
>>> correctly.
>>>
>>> In XFS, this meant the problem was not orginally fixed by the
>>> silent, unreviewed change to inode_init_owner() in 2018
>>> because it didn't call inode_init_owner() at all. So 4 years after
>>> the bug was "fixed" and the CVE released, we are still exposed to
>>> the bug because *no filesystem people knew about it* and *nobody wrote a
>>> regression test* to check that the probelm was fixed and stayed
>>> fixed.
>>>
>>> And now that XFS does call inode_init_owner(), we've subsequently
>>> discovered that XFS still fail when default acls are enabled because
>>> we create the ACL from the mode passed from the VFS, not the
>>> stripped mode that results from inode_init_owner() being called.
>>>
>>> See what I mean about landmines?
>>>
>>> The fact is this: regardless of which filesystem is in use, failure
>>> to strip the SGID correctly is considered a security failure that
>>> needs to be fixed. The current VFS infrastructure requires the
>>> filesystem to do everything right and not step on any landmines to
>>> strip the SGID bit, when in fact it can easily be done at the VFS
>>> and the filesystems then don't even need to be aware that the SGID
>>> needs to be (or has been stripped) by the operation the user asked
>>> to be done.
>>>
>>> We need the architecture to be *secure by design*, not tacked onto
>>> the side like it is now.  We need to stop trying to dance around
>>> these landmines - it is *not working* and we are blowing our own
>>> feet off repeatedly. This hurts a lot (especially in distro land)
>>> so we need to take the responsibility for stripping SGID properly
>>> away from the filesystems and put it where it belongs: in the VFS.
>>
>> I agree. When I added tests for set*id stripping to xfstests for the
>> sake of getting complete vfs coverage of idmapped mounts in generic/633
>> I immediately found bugs. Once I made the testsuite useable by all
>> filesystems we started seeing more.
>>
>> I think we should add and use the new proposed stripping helper in the
>> vfs - albeit with a slightly changed api and also use it in
>> inode_init_owner(). While it is a delicate change in the worst case we
>> end up removing additional privileges that's an acceptable regression
>> risk to take.
>
> And if it's not too much trouble, can we add an fstest to encode our
> current expectations about how setgid inheritance works?  I would really
> like to reduce the need for historic setgid behavior spelunking. ;)
I have sent two patches to increase the idmapped mounts coverage for 
S_ISGID in xfstests.

Best Regards
Yang Xu

>
> --D

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-03-31  9:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-28  9:56 [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Yang Xu
2022-03-28  9:56 ` [PATCH v1 2/3] vfs: strip file's S_ISGID mode on vfs instead of on filesystem Yang Xu
2022-03-29 11:08   ` Christian Brauner
2022-03-29 11:12   ` Jeff Layton
2022-03-29 22:10     ` Dave Chinner
2022-03-30 10:44       ` Christian Brauner
2022-03-30 16:44         ` Darrick J. Wong
2022-03-31  9:30           ` xuyang2018.jy
2022-03-29 10:45 ` [PATCH v1 1/3] vfs: Add inode_sgid_strip() api Christian Brauner
2022-03-29 22:15   ` Dave Chinner
2022-03-30 16:34 ` Darrick J. Wong
2022-03-31  1:49   ` xuyang2018.jy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.