All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 11:11 ` [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
@ 2022-04-26 10:38   ` Christian Brauner
  2022-04-26 11:20     ` Amir Goldstein
  2022-04-26 15:52   ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-26 10:38 UTC (permalink / raw)
  To: Yang Xu
  Cc: linux-fsdevel, ceph-devel, viro, david, djwong, willy, jlayton,
	Amir Goldstein

On Tue, Apr 26, 2022 at 07:11:29PM +0800, Yang Xu wrote:
> Creating files that have both the S_IXGRP and S_ISGID bit raised in
> directories that themselves have the S_ISGID bit set requires additional
> privileges to avoid security issues.
> 
> When a filesystem creates a new inode it needs to take care that the
> caller is either in the group of the newly created inode or they have
> CAP_FSETID in their current user namespace and are privileged over the
> parent directory of the new inode. If any of these two conditions is
> true then the S_ISGID bit can be raised for an S_IXGRP file and if not
> it needs to be stripped.
> 
> However, there are several key issues with the current state of things:
> 
> * The S_ISGID stripping logic is entangled with umask stripping.
> 
>   If a filesystem doesn't support or enable POSIX ACLs then umask
>   stripping is done directly in the vfs before calling into the
>   filesystem.
>   If the filesystem does support POSIX ACLs then unmask stripping may be
>   done in the filesystem itself when calling posix_acl_create().
> 
> * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
>   stripping logic.
> 
>   While that may be intentional (e.g. network filesystems might just
>   defer setgid stripping to a server) it is often just a security issue.
> 
> * The first two points taken together mean that there's a
>   non-standardized ordering between setgid stripping in
>   inode_init_owner() and posix_acl_create() both on the vfs level and
>   the filesystem level. The latter part is especially problematic since
>   each filesystem is technically free to order inode_init_owner() and
>   posix_acl_create() however it sees fit meaning that S_ISGID
>   inheritance might or might not be applied.
> 
> * We do still have bugs in this areas years after the initial round of
>   setgid bugfixes.
> 
> So the current state is quite messy and while we won't be able to make
> it completely clean as posix_acl_create() is still a filesystem specific
> call we can improve the S_SIGD stripping situation quite a bit by
> hoisting it out of inode_init_owner() and into the vfs creation
> operations. This means we alleviate the burden for filesystems to handle
> S_ISGID stripping correctly and can standardize the ordering between
> S_ISGID and umask stripping in the vfs.
> 
> The S_ISGID bit is stripped before any umask is applied. This has the
> advantage that the ordering is unaffected by whether umask stripping is
> done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
> the filesystem in posix_acl_create() (if POSIX ACLs are supported).
> 
> To this end a new helper vfs_prepare_mode() is added which calls the
> previously added mode_strip_setgid() helper and strips the umask
> afterwards.
> 
> All inode operations that create new filesystem objects have been
> updated to call vfs_prepare_mode() before passing the mode into the
> relevant inode operation of the filesystems. Care has been taken to
> ensure that the mode passed to the security hooks is the mode that is
> seen by the filesystem.
> 
> Following is an overview of the filesystem specific and inode operations
> specific implications:
> 
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> 
> All of the above filesystems end up calling inode_init_owner() when new
> filesystem objects are created through the following ->mkdir(),
> ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
> operations.
> 
> Since directories always inherit the S_ISGID bit with the exception of
> xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
> apply. The ->symlink() inode operation trivially inherit the mode from
> the target and the ->rename() inode operation inherits the mode from the
> source inode.
> 
> All other inode operations will have the S_ISGID bit stripped once in
> vfs_prepare_mode() before.
> 
> In addition to this there are filesystems which allow the creation of
> filesystem objects through ioctl()s or - in the case of spufs -
> circumventing the vfs in other ways. If filesystem objects are created
> through ioctl()s the vfs doesn't know about it and can't apply regular
> permission checking including S_ISGID logic. Therfore, a filesystem
> relying on S_ISGID stripping in inode_init_owner() in their ioctl()
> callpath will be affected by moving this logic into the vfs.
> 
> So we did our best to audit all filesystems in this regard:
> 
> * btrfs allows the creation of filesystem objects through various
>   ioctls(). Snapshot creation literally takes a snapshot and so the mode
>   is fully preserved and S_ISGID stripping doesn't apply.
> 
>   Creating a new subvolum relies on inode_init_owner() in
>   btrfs_new_inode() but only creates directories and doesn't raise
>   S_ISGID.
> 
> * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
>   xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
>   the actual extents ocfs2 uses a separate ioctl() that also creates the
>   target file.
> 
>   Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
>   inode_init_owner() to strip the S_ISGID bit. This is the only place
>   where a filesystem needs to call mode_strip_sgid() directly but this
>   is self-inflicted pain tbh.
> 
> * spufs doesn't go through the vfs at all and doesn't use ioctl()s
>   either. Instead it has a dedicated system call spufs_create() which
>   allows the creation of filesystem objects. But spufs only creates
>   directories and doesn't allo S_SIGID bits, i.e. it specifically only
>   allows 0777 bits.
> 
> * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
> 
> While we did our best to audit everything there's a risk of regressions
> in here. However, for the sake of maintenance and given that we've seen
> a range of bugs years after S_ISGID inheritance issues were fixed (see
> [1]-[3]) the risk seems worth taking. In the worst case we will have to
> revert.
> 
> Associated with this change is a new set of fstests to enforce the
> semantics for all new filesystems.
> 
> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---

Thanks for using my commit message!

One thing that I just remembered and which I think I haven't mentioned
so far is that moving S_ISGID stripping from filesystem callpaths into
the vfs callpaths means that we're hoisting this logic out of vfs_*()
helpers implicitly.

So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
stripping being done in vfs_*() helpers anymore unless they pass the
mode on from a prior run through the vfs.

This mostly affects overlayfs which calls vfs_*() functions directly. So
a typical overlayfs callstack would be (roughly - I'm omw to lunch):

sys_mknod()
-> do_mknodat(mode) // calls vfs_prepare_mode()
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

I think we are safe as overlayfs passes on the mode on from its own run
through the vfs and then via vfs_*() to the underlying filesystem but it
is worth point that out.

Ccing Amir just for confirmation.

>  fs/inode.c         |  2 --
>  fs/namei.c         | 22 +++++++++-------------
>  fs/ocfs2/namei.c   |  1 +
>  include/linux/fs.h | 11 +++++++++++
>  4 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e9a5f2ec2f89..dd357f4b556d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else
> -			mode = mode_strip_sgid(mnt_userns, dir, mode);
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> diff --git a/fs/namei.c b/fs/namei.c
> index 73646e28fae0..5dbf00704ae8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (open_flag & O_CREAT) {
>  		if (open_flag & O_EXCL)
>  			open_flag &= ~O_TRUNC;
> -		if (!IS_POSIXACL(dir->d_inode))
> -			mode &= ~current_umask();
> +		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
>  		if (likely(got_write))
>  			create_error = may_o_create(mnt_userns, &nd->path,
>  						    dentry, mode);
> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>  	child = d_alloc(dentry, &slash_name);
>  	if (unlikely(!child))
>  		goto out_err;
> -	if (!IS_POSIXACL(dir))
> -		mode &= ~current_umask();
> +	mode = vfs_prepare_mode(mnt_userns, dir, mode);
>  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>  	if (error)
>  		goto out_err;
> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>  	if (IS_ERR(dentry))
>  		goto out1;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mknod(&path, dentry, mode, dev);
>  	if (error)
>  		goto out2;
>  
> -	mnt_userns = mnt_user_ns(path.mnt);
>  	switch (mode & S_IFMT) {
>  		case 0: case S_IFREG:
>  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +	struct user_namespace *mnt_userns;
>  
>  retry:
>  	dentry = filename_create(dfd, name, &path, lookup_flags);
> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	if (IS_ERR(dentry))
>  		goto out_putname;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mkdir(&path, dentry, mode);
> -	if (!error) {
> -		struct user_namespace *mnt_userns;
> -		mnt_userns = mnt_user_ns(path.mnt);
> +	if (!error)
>  		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>  				  mode);
> -	}
> +
>  	done_path_create(&path, dentry);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index c75fd54b9185..961d1cf54388 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>  	 * callers. */
>  	if (S_ISDIR(mode))
>  		set_nlink(inode, 2);
> +	mode = mode_strip_sgid(&init_user_ns, dir, mode);
>  	inode_init_owner(&init_user_ns, inode, dir, mode);
>  	status = dquot_initialize(inode);
>  	if (status)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 98b44a2732f5..914c8f28bb02 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
>  	return !IS_DEADDIR(inode);
>  }
>  
> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> +				   const struct inode *dir, umode_t mode)
> +{
> +	mode = mode_strip_sgid(mnt_userns, dir, mode);
> +
> +	if (!IS_POSIXACL(dir))
> +		mode &= ~current_umask();
> +
> +	return mode;
> +}
> +
>  extern bool path_noexec(const struct path *path);
>  extern void inode_nohighmem(struct inode *inode);
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 1/4] fs: add mode_strip_sgid() helper
@ 2022-04-26 11:11 Yang Xu
  2022-04-26 11:11 ` [PATCH v8 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Yang Xu @ 2022-04-26 11:11 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Add a dedicated helper to handle the setgid bit when creating a new file
in a setgid directory. This is a preparatory patch for moving setgid
stripping into the vfs. The patch contains no functional changes.

Currently the setgid stripping logic is open-coded directly in
inode_init_owner() and the individual filesystems are responsible for
handling setgid inheritance. Since this has proven to be brittle as
evidenced by old issues we uncovered over the last months (see [1] to
[3] below) we will try to move this logic into the vfs.

Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/inode.c         | 37 +++++++++++++++++++++++++++++++++----
 include/linux/fs.h |  2 ++
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..e9a5f2ec2f89 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
-			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
-			mode &= ~S_ISGID;
+		else
+			mode = mode_strip_sgid(mnt_userns, dir, mode);
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
@@ -2405,3 +2403,34 @@ struct timespec64 current_time(struct inode *inode)
 	return timestamp_truncate(now, inode);
 }
 EXPORT_SYMBOL(current_time);
+
+/**
+ * mode_strip_sgid - handle the sgid bit for non-directories
+ * @mnt_userns: User namespace of the mount the inode was created from
+ * @dir: parent directory inode
+ * @mode: mode of the file to be created in @dir
+ *
+ * If the @mode of the new file has both the S_ISGID and S_IXGRP bit
+ * raised and @dir has the S_ISGID bit raised ensure that the caller is
+ * either in the group of the parent directory or they have CAP_FSETID
+ * in their user namespace and are privileged over the parent directory.
+ * In all other cases, strip the S_ISGID bit from @mode.
+ *
+ * Return: the new mode to use for the file
+ */
+umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
+			 const struct inode *dir, umode_t mode)
+{
+	if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID))
+		return mode;
+	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
+		return mode;
+	if (in_group_p(i_gid_into_mnt(mnt_userns, dir)))
+		return mode;
+	if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
+		return mode;
+
+	mode &= ~S_ISGID;
+	return mode;
+}
+EXPORT_SYMBOL(mode_strip_sgid);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..98b44a2732f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
 void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		      const struct inode *dir, umode_t mode);
 extern bool may_open_dev(const struct path *path);
+umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
+			 const struct inode *dir, umode_t mode);
 
 /*
  * This is the "filldir" function type, used by readdir() to let
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 2/4] fs: Add missing umask strip in vfs_tmpfile
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
@ 2022-04-26 11:11 ` Yang Xu
  2022-04-26 11:11 ` [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Yang Xu @ 2022-04-26 11:11 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu, stable

All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.

Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/namei.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 509657fdf4f5..73646e28fae0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3521,6 +3521,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
  2022-04-26 11:11 ` [PATCH v8 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
@ 2022-04-26 11:11 ` Yang Xu
  2022-04-26 10:38   ` Christian Brauner
  2022-04-26 15:52   ` Christian Brauner
  2022-04-26 11:11 ` [PATCH v8 4/4] ceph: rely on vfs for setgid stripping Yang Xu
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Yang Xu @ 2022-04-26 11:11 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current state of things:

* The S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

* The first two points taken together mean that there's a
  non-standardized ordering between setgid stripping in
  inode_init_owner() and posix_acl_create() both on the vfs level and
  the filesystem level. The latter part is especially problematic since
  each filesystem is technically free to order inode_init_owner() and
  posix_acl_create() however it sees fit meaning that S_ISGID
  inheritance might or might not be applied.

* We do still have bugs in this areas years after the initial round of
  setgid bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

The S_ISGID bit is stripped before any umask is applied. This has the
advantage that the ordering is unaffected by whether umask stripping is
done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
the filesystem in posix_acl_create() (if POSIX ACLs are supported).

To this end a new helper vfs_prepare_mode() is added which calls the
previously added mode_strip_setgid() helper and strips the umask
afterwards.

All inode operations that create new filesystem objects have been
updated to call vfs_prepare_mode() before passing the mode into the
relevant inode operation of the filesystems. Care has been taken to
ensure that the mode passed to the security hooks is the mode that is
seen by the filesystem.

Following is an overview of the filesystem specific and inode operations
specific implications:

arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the following ->mkdir(),
->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() inode operation trivially inherit the mode from
the target and the ->rename() inode operation inherits the mode from the
source inode.

All other inode operations will have the S_ISGID bit stripped once in
vfs_prepare_mode() before.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs.

So we did our best to audit all filesystems in this regard:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_inode() but only creates directories and doesn't raise
  S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain tbh.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

While we did our best to audit everything there's a risk of regressions
in here. However, for the sake of maintenance and given that we've seen
a range of bugs years after S_ISGID inheritance issues were fixed (see
[1]-[3]) the risk seems worth taking. In the worst case we will have to
revert.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/inode.c         |  2 --
 fs/namei.c         | 22 +++++++++-------------
 fs/ocfs2/namei.c   |  1 +
 include/linux/fs.h | 11 +++++++++++
 4 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e9a5f2ec2f89..dd357f4b556d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else
-			mode = mode_strip_sgid(mnt_userns, dir, mode);
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
diff --git a/fs/namei.c b/fs/namei.c
index 73646e28fae0..5dbf00704ae8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		if (!IS_POSIXACL(dir->d_inode))
-			mode &= ~current_umask();
+		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(mnt_userns, &nd->path,
 						    dentry, mode);
@@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	if (!IS_POSIXACL(dir))
-		mode &= ~current_umask();
+	mode = vfs_prepare_mode(mnt_userns, dir, mode);
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	if (IS_ERR(dentry))
 		goto out1;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
 		goto out2;
 
-	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
@@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct user_namespace *mnt_userns;
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
@@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mkdir(&path, dentry, mode);
-	if (!error) {
-		struct user_namespace *mnt_userns;
-		mnt_userns = mnt_user_ns(path.mnt);
+	if (!error)
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
-	}
+
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c75fd54b9185..961d1cf54388 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
 	 * callers. */
 	if (S_ISDIR(mode))
 		set_nlink(inode, 2);
+	mode = mode_strip_sgid(&init_user_ns, dir, mode);
 	inode_init_owner(&init_user_ns, inode, dir, mode);
 	status = dquot_initialize(inode);
 	if (status)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b44a2732f5..914c8f28bb02 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
+static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
+				   const struct inode *dir, umode_t mode)
+{
+	mode = mode_strip_sgid(mnt_userns, dir, mode);
+
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
+
+	return mode;
+}
+
 extern bool path_noexec(const struct path *path);
 extern void inode_nohighmem(struct inode *inode);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 4/4] ceph: rely on vfs for setgid stripping
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
  2022-04-26 11:11 ` [PATCH v8 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
  2022-04-26 11:11 ` [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
@ 2022-04-26 11:11 ` Yang Xu
  2022-04-26 14:52 ` [PATCH v8 1/4] fs: add mode_strip_sgid() helper Jeff Layton
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Yang Xu @ 2022-04-26 11:11 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Now that we finished moving setgid stripping for regular files in setgid
directories into the vfs, individual filesystem don't need to manually
strip the setgid bit anymore. Drop the now unneeded code from ceph.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft)<brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/ceph/file.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 6c9e837aa1d3..8e3b99853333 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -651,10 +651,6 @@ static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
 		/* Directories always inherit the setgid bit. */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(dir->i_gid) &&
-			 !capable_wrt_inode_uidgid(&init_user_ns, dir, CAP_FSETID))
-			mode &= ~S_ISGID;
 	} else {
 		in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
 	}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 10:38   ` Christian Brauner
@ 2022-04-26 11:20     ` Amir Goldstein
  2022-04-26 11:52       ` Miklos Szeredi
  0 siblings, 1 reply; 29+ messages in thread
From: Amir Goldstein @ 2022-04-26 11:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Yang Xu, linux-fsdevel, ceph-devel, Al Viro, Dave Chinner,
	Darrick J. Wong, Matthew Wilcox, Jeff Layton, Miklos Szeredi

On Tue, Apr 26, 2022 at 1:38 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Apr 26, 2022 at 07:11:29PM +0800, Yang Xu wrote:
> > Creating files that have both the S_IXGRP and S_ISGID bit raised in
> > directories that themselves have the S_ISGID bit set requires additional
> > privileges to avoid security issues.
> >
> > When a filesystem creates a new inode it needs to take care that the
> > caller is either in the group of the newly created inode or they have
> > CAP_FSETID in their current user namespace and are privileged over the
> > parent directory of the new inode. If any of these two conditions is
> > true then the S_ISGID bit can be raised for an S_IXGRP file and if not
> > it needs to be stripped.
> >
> > However, there are several key issues with the current state of things:
> >
> > * The S_ISGID stripping logic is entangled with umask stripping.
> >
> >   If a filesystem doesn't support or enable POSIX ACLs then umask
> >   stripping is done directly in the vfs before calling into the
> >   filesystem.
> >   If the filesystem does support POSIX ACLs then unmask stripping may be
> >   done in the filesystem itself when calling posix_acl_create().
> >
> > * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
> >   stripping logic.
> >
> >   While that may be intentional (e.g. network filesystems might just
> >   defer setgid stripping to a server) it is often just a security issue.
> >
> > * The first two points taken together mean that there's a
> >   non-standardized ordering between setgid stripping in
> >   inode_init_owner() and posix_acl_create() both on the vfs level and
> >   the filesystem level. The latter part is especially problematic since
> >   each filesystem is technically free to order inode_init_owner() and
> >   posix_acl_create() however it sees fit meaning that S_ISGID
> >   inheritance might or might not be applied.
> >
> > * We do still have bugs in this areas years after the initial round of
> >   setgid bugfixes.
> >
> > So the current state is quite messy and while we won't be able to make
> > it completely clean as posix_acl_create() is still a filesystem specific
> > call we can improve the S_SIGD stripping situation quite a bit by
> > hoisting it out of inode_init_owner() and into the vfs creation
> > operations. This means we alleviate the burden for filesystems to handle
> > S_ISGID stripping correctly and can standardize the ordering between
> > S_ISGID and umask stripping in the vfs.
> >
> > The S_ISGID bit is stripped before any umask is applied. This has the
> > advantage that the ordering is unaffected by whether umask stripping is
> > done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
> > the filesystem in posix_acl_create() (if POSIX ACLs are supported).
> >
> > To this end a new helper vfs_prepare_mode() is added which calls the
> > previously added mode_strip_setgid() helper and strips the umask
> > afterwards.
> >
> > All inode operations that create new filesystem objects have been
> > updated to call vfs_prepare_mode() before passing the mode into the
> > relevant inode operation of the filesystems. Care has been taken to
> > ensure that the mode passed to the security hooks is the mode that is
> > seen by the filesystem.
> >
> > Following is an overview of the filesystem specific and inode operations
> > specific implications:
> >
> > arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> > arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> > fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> > fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> > fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> > fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> > fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> > fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> > fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> > fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> > fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> > fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> > fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> > fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> > fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> > fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> > kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> > mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> >
> > All of the above filesystems end up calling inode_init_owner() when new
> > filesystem objects are created through the following ->mkdir(),
> > ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
> > operations.
> >
> > Since directories always inherit the S_ISGID bit with the exception of
> > xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
> > apply. The ->symlink() inode operation trivially inherit the mode from
> > the target and the ->rename() inode operation inherits the mode from the
> > source inode.
> >
> > All other inode operations will have the S_ISGID bit stripped once in
> > vfs_prepare_mode() before.
> >
> > In addition to this there are filesystems which allow the creation of
> > filesystem objects through ioctl()s or - in the case of spufs -
> > circumventing the vfs in other ways. If filesystem objects are created
> > through ioctl()s the vfs doesn't know about it and can't apply regular
> > permission checking including S_ISGID logic. Therfore, a filesystem
> > relying on S_ISGID stripping in inode_init_owner() in their ioctl()
> > callpath will be affected by moving this logic into the vfs.
> >
> > So we did our best to audit all filesystems in this regard:
> >
> > * btrfs allows the creation of filesystem objects through various
> >   ioctls(). Snapshot creation literally takes a snapshot and so the mode
> >   is fully preserved and S_ISGID stripping doesn't apply.
> >
> >   Creating a new subvolum relies on inode_init_owner() in
> >   btrfs_new_inode() but only creates directories and doesn't raise
> >   S_ISGID.
> >
> > * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
> >   xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
> >   the actual extents ocfs2 uses a separate ioctl() that also creates the
> >   target file.
> >
> >   Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
> >   inode_init_owner() to strip the S_ISGID bit. This is the only place
> >   where a filesystem needs to call mode_strip_sgid() directly but this
> >   is self-inflicted pain tbh.
> >
> > * spufs doesn't go through the vfs at all and doesn't use ioctl()s
> >   either. Instead it has a dedicated system call spufs_create() which
> >   allows the creation of filesystem objects. But spufs only creates
> >   directories and doesn't allo S_SIGID bits, i.e. it specifically only
> >   allows 0777 bits.
> >
> > * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
> >
> > While we did our best to audit everything there's a risk of regressions
> > in here. However, for the sake of maintenance and given that we've seen
> > a range of bugs years after S_ISGID inheritance issues were fixed (see
> > [1]-[3]) the risk seems worth taking. In the worst case we will have to
> > revert.
> >
> > Associated with this change is a new set of fstests to enforce the
> > semantics for all new filesystems.
> >
> > Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
> > Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
> > Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> > ---
>
> Thanks for using my commit message!
>
> One thing that I just remembered and which I think I haven't mentioned
> so far is that moving S_ISGID stripping from filesystem callpaths into
> the vfs callpaths means that we're hoisting this logic out of vfs_*()
> helpers implicitly.
>
> So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
> stripping being done in vfs_*() helpers anymore unless they pass the
> mode on from a prior run through the vfs.
>
> This mostly affects overlayfs which calls vfs_*() functions directly. So
> a typical overlayfs callstack would be (roughly - I'm omw to lunch):
>
> sys_mknod()
> -> do_mknodat(mode) // calls vfs_prepare_mode()
>    -> .mknod = ovl_mknod(mode)
>       -> ovl_create(mode)
>          -> vfs_mknod(mode)
>
> I think we are safe as overlayfs passes on the mode on from its own run
> through the vfs and then via vfs_*() to the underlying filesystem but it
> is worth point that out.
>
> Ccing Amir just for confirmation.

Looks fine to me, but CC Miklos ...

Thanks,
Amir.

>
> >  fs/inode.c         |  2 --
> >  fs/namei.c         | 22 +++++++++-------------
> >  fs/ocfs2/namei.c   |  1 +
> >  include/linux/fs.h | 11 +++++++++++
> >  4 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index e9a5f2ec2f89..dd357f4b556d 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> >               /* Directories are special, and always inherit S_ISGID */
> >               if (S_ISDIR(mode))
> >                       mode |= S_ISGID;
> > -             else
> > -                     mode = mode_strip_sgid(mnt_userns, dir, mode);
> >       } else
> >               inode_fsgid_set(inode, mnt_userns);
> >       inode->i_mode = mode;
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 73646e28fae0..5dbf00704ae8 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >       if (open_flag & O_CREAT) {
> >               if (open_flag & O_EXCL)
> >                       open_flag &= ~O_TRUNC;
> > -             if (!IS_POSIXACL(dir->d_inode))
> > -                     mode &= ~current_umask();
> > +             mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
> >               if (likely(got_write))
> >                       create_error = may_o_create(mnt_userns, &nd->path,
> >                                                   dentry, mode);
> > @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> >       child = d_alloc(dentry, &slash_name);
> >       if (unlikely(!child))
> >               goto out_err;
> > -     if (!IS_POSIXACL(dir))
> > -             mode &= ~current_umask();
> > +     mode = vfs_prepare_mode(mnt_userns, dir, mode);
> >       error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> >       if (error)
> >               goto out_err;
> > @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> >       if (IS_ERR(dentry))
> >               goto out1;
> >
> > -     if (!IS_POSIXACL(path.dentry->d_inode))
> > -             mode &= ~current_umask();
> > +     mnt_userns = mnt_user_ns(path.mnt);
> > +     mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >       error = security_path_mknod(&path, dentry, mode, dev);
> >       if (error)
> >               goto out2;
> >
> > -     mnt_userns = mnt_user_ns(path.mnt);
> >       switch (mode & S_IFMT) {
> >               case 0: case S_IFREG:
> >                       error = vfs_create(mnt_userns, path.dentry->d_inode,
> > @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >       struct path path;
> >       int error;
> >       unsigned int lookup_flags = LOOKUP_DIRECTORY;
> > +     struct user_namespace *mnt_userns;
> >
> >  retry:
> >       dentry = filename_create(dfd, name, &path, lookup_flags);
> > @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >       if (IS_ERR(dentry))
> >               goto out_putname;
> >
> > -     if (!IS_POSIXACL(path.dentry->d_inode))
> > -             mode &= ~current_umask();
> > +     mnt_userns = mnt_user_ns(path.mnt);
> > +     mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >       error = security_path_mkdir(&path, dentry, mode);
> > -     if (!error) {
> > -             struct user_namespace *mnt_userns;
> > -             mnt_userns = mnt_user_ns(path.mnt);
> > +     if (!error)
> >               error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
> >                                 mode);
> > -     }
> > +
> >       done_path_create(&path, dentry);
> >       if (retry_estale(error, lookup_flags)) {
> >               lookup_flags |= LOOKUP_REVAL;
> > diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> > index c75fd54b9185..961d1cf54388 100644
> > --- a/fs/ocfs2/namei.c
> > +++ b/fs/ocfs2/namei.c
> > @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
> >        * callers. */
> >       if (S_ISDIR(mode))
> >               set_nlink(inode, 2);
> > +     mode = mode_strip_sgid(&init_user_ns, dir, mode);
> >       inode_init_owner(&init_user_ns, inode, dir, mode);
> >       status = dquot_initialize(inode);
> >       if (status)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 98b44a2732f5..914c8f28bb02 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
> >       return !IS_DEADDIR(inode);
> >  }
> >
> > +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> > +                                const struct inode *dir, umode_t mode)
> > +{
> > +     mode = mode_strip_sgid(mnt_userns, dir, mode);
> > +
> > +     if (!IS_POSIXACL(dir))
> > +             mode &= ~current_umask();
> > +
> > +     return mode;
> > +}
> > +
> >  extern bool path_noexec(const struct path *path);
> >  extern void inode_nohighmem(struct inode *inode);
> >
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 11:20     ` Amir Goldstein
@ 2022-04-26 11:52       ` Miklos Szeredi
  2022-04-26 14:53         ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2022-04-26 11:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Yang Xu, linux-fsdevel, ceph-devel, Al Viro,
	Dave Chinner, Darrick J. Wong, Matthew Wilcox, Jeff Layton

On Tue, 26 Apr 2022 at 13:21, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Apr 26, 2022 at 1:38 PM Christian Brauner <brauner@kernel.org> wrote:

> > One thing that I just remembered and which I think I haven't mentioned
> > so far is that moving S_ISGID stripping from filesystem callpaths into
> > the vfs callpaths means that we're hoisting this logic out of vfs_*()
> > helpers implicitly.
> >
> > So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
> > stripping being done in vfs_*() helpers anymore unless they pass the
> > mode on from a prior run through the vfs.
> >
> > This mostly affects overlayfs which calls vfs_*() functions directly. So
> > a typical overlayfs callstack would be (roughly - I'm omw to lunch):
> >
> > sys_mknod()
> > -> do_mknodat(mode) // calls vfs_prepare_mode()
> >    -> .mknod = ovl_mknod(mode)
> >       -> ovl_create(mode)
> >          -> vfs_mknod(mode)
> >
> > I think we are safe as overlayfs passes on the mode on from its own run
> > through the vfs and then via vfs_*() to the underlying filesystem but it
> > is worth point that out.
> >
> > Ccing Amir just for confirmation.
>
> Looks fine to me, but CC Miklos ...

Looks fine to me as well.  Overlayfs should share the mode (including
the suid and sgid bits), owner, group and ACL's with the underlying
filesystem, so clearing sgid based on overlay parent directory should
result in the same mode as if it was done based on the parent
directory on the underlying layer.

AFAIU this logic is not affected by userns or mnt_userns, but
Christian would be best to confirm that.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
                   ` (2 preceding siblings ...)
  2022-04-26 11:11 ` [PATCH v8 4/4] ceph: rely on vfs for setgid stripping Yang Xu
@ 2022-04-26 14:52 ` Jeff Layton
  2022-04-27  1:34   ` xuyang2018.jy
  2022-04-28  1:59 ` Al Viro
  2022-04-28  4:40 ` Al Viro
  5 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2022-04-26 14:52 UTC (permalink / raw)
  To: Yang Xu, linux-fsdevel, ceph-devel; +Cc: viro, david, djwong, brauner, willy

On Tue, 2022-04-26 at 19:11 +0800, Yang Xu wrote:
> Add a dedicated helper to handle the setgid bit when creating a new file
> in a setgid directory. This is a preparatory patch for moving setgid
> stripping into the vfs. The patch contains no functional changes.
> 
> Currently the setgid stripping logic is open-coded directly in
> inode_init_owner() and the individual filesystems are responsible for
> handling setgid inheritance. Since this has proven to be brittle as
> evidenced by old issues we uncovered over the last months (see [1] to
> [3] below) we will try to move this logic into the vfs.
> 
> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
>  fs/inode.c         | 37 +++++++++++++++++++++++++++++++++----
>  include/linux/fs.h |  2 ++
>  2 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 9d9b422504d1..e9a5f2ec2f89 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
> -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
> -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> -			mode &= ~S_ISGID;
> +		else
> +			mode = mode_strip_sgid(mnt_userns, dir, mode);
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> @@ -2405,3 +2403,34 @@ struct timespec64 current_time(struct inode *inode)
>  	return timestamp_truncate(now, inode);
>  }
>  EXPORT_SYMBOL(current_time);
> +
> +/**
> + * mode_strip_sgid - handle the sgid bit for non-directories
> + * @mnt_userns: User namespace of the mount the inode was created from
> + * @dir: parent directory inode
> + * @mode: mode of the file to be created in @dir
> + *
> + * If the @mode of the new file has both the S_ISGID and S_IXGRP bit
> + * raised and @dir has the S_ISGID bit raised ensure that the caller is
> + * either in the group of the parent directory or they have CAP_FSETID
> + * in their user namespace and are privileged over the parent directory.
> + * In all other cases, strip the S_ISGID bit from @mode.
> + *
> + * Return: the new mode to use for the file
> + */
> +umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
> +			 const struct inode *dir, umode_t mode)
> +{
> +	if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID))
> +		return mode;
> +	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
> +		return mode;
> +	if (in_group_p(i_gid_into_mnt(mnt_userns, dir)))
> +		return mode;
> +	if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> +		return mode;
> +
> +	mode &= ~S_ISGID;
> +	return mode;
> +}
> +EXPORT_SYMBOL(mode_strip_sgid);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bbde95387a23..98b44a2732f5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1897,6 +1897,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
>  void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		      const struct inode *dir, umode_t mode);
>  extern bool may_open_dev(const struct path *path);
> +umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
> +			 const struct inode *dir, umode_t mode);
>  
>  /*
>   * This is the "filldir" function type, used by readdir() to let

This series looks like a nice cleanup. I went ahead and added this pile
to another kernel I was testing with xfstests and it seemed to do fine.

You can add this (or some variant of it) to all 4 patches.

Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 11:52       ` Miklos Szeredi
@ 2022-04-26 14:53         ` Christian Brauner
  2022-04-27  9:22           ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-26 14:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, Yang Xu, linux-fsdevel, ceph-devel, Al Viro,
	Dave Chinner, Darrick J. Wong, Matthew Wilcox, Jeff Layton

On Tue, Apr 26, 2022 at 01:52:11PM +0200, Miklos Szeredi wrote:
> On Tue, 26 Apr 2022 at 13:21, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Apr 26, 2022 at 1:38 PM Christian Brauner <brauner@kernel.org> wrote:
> 
> > > One thing that I just remembered and which I think I haven't mentioned
> > > so far is that moving S_ISGID stripping from filesystem callpaths into
> > > the vfs callpaths means that we're hoisting this logic out of vfs_*()
> > > helpers implicitly.
> > >
> > > So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
> > > stripping being done in vfs_*() helpers anymore unless they pass the
> > > mode on from a prior run through the vfs.
> > >
> > > This mostly affects overlayfs which calls vfs_*() functions directly. So
> > > a typical overlayfs callstack would be (roughly - I'm omw to lunch):
> > >
> > > sys_mknod()
> > > -> do_mknodat(mode) // calls vfs_prepare_mode()
> > >    -> .mknod = ovl_mknod(mode)
> > >       -> ovl_create(mode)
> > >          -> vfs_mknod(mode)
> > >
> > > I think we are safe as overlayfs passes on the mode on from its own run
> > > through the vfs and then via vfs_*() to the underlying filesystem but it
> > > is worth point that out.
> > >
> > > Ccing Amir just for confirmation.
> >
> > Looks fine to me, but CC Miklos ...
> 
> Looks fine to me as well.  Overlayfs should share the mode (including
> the suid and sgid bits), owner, group and ACL's with the underlying
> filesystem, so clearing sgid based on overlay parent directory should
> result in the same mode as if it was done based on the parent
> directory on the underlying layer.

Ah yes, good point.

> 
> AFAIU this logic is not affected by userns or mnt_userns, but
> Christian would be best to confirm that.

It does depend on it as S_ISGID stripping requires knowledge about
whether the caller has CAP_FSETID and is capable over the parent
directory or if they are in the group the file is owned by.

I think ultimately it might just come down to moving vfs_prepare_mode()
into vfs_*() helpers and not into the do_*at() helpers.

That would be cleaner anyway as right now we have this weird disconnect
between vfs_tmpfile() and vfs_{create,mknod,mkdir}(). IOW, vfs_tmpfile()
doesn't even have an associated do_*() wrapper where we could call
vfs_prepare_mode() from.

So ultimately it might be nicer if we do it in vfs_*() helpers anyway.

The less pretty thing about it will be that the security_path_*() hooks
also want a mode.

Right now these hooks receive the mode as it's passed in from userspace
minus umask but before S_ISGID stripping happens.

Whereas I think they should really see what the filesystem sees and
currently it's a bug that they see something else.

I need to think about this a bit.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 11:11 ` [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
  2022-04-26 10:38   ` Christian Brauner
@ 2022-04-26 15:52   ` Christian Brauner
  2022-04-27  1:21     ` xuyang2018.jy
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-26 15:52 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, viro, david, djwong, willy, jlayton

On Tue, Apr 26, 2022 at 07:11:29PM +0800, Yang Xu wrote:
> Creating files that have both the S_IXGRP and S_ISGID bit raised in
> directories that themselves have the S_ISGID bit set requires additional
> privileges to avoid security issues.
> 
> When a filesystem creates a new inode it needs to take care that the
> caller is either in the group of the newly created inode or they have
> CAP_FSETID in their current user namespace and are privileged over the
> parent directory of the new inode. If any of these two conditions is
> true then the S_ISGID bit can be raised for an S_IXGRP file and if not
> it needs to be stripped.
> 
> However, there are several key issues with the current state of things:
> 
> * The S_ISGID stripping logic is entangled with umask stripping.
> 
>   If a filesystem doesn't support or enable POSIX ACLs then umask
>   stripping is done directly in the vfs before calling into the
>   filesystem.
>   If the filesystem does support POSIX ACLs then unmask stripping may be
>   done in the filesystem itself when calling posix_acl_create().
> 
> * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
>   stripping logic.
> 
>   While that may be intentional (e.g. network filesystems might just
>   defer setgid stripping to a server) it is often just a security issue.
> 
> * The first two points taken together mean that there's a
>   non-standardized ordering between setgid stripping in
>   inode_init_owner() and posix_acl_create() both on the vfs level and
>   the filesystem level. The latter part is especially problematic since
>   each filesystem is technically free to order inode_init_owner() and
>   posix_acl_create() however it sees fit meaning that S_ISGID
>   inheritance might or might not be applied.
> 
> * We do still have bugs in this areas years after the initial round of
>   setgid bugfixes.
> 
> So the current state is quite messy and while we won't be able to make
> it completely clean as posix_acl_create() is still a filesystem specific
> call we can improve the S_SIGD stripping situation quite a bit by
> hoisting it out of inode_init_owner() and into the vfs creation
> operations. This means we alleviate the burden for filesystems to handle
> S_ISGID stripping correctly and can standardize the ordering between
> S_ISGID and umask stripping in the vfs.
> 
> The S_ISGID bit is stripped before any umask is applied. This has the
> advantage that the ordering is unaffected by whether umask stripping is
> done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
> the filesystem in posix_acl_create() (if POSIX ACLs are supported).
> 
> To this end a new helper vfs_prepare_mode() is added which calls the
> previously added mode_strip_setgid() helper and strips the umask
> afterwards.
> 
> All inode operations that create new filesystem objects have been
> updated to call vfs_prepare_mode() before passing the mode into the
> relevant inode operation of the filesystems. Care has been taken to
> ensure that the mode passed to the security hooks is the mode that is
> seen by the filesystem.
> 
> Following is an overview of the filesystem specific and inode operations
> specific implications:
> 
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> 
> All of the above filesystems end up calling inode_init_owner() when new
> filesystem objects are created through the following ->mkdir(),
> ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
> operations.
> 
> Since directories always inherit the S_ISGID bit with the exception of
> xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
> apply. The ->symlink() inode operation trivially inherit the mode from
> the target and the ->rename() inode operation inherits the mode from the
> source inode.
> 
> All other inode operations will have the S_ISGID bit stripped once in
> vfs_prepare_mode() before.
> 
> In addition to this there are filesystems which allow the creation of
> filesystem objects through ioctl()s or - in the case of spufs -
> circumventing the vfs in other ways. If filesystem objects are created
> through ioctl()s the vfs doesn't know about it and can't apply regular
> permission checking including S_ISGID logic. Therfore, a filesystem
> relying on S_ISGID stripping in inode_init_owner() in their ioctl()
> callpath will be affected by moving this logic into the vfs.
> 
> So we did our best to audit all filesystems in this regard:
> 
> * btrfs allows the creation of filesystem objects through various
>   ioctls(). Snapshot creation literally takes a snapshot and so the mode
>   is fully preserved and S_ISGID stripping doesn't apply.
> 
>   Creating a new subvolum relies on inode_init_owner() in
>   btrfs_new_inode() but only creates directories and doesn't raise
>   S_ISGID.
> 
> * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
>   xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
>   the actual extents ocfs2 uses a separate ioctl() that also creates the
>   target file.
> 
>   Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
>   inode_init_owner() to strip the S_ISGID bit. This is the only place
>   where a filesystem needs to call mode_strip_sgid() directly but this
>   is self-inflicted pain tbh.
> 
> * spufs doesn't go through the vfs at all and doesn't use ioctl()s
>   either. Instead it has a dedicated system call spufs_create() which
>   allows the creation of filesystem objects. But spufs only creates
>   directories and doesn't allo S_SIGID bits, i.e. it specifically only
>   allows 0777 bits.
> 
> * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
> 
> While we did our best to audit everything there's a risk of regressions
> in here. However, for the sake of maintenance and given that we've seen
> a range of bugs years after S_ISGID inheritance issues were fixed (see
> [1]-[3]) the risk seems worth taking. In the worst case we will have to
> revert.
> 
> Associated with this change is a new set of fstests to enforce the
> semantics for all new filesystems.
> 
> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
>  fs/inode.c         |  2 --
>  fs/namei.c         | 22 +++++++++-------------
>  fs/ocfs2/namei.c   |  1 +
>  include/linux/fs.h | 11 +++++++++++
>  4 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index e9a5f2ec2f89..dd357f4b556d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else
> -			mode = mode_strip_sgid(mnt_userns, dir, mode);
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> diff --git a/fs/namei.c b/fs/namei.c
> index 73646e28fae0..5dbf00704ae8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (open_flag & O_CREAT) {
>  		if (open_flag & O_EXCL)
>  			open_flag &= ~O_TRUNC;
> -		if (!IS_POSIXACL(dir->d_inode))
> -			mode &= ~current_umask();
> +		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
>  		if (likely(got_write))
>  			create_error = may_o_create(mnt_userns, &nd->path,
>  						    dentry, mode);
> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>  	child = d_alloc(dentry, &slash_name);
>  	if (unlikely(!child))
>  		goto out_err;
> -	if (!IS_POSIXACL(dir))
> -		mode &= ~current_umask();
> +	mode = vfs_prepare_mode(mnt_userns, dir, mode);
>  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>  	if (error)
>  		goto out_err;
> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>  	if (IS_ERR(dentry))
>  		goto out1;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mknod(&path, dentry, mode, dev);
>  	if (error)
>  		goto out2;
>  
> -	mnt_userns = mnt_user_ns(path.mnt);
>  	switch (mode & S_IFMT) {
>  		case 0: case S_IFREG:
>  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +	struct user_namespace *mnt_userns;
>  
>  retry:
>  	dentry = filename_create(dfd, name, &path, lookup_flags);
> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	if (IS_ERR(dentry))
>  		goto out_putname;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mkdir(&path, dentry, mode);
> -	if (!error) {
> -		struct user_namespace *mnt_userns;
> -		mnt_userns = mnt_user_ns(path.mnt);
> +	if (!error)
>  		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>  				  mode);
> -	}
> +
>  	done_path_create(&path, dentry);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index c75fd54b9185..961d1cf54388 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>  	 * callers. */
>  	if (S_ISDIR(mode))
>  		set_nlink(inode, 2);
> +	mode = mode_strip_sgid(&init_user_ns, dir, mode);
>  	inode_init_owner(&init_user_ns, inode, dir, mode);
>  	status = dquot_initialize(inode);
>  	if (status)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 98b44a2732f5..914c8f28bb02 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
>  	return !IS_DEADDIR(inode);
>  }
>  
> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> +				   const struct inode *dir, umode_t mode)

Sorry, since you're only calling the helper in fs/namei.c you don't need
to expose it in fs.h; just keep it local to fs/namei.c.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 15:52   ` Christian Brauner
@ 2022-04-27  1:21     ` xuyang2018.jy
  0 siblings, 0 replies; 29+ messages in thread
From: xuyang2018.jy @ 2022-04-27  1:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, ceph-devel, viro, david, djwong, willy, jlayton

on 2022/4/26 23:52, Christian Brauner wrote:
> On Tue, Apr 26, 2022 at 07:11:29PM +0800, Yang Xu wrote:
>> Creating files that have both the S_IXGRP and S_ISGID bit raised in
>> directories that themselves have the S_ISGID bit set requires additional
>> privileges to avoid security issues.
>>
>> When a filesystem creates a new inode it needs to take care that the
>> caller is either in the group of the newly created inode or they have
>> CAP_FSETID in their current user namespace and are privileged over the
>> parent directory of the new inode. If any of these two conditions is
>> true then the S_ISGID bit can be raised for an S_IXGRP file and if not
>> it needs to be stripped.
>>
>> However, there are several key issues with the current state of things:
>>
>> * The S_ISGID stripping logic is entangled with umask stripping.
>>
>>    If a filesystem doesn't support or enable POSIX ACLs then umask
>>    stripping is done directly in the vfs before calling into the
>>    filesystem.
>>    If the filesystem does support POSIX ACLs then unmask stripping may be
>>    done in the filesystem itself when calling posix_acl_create().
>>
>> * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
>>    stripping logic.
>>
>>    While that may be intentional (e.g. network filesystems might just
>>    defer setgid stripping to a server) it is often just a security issue.
>>
>> * The first two points taken together mean that there's a
>>    non-standardized ordering between setgid stripping in
>>    inode_init_owner() and posix_acl_create() both on the vfs level and
>>    the filesystem level. The latter part is especially problematic since
>>    each filesystem is technically free to order inode_init_owner() and
>>    posix_acl_create() however it sees fit meaning that S_ISGID
>>    inheritance might or might not be applied.
>>
>> * We do still have bugs in this areas years after the initial round of
>>    setgid bugfixes.
>>
>> So the current state is quite messy and while we won't be able to make
>> it completely clean as posix_acl_create() is still a filesystem specific
>> call we can improve the S_SIGD stripping situation quite a bit by
>> hoisting it out of inode_init_owner() and into the vfs creation
>> operations. This means we alleviate the burden for filesystems to handle
>> S_ISGID stripping correctly and can standardize the ordering between
>> S_ISGID and umask stripping in the vfs.
>>
>> The S_ISGID bit is stripped before any umask is applied. This has the
>> advantage that the ordering is unaffected by whether umask stripping is
>> done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
>> the filesystem in posix_acl_create() (if POSIX ACLs are supported).
>>
>> To this end a new helper vfs_prepare_mode() is added which calls the
>> previously added mode_strip_setgid() helper and strips the umask
>> afterwards.
>>
>> All inode operations that create new filesystem objects have been
>> updated to call vfs_prepare_mode() before passing the mode into the
>> relevant inode operation of the filesystems. Care has been taken to
>> ensure that the mode passed to the security hooks is the mode that is
>> seen by the filesystem.
>>
>> Following is an overview of the filesystem specific and inode operations
>> specific implications:
>>
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
>> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
>> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
>> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
>>
>> All of the above filesystems end up calling inode_init_owner() when new
>> filesystem objects are created through the following ->mkdir(),
>> ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
>> operations.
>>
>> Since directories always inherit the S_ISGID bit with the exception of
>> xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
>> apply. The ->symlink() inode operation trivially inherit the mode from
>> the target and the ->rename() inode operation inherits the mode from the
>> source inode.
>>
>> All other inode operations will have the S_ISGID bit stripped once in
>> vfs_prepare_mode() before.
>>
>> In addition to this there are filesystems which allow the creation of
>> filesystem objects through ioctl()s or - in the case of spufs -
>> circumventing the vfs in other ways. If filesystem objects are created
>> through ioctl()s the vfs doesn't know about it and can't apply regular
>> permission checking including S_ISGID logic. Therfore, a filesystem
>> relying on S_ISGID stripping in inode_init_owner() in their ioctl()
>> callpath will be affected by moving this logic into the vfs.
>>
>> So we did our best to audit all filesystems in this regard:
>>
>> * btrfs allows the creation of filesystem objects through various
>>    ioctls(). Snapshot creation literally takes a snapshot and so the mode
>>    is fully preserved and S_ISGID stripping doesn't apply.
>>
>>    Creating a new subvolum relies on inode_init_owner() in
>>    btrfs_new_inode() but only creates directories and doesn't raise
>>    S_ISGID.
>>
>> * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
>>    xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
>>    the actual extents ocfs2 uses a separate ioctl() that also creates the
>>    target file.
>>
>>    Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
>>    inode_init_owner() to strip the S_ISGID bit. This is the only place
>>    where a filesystem needs to call mode_strip_sgid() directly but this
>>    is self-inflicted pain tbh.
>>
>> * spufs doesn't go through the vfs at all and doesn't use ioctl()s
>>    either. Instead it has a dedicated system call spufs_create() which
>>    allows the creation of filesystem objects. But spufs only creates
>>    directories and doesn't allo S_SIGID bits, i.e. it specifically only
>>    allows 0777 bits.
>>
>> * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
>>
>> While we did our best to audit everything there's a risk of regressions
>> in here. However, for the sake of maintenance and given that we've seen
>> a range of bugs years after S_ISGID inheritance issues were fixed (see
>> [1]-[3]) the risk seems worth taking. In the worst case we will have to
>> revert.
>>
>> Associated with this change is a new set of fstests to enforce the
>> semantics for all new filesystems.
>>
>> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
>> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
>> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
>> Reviewed-by: Darrick J. Wong<djwong@kernel.org>
>> Suggested-by: Dave Chinner<david@fromorbit.com>
>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>> ---
>>   fs/inode.c         |  2 --
>>   fs/namei.c         | 22 +++++++++-------------
>>   fs/ocfs2/namei.c   |  1 +
>>   include/linux/fs.h | 11 +++++++++++
>>   4 files changed, 21 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index e9a5f2ec2f89..dd357f4b556d 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		/* Directories are special, and always inherit S_ISGID */
>>   		if (S_ISDIR(mode))
>>   			mode |= S_ISGID;
>> -		else
>> -			mode = mode_strip_sgid(mnt_userns, dir, mode);
>>   	} else
>>   		inode_fsgid_set(inode, mnt_userns);
>>   	inode->i_mode = mode;
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 73646e28fae0..5dbf00704ae8 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>   	if (open_flag&  O_CREAT) {
>>   		if (open_flag&  O_EXCL)
>>   			open_flag&= ~O_TRUNC;
>> -		if (!IS_POSIXACL(dir->d_inode))
>> -			mode&= ~current_umask();
>> +		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
>>   		if (likely(got_write))
>>   			create_error = may_o_create(mnt_userns,&nd->path,
>>   						    dentry, mode);
>> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>   	child = d_alloc(dentry,&slash_name);
>>   	if (unlikely(!child))
>>   		goto out_err;
>> -	if (!IS_POSIXACL(dir))
>> -		mode&= ~current_umask();
>> +	mode = vfs_prepare_mode(mnt_userns, dir, mode);
>>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>   	if (error)
>>   		goto out_err;
>> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>   	if (IS_ERR(dentry))
>>   		goto out1;
>>
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode&= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>   	error = security_path_mknod(&path, dentry, mode, dev);
>>   	if (error)
>>   		goto out2;
>>
>> -	mnt_userns = mnt_user_ns(path.mnt);
>>   	switch (mode&  S_IFMT) {
>>   		case 0: case S_IFREG:
>>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
>> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	struct path path;
>>   	int error;
>>   	unsigned int lookup_flags = LOOKUP_DIRECTORY;
>> +	struct user_namespace *mnt_userns;
>>
>>   retry:
>>   	dentry = filename_create(dfd, name,&path, lookup_flags);
>> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	if (IS_ERR(dentry))
>>   		goto out_putname;
>>
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode&= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>   	error = security_path_mkdir(&path, dentry, mode);
>> -	if (!error) {
>> -		struct user_namespace *mnt_userns;
>> -		mnt_userns = mnt_user_ns(path.mnt);
>> +	if (!error)
>>   		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>>   				  mode);
>> -	}
>> +
>>   	done_path_create(&path, dentry);
>>   	if (retry_estale(error, lookup_flags)) {
>>   		lookup_flags |= LOOKUP_REVAL;
>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>> index c75fd54b9185..961d1cf54388 100644
>> --- a/fs/ocfs2/namei.c
>> +++ b/fs/ocfs2/namei.c
>> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>>   	 * callers. */
>>   	if (S_ISDIR(mode))
>>   		set_nlink(inode, 2);
>> +	mode = mode_strip_sgid(&init_user_ns, dir, mode);
>>   	inode_init_owner(&init_user_ns, inode, dir, mode);
>>   	status = dquot_initialize(inode);
>>   	if (status)
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 98b44a2732f5..914c8f28bb02 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
>>   	return !IS_DEADDIR(inode);
>>   }
>>
>> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
>> +				   const struct inode *dir, umode_t mode)
>
> Sorry, since you're only calling the helper in fs/namei.c you don't need
> to expose it in fs.h; just keep it local to fs/namei.c.

Oh, yes, will move it into fs/namei.c.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-26 14:52 ` [PATCH v8 1/4] fs: add mode_strip_sgid() helper Jeff Layton
@ 2022-04-27  1:34   ` xuyang2018.jy
  0 siblings, 0 replies; 29+ messages in thread
From: xuyang2018.jy @ 2022-04-27  1:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, ceph-devel, viro, david, djwong, brauner, willy

于 2022/4/26 22:52, Jeff Layton 写道:
> On Tue, 2022-04-26 at 19:11 +0800, Yang Xu wrote:
>> Add a dedicated helper to handle the setgid bit when creating a new file
>> in a setgid directory. This is a preparatory patch for moving setgid
>> stripping into the vfs. The patch contains no functional changes.
>>
>> Currently the setgid stripping logic is open-coded directly in
>> inode_init_owner() and the individual filesystems are responsible for
>> handling setgid inheritance. Since this has proven to be brittle as
>> evidenced by old issues we uncovered over the last months (see [1] to
>> [3] below) we will try to move this logic into the vfs.
>>
>> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
>> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
>> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
>> Reviewed-by: Darrick J. Wong<djwong@kernel.org>
>> Reviewed-by: Christian Brauner (Microsoft)<brauner@kernel.org>
>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>> ---
>>   fs/inode.c         | 37 +++++++++++++++++++++++++++++++++----
>>   include/linux/fs.h |  2 ++
>>   2 files changed, 35 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 9d9b422504d1..e9a5f2ec2f89 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		/* Directories are special, and always inherit S_ISGID */
>>   		if (S_ISDIR(mode))
>>   			mode |= S_ISGID;
>> -		else if ((mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
>> -			 !in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
>> -			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
>> -			mode&= ~S_ISGID;
>> +		else
>> +			mode = mode_strip_sgid(mnt_userns, dir, mode);
>>   	} else
>>   		inode_fsgid_set(inode, mnt_userns);
>>   	inode->i_mode = mode;
>> @@ -2405,3 +2403,34 @@ struct timespec64 current_time(struct inode *inode)
>>   	return timestamp_truncate(now, inode);
>>   }
>>   EXPORT_SYMBOL(current_time);
>> +
>> +/**
>> + * mode_strip_sgid - handle the sgid bit for non-directories
>> + * @mnt_userns: User namespace of the mount the inode was created from
>> + * @dir: parent directory inode
>> + * @mode: mode of the file to be created in @dir
>> + *
>> + * If the @mode of the new file has both the S_ISGID and S_IXGRP bit
>> + * raised and @dir has the S_ISGID bit raised ensure that the caller is
>> + * either in the group of the parent directory or they have CAP_FSETID
>> + * in their user namespace and are privileged over the parent directory.
>> + * In all other cases, strip the S_ISGID bit from @mode.
>> + *
>> + * Return: the new mode to use for the file
>> + */
>> +umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
>> +			 const struct inode *dir, umode_t mode)
>> +{
>> +	if (S_ISDIR(mode) || !dir || !(dir->i_mode&  S_ISGID))
>> +		return mode;
>> +	if ((mode&  (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
>> +		return mode;
>> +	if (in_group_p(i_gid_into_mnt(mnt_userns, dir)))
>> +		return mode;
>> +	if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
>> +		return mode;
>> +
>> +	mode&= ~S_ISGID;
>> +	return mode;
>> +}
>> +EXPORT_SYMBOL(mode_strip_sgid);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index bbde95387a23..98b44a2732f5 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1897,6 +1897,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
>>   void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		      const struct inode *dir, umode_t mode);
>>   extern bool may_open_dev(const struct path *path);
>> +umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
>> +			 const struct inode *dir, umode_t mode);
>>
>>   /*
>>    * This is the "filldir" function type, used by readdir() to let
>
> This series looks like a nice cleanup. I went ahead and added this pile
> to another kernel I was testing with xfstests and it seemed to do fine.
>
> You can add this (or some variant of it) to all 4 patches.
>
> Reviewed-and-Tested-by: Jeff Layton<jlayton@kernel.org>

Thanks, I also have a xfstests patch set[1] for testing this, but now it 
is not a final version.

[1]https://patchwork.kernel.org/project/fstests/list/?series=631461

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-26 14:53         ` Christian Brauner
@ 2022-04-27  9:22           ` Christian Brauner
  2022-04-28  4:45             ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-27  9:22 UTC (permalink / raw)
  To: Yang Xu, Dave Chinner, Darrick J. Wong, Matthew Wilcox,
	Jeff Layton, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel, ceph-devel, Al Viro

On Tue, Apr 26, 2022 at 04:53:49PM +0200, Christian Brauner wrote:
> On Tue, Apr 26, 2022 at 01:52:11PM +0200, Miklos Szeredi wrote:
> > On Tue, 26 Apr 2022 at 13:21, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Apr 26, 2022 at 1:38 PM Christian Brauner <brauner@kernel.org> wrote:
> > 
> > > > One thing that I just remembered and which I think I haven't mentioned
> > > > so far is that moving S_ISGID stripping from filesystem callpaths into
> > > > the vfs callpaths means that we're hoisting this logic out of vfs_*()
> > > > helpers implicitly.
> > > >
> > > > So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
> > > > stripping being done in vfs_*() helpers anymore unless they pass the
> > > > mode on from a prior run through the vfs.
> > > >
> > > > This mostly affects overlayfs which calls vfs_*() functions directly. So
> > > > a typical overlayfs callstack would be (roughly - I'm omw to lunch):
> > > >
> > > > sys_mknod()
> > > > -> do_mknodat(mode) // calls vfs_prepare_mode()
> > > >    -> .mknod = ovl_mknod(mode)
> > > >       -> ovl_create(mode)
> > > >          -> vfs_mknod(mode)
> > > >
> > > > I think we are safe as overlayfs passes on the mode on from its own run
> > > > through the vfs and then via vfs_*() to the underlying filesystem but it
> > > > is worth point that out.
> > > >
> > > > Ccing Amir just for confirmation.
> > >
> > > Looks fine to me, but CC Miklos ...
> > 
> > Looks fine to me as well.  Overlayfs should share the mode (including
> > the suid and sgid bits), owner, group and ACL's with the underlying
> > filesystem, so clearing sgid based on overlay parent directory should
> > result in the same mode as if it was done based on the parent
> > directory on the underlying layer.
> 
> Ah yes, good point.
> 
> > 
> > AFAIU this logic is not affected by userns or mnt_userns, but
> > Christian would be best to confirm that.
> 
> It does depend on it as S_ISGID stripping requires knowledge about
> whether the caller has CAP_FSETID and is capable over the parent
> directory or if they are in the group the file is owned by.
> 
> I think ultimately it might just come down to moving vfs_prepare_mode()
> into vfs_*() helpers and not into the do_*at() helpers.
> 
> That would be cleaner anyway as right now we have this weird disconnect
> between vfs_tmpfile() and vfs_{create,mknod,mkdir}(). IOW, vfs_tmpfile()
> doesn't even have an associated do_*() wrapper where we could call
> vfs_prepare_mode() from.
> 
> So ultimately it might be nicer if we do it in vfs_*() helpers anyway.
> 
> The less pretty thing about it will be that the security_path_*() hooks
> also want a mode.
> 
> Right now these hooks receive the mode as it's passed in from userspace
> minus umask but before S_ISGID stripping happens.
> 
> Whereas I think they should really see what the filesystem sees and
> currently it's a bug that they see something else.
> 
> I need to think about this a bit.

So on top of that series (though it should just be folded in), does that
look reasonable?

From e993f81caae60fee4f77b40d46ad3863ea383493 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Wed, 27 Apr 2022 10:53:35 +0200
Subject: [PATCH] UNTESTED UNTESTED UNTESTED

As I realized yesterday we need to move mode preparation into the vfs_*()
instead of do_*() helpers as filesystems like overlayfs have the following
callstacks:

sys_mknod(ovl_path, mode)
-> do_mknodat(ovl_path, mode)
   -> .mknod = ovl_mknod(ovl_path, mode)
      -> vfs_mknod(xfs_path, mode)
	 -> .mknod = xfs_vn_mknod(xfs_path, mode)

and the requirement that this will yield the same mode as:

sys_mknod(xfs_path, mode)
-> do_mknodat(xfs_path, mode)
   -> .mknod = xfs_vn_mknod(xfs_path, mode)

By moving setgid stripping into vfs_*() helpers we achieve:

- Moving setgid stripping out of the individual filesystem's responsibility.
- Ensure that callers of vfs_*() helpers continue to get correct setgid
  stripping.

Another thing I realized while looking at this yesterday was the entanglement
with security hooks. Security hooks currently see a different mode than the
actual filesystem sees when it calls into inode_init_owner(). This patch
doesn't change that!

I originally thought that we might be able to make the security hooks see the
same mode that the filesystem will see. However, I have doubts. First, I don't
think that is achievable without more restructuring. Second, I don't think it's
required as the hooks have clearly been placed before any vfs_*() calls and
thereby have committed themselves to see the mode as passed in from userspace
(minus the umask). We will simply continue doing just exactly that
side-stepping the issue for now.

Sketched-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namei.c         | 103 +++++++++++++++++++++++++++++++++++++++------
 include/linux/fs.h |  11 -----
 2 files changed, 90 insertions(+), 24 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5dbf00704ae8..8b83db15ae5f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2998,6 +2998,71 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(unlock_rename);
 
+/**
+ * mode_strip_umask - handle vfs umask stripping
+ * @dir:	parent directory of the new inode
+ * @mode:	mode of the new inode to be created in @dir
+ *
+ * Umask stripping depends on whether or not the filesystem supports POSIX
+ * ACLs. If the filesystem doesn't support it umask stripping is done directly
+ * in here. If the filesystem does support POSIX ACLs umask stripping is
+ * deferred until the filesystem calls posix_acl_create().
+ *
+ * Returns: mode
+ */
+static inline umode_t mode_strip_umask(const struct inode *dir, umode_t mode)
+{
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
+	return mode;
+}
+
+/**
+ * vfs_prepare_mode - prepare the mode to be used for a new inode
+ * @mnt_userns:	user namespace of the mount the inode was found from
+ * @dir:	parent directory of the new inode
+ * @mode:	mode of the new inode
+ * @mask_perms:	allowed permission by the vfs
+ * @type:	type of file to be created
+ *
+ * This helper consolidates and enforces vfs restrictions on the @mode of a new
+ * object to be created.
+ *
+ * Umask stripping depends on whether the filesystem supports POSIX ACLs (see
+ * the kernel documentation for mode_strip_umask()). Moving umask stripping
+ * after setgid stripping allows the same ordering for both non-POSIX ACL and
+ * POSIX ACL supporting filesystems.
+ *
+ * Note that it's currently valid for @type to be 0 if a directory is created.
+ * Filesystems raise that flag individually and we need to check whether each
+ * filesystem can deal with receiving S_IFDIR from the vfs before we enforce a
+ * non-zero type.
+ *
+ * Returns: mode to be passed to the filesystem
+ */
+static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
+				       const struct inode *dir, umode_t mode,
+				       umode_t mask_perms, umode_t type)
+{
+	/*
+	 * S_ISGID stripping depends on the mode of the new file so make sure
+	 * that the caller gives us this information and splat if we miss it.
+	 */
+	WARN_ON_ONCE((mode & S_IFMT) == 0);
+
+	mode = mode_strip_sgid(mnt_userns, dir, mode);
+	mode = mode_strip_umask(dir, mode);
+
+	/*
+	 * Apply the vfs mandated allowed permission mask and set the type of
+	 * file to be created before we call into the filesystem.
+	 */
+	mode &= (mask_perms & ~S_IFMT);
+	mode |= (type & S_IFMT);
+
+	return mode;
+}
+
 /**
  * vfs_create - create new file
  * @mnt_userns:	user namespace of the mount the inode was found from
@@ -3023,8 +3088,9 @@ int vfs_create(struct user_namespace *mnt_userns, struct inode *dir,
 
 	if (!dir->i_op->create)
 		return -EACCES;	/* shouldn't it be ENOSYS? */
-	mode &= S_IALLUGO;
-	mode |= S_IFREG;
+
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry), mode,
+				S_IALLUGO, S_IFREG);
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
@@ -3287,7 +3353,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
+		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode, mode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(mnt_userns, &nd->path,
 						    dentry, mode);
@@ -3520,7 +3586,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	mode = vfs_prepare_mode(mnt_userns, dir, mode);
+	mode = vfs_prepare_mode(mnt_userns, dir, mode, mode, mode);
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3798,6 +3864,8 @@ int vfs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
 	if (!dir->i_op->mknod)
 		return -EPERM;
 
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry),
+				mode, mode, mode);
 	error = devcgroup_inode_mknod(mode, dev);
 	if (error)
 		return error;
@@ -3848,12 +3916,13 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	if (IS_ERR(dentry))
 		goto out1;
 
-	mnt_userns = mnt_user_ns(path.mnt);
-	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
-	error = security_path_mknod(&path, dentry, mode, dev);
+	error = security_path_mknod(&path, dentry,
+				    mode_strip_umask(d_inode(path.dentry), mode),
+				    dev);
 	if (error)
 		goto out2;
 
+	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
@@ -3919,7 +3988,13 @@ int vfs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	if (!dir->i_op->mkdir)
 		return -EPERM;
 
-	mode &= (S_IRWXUGO|S_ISVTX);
+	/*
+	 * Filesystems currently raise S_IFDIR individually. We should try and
+	 * fix that going forward passing it in from the vfs as we do for all
+	 * other files going forward.
+	 */
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry),
+				mode, S_IRWXUGO | S_ISVTX, 0);
 	error = security_inode_mkdir(dir, dentry, mode);
 	if (error)
 		return error;
@@ -3940,7 +4015,6 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
-	struct user_namespace *mnt_userns;
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
@@ -3948,12 +4022,15 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	mnt_userns = mnt_user_ns(path.mnt);
-	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
-	error = security_path_mkdir(&path, dentry, mode);
-	if (!error)
+	error = security_path_mkdir(&path, dentry,
+				    mode_strip_umask(d_inode(path.dentry), mode));
+	if (!error) {
+		struct user_namespace *mnt_userns;
+
+		mnt_userns = mnt_user_ns(path.mnt);
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
+	}
 
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 914c8f28bb02..98b44a2732f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3459,17 +3459,6 @@ static inline bool dir_relax_shared(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
-static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
-				   const struct inode *dir, umode_t mode)
-{
-	mode = mode_strip_sgid(mnt_userns, dir, mode);
-
-	if (!IS_POSIXACL(dir))
-		mode &= ~current_umask();
-
-	return mode;
-}
-
 extern bool path_noexec(const struct path *path);
 extern void inode_nohighmem(struct inode *inode);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
                   ` (3 preceding siblings ...)
  2022-04-26 14:52 ` [PATCH v8 1/4] fs: add mode_strip_sgid() helper Jeff Layton
@ 2022-04-28  1:59 ` Al Viro
  2022-04-28  2:15   ` Al Viro
  2022-04-28  8:25   ` Christian Brauner
  2022-04-28  4:40 ` Al Viro
  5 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2022-04-28  1:59 UTC (permalink / raw)
  To: Yang Xu
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

On Tue, Apr 26, 2022 at 07:11:27PM +0800, Yang Xu wrote:
> Add a dedicated helper to handle the setgid bit when creating a new file
> in a setgid directory. This is a preparatory patch for moving setgid
> stripping into the vfs. The patch contains no functional changes.
> 
> Currently the setgid stripping logic is open-coded directly in
> inode_init_owner() and the individual filesystems are responsible for
> handling setgid inheritance. Since this has proven to be brittle as
> evidenced by old issues we uncovered over the last months (see [1] to
> [3] below) we will try to move this logic into the vfs.

First of all, inode_init_owner() is (and always had been) an optional helper.
Filesystems are *NOT* required to call it, so putting any common functionality
in there had always been a mistake.

That goes for inode_fsuid_set() and inode_fsgid_set() calls as well.
Consider e.g. this:
struct inode *ext2_new_inode(struct inode *dir, umode_t mode,
                             const struct qstr *qstr)
{
	...
        if (test_opt(sb, GRPID)) {
		inode->i_mode = mode;
		inode->i_uid = current_fsuid();
		inode->i_gid = dir->i_gid;
	} else
		inode_init_owner(&init_user_ns, inode, dir, mode);

Here we have an explicit mount option, selecting the way S_ISGID on directories
is handled.  Mount ext2 with -o grpid and see for yourself - no inode_init_owner()
calls there.

The same goes for ext4 - that code is copied there unchanged.

What's more, I'm not sure that Jann's fix made any sense in the first place.
After all, the file being created here is empty; exec on it won't give you
anything - it'll simply fail.  And modifying that file ought to strip SGID,
or we have much more interesting problems.

What am I missing here?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  1:59 ` Al Viro
@ 2022-04-28  2:15   ` Al Viro
  2022-04-28  2:23     ` xuyang2018.jy
  2022-04-28  8:25   ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-04-28  2:15 UTC (permalink / raw)
  To: Yang Xu
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 01:59:01AM +0000, Al Viro wrote:
> On Tue, Apr 26, 2022 at 07:11:27PM +0800, Yang Xu wrote:
> > Add a dedicated helper to handle the setgid bit when creating a new file
> > in a setgid directory. This is a preparatory patch for moving setgid
> > stripping into the vfs. The patch contains no functional changes.
> > 
> > Currently the setgid stripping logic is open-coded directly in
> > inode_init_owner() and the individual filesystems are responsible for
> > handling setgid inheritance. Since this has proven to be brittle as
> > evidenced by old issues we uncovered over the last months (see [1] to
> > [3] below) we will try to move this logic into the vfs.
> 
> First of all, inode_init_owner() is (and always had been) an optional helper.
> Filesystems are *NOT* required to call it, so putting any common functionality
> in there had always been a mistake.
> 
> That goes for inode_fsuid_set() and inode_fsgid_set() calls as well.
> Consider e.g. this:
> struct inode *ext2_new_inode(struct inode *dir, umode_t mode,
>                              const struct qstr *qstr)
> {
> 	...
>         if (test_opt(sb, GRPID)) {
> 		inode->i_mode = mode;
> 		inode->i_uid = current_fsuid();
> 		inode->i_gid = dir->i_gid;
> 	} else
> 		inode_init_owner(&init_user_ns, inode, dir, mode);
> 
> Here we have an explicit mount option, selecting the way S_ISGID on directories
> is handled.  Mount ext2 with -o grpid and see for yourself - no inode_init_owner()
> calls there.
> 
> The same goes for ext4 - that code is copied there unchanged.
> 
> What's more, I'm not sure that Jann's fix made any sense in the first place.
> After all, the file being created here is empty; exec on it won't give you
> anything - it'll simply fail.  And modifying that file ought to strip SGID,
> or we have much more interesting problems.
> 
> What am I missing here?

BTW, xfs has grpid option as well:
	if (dir && !(dir->i_mode & S_ISGID) && xfs_has_grpid(mp)) {
		inode_fsuid_set(inode, mnt_userns);
		inode->i_gid = dir->i_gid;
		inode->i_mode = mode;
	} else {
		inode_init_owner(mnt_userns, inode, dir, mode);
	}

We could lift that stuff into VFS, but it would require lifting that flag
(BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
and ignores SGID on directories) into generic superblock.  Otherwise we'd
be breaking existing behaviour for ext* and xfs...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  2:15   ` Al Viro
@ 2022-04-28  2:23     ` xuyang2018.jy
  2022-04-28  2:49       ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: xuyang2018.jy @ 2022-04-28  2:23 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

on 2022/4/28 10:15, Al Viro wrote:
> On Thu, Apr 28, 2022 at 01:59:01AM +0000, Al Viro wrote:
>> On Tue, Apr 26, 2022 at 07:11:27PM +0800, Yang Xu wrote:
>>> Add a dedicated helper to handle the setgid bit when creating a new file
>>> in a setgid directory. This is a preparatory patch for moving setgid
>>> stripping into the vfs. The patch contains no functional changes.
>>>
>>> Currently the setgid stripping logic is open-coded directly in
>>> inode_init_owner() and the individual filesystems are responsible for
>>> handling setgid inheritance. Since this has proven to be brittle as
>>> evidenced by old issues we uncovered over the last months (see [1] to
>>> [3] below) we will try to move this logic into the vfs.
>>
>> First of all, inode_init_owner() is (and always had been) an optional helper.
>> Filesystems are *NOT* required to call it, so putting any common functionality
>> in there had always been a mistake.
>>
>> That goes for inode_fsuid_set() and inode_fsgid_set() calls as well.
>> Consider e.g. this:
>> struct inode *ext2_new_inode(struct inode *dir, umode_t mode,
>>                               const struct qstr *qstr)
>> {
>> 	...
>>          if (test_opt(sb, GRPID)) {
>> 		inode->i_mode = mode;
>> 		inode->i_uid = current_fsuid();
>> 		inode->i_gid = dir->i_gid;
>> 	} else
>> 		inode_init_owner(&init_user_ns, inode, dir, mode);
>>
>> Here we have an explicit mount option, selecting the way S_ISGID on directories
>> is handled.  Mount ext2 with -o grpid and see for yourself - no inode_init_owner()
>> calls there.
>>
>> The same goes for ext4 - that code is copied there unchanged.
>>
>> What's more, I'm not sure that Jann's fix made any sense in the first place.
>> After all, the file being created here is empty; exec on it won't give you
>> anything - it'll simply fail.  And modifying that file ought to strip SGID,
>> or we have much more interesting problems.
>>
>> What am I missing here?
>
> BTW, xfs has grpid option as well:
> 	if (dir&&  !(dir->i_mode&  S_ISGID)&&  xfs_has_grpid(mp)) {
> 		inode_fsuid_set(inode, mnt_userns);
> 		inode->i_gid = dir->i_gid;
> 		inode->i_mode = mode;
> 	} else {
> 		inode_init_owner(mnt_userns, inode, dir, mode);
> 	}
>
> We could lift that stuff into VFS, but it would require lifting that flag
> (BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
> and ignores SGID on directories) into generic superblock.  Otherwise we'd
> be breaking existing behaviour for ext* and xfs...

I also mentioned it in my previous version(in the 3/4 patch)
"This patch also changed grpid behaviour for ext4/xfs because the mode 
passed to them may been changed by vfs_prepare_mode.
"

I guess we can add a  grpid option check in vfs_prepare_mode or in 
mode_strip_sgid, then it should not break the existing behaviour for 
ext* and xfs.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  2:23     ` xuyang2018.jy
@ 2022-04-28  2:49       ` Al Viro
  2022-04-28  3:12         ` Al Viro
  2022-04-28  8:44         ` Christian Brauner
  0 siblings, 2 replies; 29+ messages in thread
From: Al Viro @ 2022-04-28  2:49 UTC (permalink / raw)
  To: xuyang2018.jy
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 02:23:25AM +0000, xuyang2018.jy@fujitsu.com wrote:

> > BTW, xfs has grpid option as well:
> > 	if (dir&&  !(dir->i_mode&  S_ISGID)&&  xfs_has_grpid(mp)) {
> > 		inode_fsuid_set(inode, mnt_userns);
> > 		inode->i_gid = dir->i_gid;
> > 		inode->i_mode = mode;
> > 	} else {
> > 		inode_init_owner(mnt_userns, inode, dir, mode);
> > 	}
> >
> > We could lift that stuff into VFS, but it would require lifting that flag
> > (BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
> > and ignores SGID on directories) into generic superblock.  Otherwise we'd
> > be breaking existing behaviour for ext* and xfs...
> 
> I also mentioned it in my previous version(in the 3/4 patch)
> "This patch also changed grpid behaviour for ext4/xfs because the mode 
> passed to them may been changed by vfs_prepare_mode.
> "
> 
> I guess we can add a  grpid option check in vfs_prepare_mode or in 
> mode_strip_sgid, then it should not break the existing behaviour for 
> ext* and xfs.

I don't like it, TBH.  That way we have
	1) caller mangles mode (after having looked at grpid, etc.)
	2) filesystem checks grpid and either calls inode_init_owner(),
which might (or might not) modify the gid to be used or skips the
call and assigns gid directly.

It's asking for trouble.  We have two places where the predicate is
checked; the first mangles mode (and I'm still not convinced that we
need to bother with that at all), the second affects gid (and for
mkdir in sgid directory on non-grpid ones inherits sgid).

That kind of structure is asking for trouble.  *IF* we make inode_init_owner()
aware of grpid (and make that predicate usable from generic helper), we might
as well just make inode_init_owner() mandatory (for the first time ever) and
get rid of grpid checks in filesystems themselves.  But then there's no point
doing it in method callers.

Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
ext2 doesn't bother with either in such case...

Let's try to separate the issues here.  Jann, could you explain what makes
empty sgid files dangerous?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  2:49       ` Al Viro
@ 2022-04-28  3:12         ` Al Viro
  2022-04-28  3:46           ` Al Viro
  2022-04-28  8:06           ` Jann Horn
  2022-04-28  8:44         ` Christian Brauner
  1 sibling, 2 replies; 29+ messages in thread
From: Al Viro @ 2022-04-28  3:12 UTC (permalink / raw)
  To: xuyang2018.jy
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 02:49:11AM +0000, Al Viro wrote:
> On Thu, Apr 28, 2022 at 02:23:25AM +0000, xuyang2018.jy@fujitsu.com wrote:
> 
> > > BTW, xfs has grpid option as well:
> > > 	if (dir&&  !(dir->i_mode&  S_ISGID)&&  xfs_has_grpid(mp)) {
> > > 		inode_fsuid_set(inode, mnt_userns);
> > > 		inode->i_gid = dir->i_gid;
> > > 		inode->i_mode = mode;
> > > 	} else {
> > > 		inode_init_owner(mnt_userns, inode, dir, mode);
> > > 	}
> > >
> > > We could lift that stuff into VFS, but it would require lifting that flag
> > > (BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
> > > and ignores SGID on directories) into generic superblock.  Otherwise we'd
> > > be breaking existing behaviour for ext* and xfs...
> > 
> > I also mentioned it in my previous version(in the 3/4 patch)
> > "This patch also changed grpid behaviour for ext4/xfs because the mode 
> > passed to them may been changed by vfs_prepare_mode.
> > "
> > 
> > I guess we can add a  grpid option check in vfs_prepare_mode or in 
> > mode_strip_sgid, then it should not break the existing behaviour for 
> > ext* and xfs.
> 
> I don't like it, TBH.  That way we have
> 	1) caller mangles mode (after having looked at grpid, etc.)
> 	2) filesystem checks grpid and either calls inode_init_owner(),
> which might (or might not) modify the gid to be used or skips the
> call and assigns gid directly.
> 
> It's asking for trouble.  We have two places where the predicate is
> checked; the first mangles mode (and I'm still not convinced that we
> need to bother with that at all), the second affects gid (and for
> mkdir in sgid directory on non-grpid ones inherits sgid).
> 
> That kind of structure is asking for trouble.  *IF* we make inode_init_owner()
> aware of grpid (and make that predicate usable from generic helper), we might
> as well just make inode_init_owner() mandatory (for the first time ever) and
> get rid of grpid checks in filesystems themselves.  But then there's no point
> doing it in method callers.
> 
> Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> ext2 doesn't bother with either in such case...
> 
> Let's try to separate the issues here.  Jann, could you explain what makes
> empty sgid files dangerous?

Found the original thread in old mailbox, and the method of avoiding the
SGID removal on modification is usable.  Which answers the question above...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  3:12         ` Al Viro
@ 2022-04-28  3:46           ` Al Viro
  2022-04-28  9:34             ` Christian Brauner
  2022-04-28  8:06           ` Jann Horn
  1 sibling, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-04-28  3:46 UTC (permalink / raw)
  To: xuyang2018.jy
  Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 03:12:30AM +0000, Al Viro wrote:

> > Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> > path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> > ext2 doesn't bother with either in such case...
> > 
> > Let's try to separate the issues here.  Jann, could you explain what makes
> > empty sgid files dangerous?
> 
> Found the original thread in old mailbox, and the method of avoiding the
> SGID removal on modification is usable.  Which answers the question above...

OK, what do we want for grpid mounts?  Aside of "don't forget inode_fsuid_set(),
please", that is.  We don't want inode_fsgid_set() there (whatever went for
the parent directory should be the right value for the child).  Same "strip
SGID from non-directory child, unless we are in the resulting group"?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
                   ` (4 preceding siblings ...)
  2022-04-28  1:59 ` Al Viro
@ 2022-04-28  4:40 ` Al Viro
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-04-28  4:40 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, david, djwong, brauner, willy, jlayton

On Tue, Apr 26, 2022 at 07:11:27PM +0800, Yang Xu wrote:


> +umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
> +			 const struct inode *dir, umode_t mode)
> +{
> +	if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID))
> +		return mode;
> +	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
> +		return mode;
> +	if (in_group_p(i_gid_into_mnt(mnt_userns, dir)))
> +		return mode;
> +	if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> +		return mode;
> +
> +	mode &= ~S_ISGID;
> +	return mode;
> +}

	I'm fairly sure that absolute majority of calls will not
have S_ISGID passed in mode.  So I'd prefer to reorder the first
two ifs.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-27  9:22           ` Christian Brauner
@ 2022-04-28  4:45             ` Al Viro
  2022-04-28  8:07               ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-04-28  4:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Yang Xu, Dave Chinner, Darrick J. Wong, Matthew Wilcox,
	Jeff Layton, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	ceph-devel

On Wed, Apr 27, 2022 at 11:22:01AM +0200, Christian Brauner wrote:

> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> +				       const struct inode *dir, umode_t mode,
> +				       umode_t mask_perms, umode_t type)
> +{
> +	/*
> +	 * S_ISGID stripping depends on the mode of the new file so make sure
> +	 * that the caller gives us this information and splat if we miss it.
> +	 */
> +	WARN_ON_ONCE((mode & S_IFMT) == 0);

<blink>

First of all, what happens if you call mknod("/tmp/blah", 0, 0)?  And the only
thing about type bits we care about is "is it a directory" - the sensitive
stuff is in the low 12 bits...  What is that check about?

> +	mode = mode_strip_sgid(mnt_userns, dir, mode);
> +	mode = mode_strip_umask(dir, mode);
> +
> +	/*
> +	 * Apply the vfs mandated allowed permission mask and set the type of
> +	 * file to be created before we call into the filesystem.
> +	 */
> +	mode &= (mask_perms & ~S_IFMT);
> +	mode |= (type & S_IFMT);
> +
> +	return mode;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  3:12         ` Al Viro
  2022-04-28  3:46           ` Al Viro
@ 2022-04-28  8:06           ` Jann Horn
  1 sibling, 0 replies; 29+ messages in thread
From: Jann Horn @ 2022-04-28  8:06 UTC (permalink / raw)
  To: Al Viro
  Cc: xuyang2018.jy, linux-fsdevel, ceph-devel, david, djwong, brauner,
	willy, jlayton, Linus Torvalds

On Thu, Apr 28, 2022 at 5:12 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Apr 28, 2022 at 02:49:11AM +0000, Al Viro wrote:
> > Let's try to separate the issues here.  Jann, could you explain what makes
> > empty sgid files dangerous?
>
> Found the original thread in old mailbox, and the method of avoiding the
> SGID removal on modification is usable.  Which answers the question above...

As context for everyone on the thread who isn't on security@, you can see a
public copy of the bug report here:
https://bugs.chromium.org/p/project-zero/issues/detail?id=1611
and also here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779923

And the kernel patch in question is this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
  2022-04-28  4:45             ` Al Viro
@ 2022-04-28  8:07               ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2022-04-28  8:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Yang Xu, Dave Chinner, Darrick J. Wong, Matthew Wilcox,
	Jeff Layton, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	ceph-devel

On Thu, Apr 28, 2022 at 04:45:05AM +0000, Al Viro wrote:
> On Wed, Apr 27, 2022 at 11:22:01AM +0200, Christian Brauner wrote:
> 
> > +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> > +				       const struct inode *dir, umode_t mode,
> > +				       umode_t mask_perms, umode_t type)
> > +{
> > +	/*
> > +	 * S_ISGID stripping depends on the mode of the new file so make sure
> > +	 * that the caller gives us this information and splat if we miss it.
> > +	 */
> > +	WARN_ON_ONCE((mode & S_IFMT) == 0);
> 
> <blink>
> 
> First of all, what happens if you call mknod("/tmp/blah", 0, 0)?  And the only
> thing about type bits we care about is "is it a directory" - the sensitive
> stuff is in the low 12 bits...  What is that check about?

Do note that this is just an untested rough sketch to illustrate how to
move it into vfs_*() helpers.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  1:59 ` Al Viro
  2022-04-28  2:15   ` Al Viro
@ 2022-04-28  8:25   ` Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2022-04-28  8:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Yang Xu, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 01:59:01AM +0000, Al Viro wrote:
> On Tue, Apr 26, 2022 at 07:11:27PM +0800, Yang Xu wrote:
> > Add a dedicated helper to handle the setgid bit when creating a new file
> > in a setgid directory. This is a preparatory patch for moving setgid
> > stripping into the vfs. The patch contains no functional changes.
> > 
> > Currently the setgid stripping logic is open-coded directly in
> > inode_init_owner() and the individual filesystems are responsible for
> > handling setgid inheritance. Since this has proven to be brittle as
> > evidenced by old issues we uncovered over the last months (see [1] to
> > [3] below) we will try to move this logic into the vfs.
> 
> First of all, inode_init_owner() is (and always had been) an optional helper.

The whole patch series was triggered because ever since I added setgid
inheritance tests (see [1]) as part of the idmapped mounts test suite
into xfstests we found 3 setgid inheritance bugs (The bugs are linked in
the commit messages.).
The bugs showed up whenever a filesystem didn't call inode_init_owner()
or had custom code in place that deviated from expectations.

That's what triggered this whole patch series. Yang took it on and seems
here to see it through.

I should point out that it was rather unclear what expectations are btw
because of the ordering dependency between umask and POSIX ACLs and
setgid stripping. I've describe this at length in the commit message I
gave Yang.

It took a lot of digging and over the course of me reviewing this patch
series more and more corner-cases pop up that we haven't handled.

> Filesystems are *NOT* required to call it, so putting any common functionality
> in there had always been a mistake.

See above. I pointed this out in earlier version.
I very much agree which is why we should move it into the vfs proper if
we can with reasonably minimal regression risk.

[1]: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/idmapped-mounts.c#n7812

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  2:49       ` Al Viro
  2022-04-28  3:12         ` Al Viro
@ 2022-04-28  8:44         ` Christian Brauner
  2022-04-28 11:55           ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-28  8:44 UTC (permalink / raw)
  To: Al Viro
  Cc: xuyang2018.jy, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 02:49:11AM +0000, Al Viro wrote:
> On Thu, Apr 28, 2022 at 02:23:25AM +0000, xuyang2018.jy@fujitsu.com wrote:
> 
> > > BTW, xfs has grpid option as well:
> > > 	if (dir&&  !(dir->i_mode&  S_ISGID)&&  xfs_has_grpid(mp)) {
> > > 		inode_fsuid_set(inode, mnt_userns);
> > > 		inode->i_gid = dir->i_gid;
> > > 		inode->i_mode = mode;
> > > 	} else {
> > > 		inode_init_owner(mnt_userns, inode, dir, mode);
> > > 	}
> > >
> > > We could lift that stuff into VFS, but it would require lifting that flag
> > > (BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
> > > and ignores SGID on directories) into generic superblock.  Otherwise we'd
> > > be breaking existing behaviour for ext* and xfs...
> > 
> > I also mentioned it in my previous version(in the 3/4 patch)
> > "This patch also changed grpid behaviour for ext4/xfs because the mode 
> > passed to them may been changed by vfs_prepare_mode.
> > "
> > 
> > I guess we can add a  grpid option check in vfs_prepare_mode or in 
> > mode_strip_sgid, then it should not break the existing behaviour for 
> > ext* and xfs.
> 
> I don't like it, TBH.  That way we have
> 	1) caller mangles mode (after having looked at grpid, etc.)
> 	2) filesystem checks grpid and either calls inode_init_owner(),
> which might (or might not) modify the gid to be used or skips the
> call and assigns gid directly.
> 
> It's asking for trouble.  We have two places where the predicate is
> checked; the first mangles mode (and I'm still not convinced that we
> need to bother with that at all), the second affects gid (and for
> mkdir in sgid directory on non-grpid ones inherits sgid).
> 
> That kind of structure is asking for trouble.  *IF* we make inode_init_owner()
> aware of grpid (and make that predicate usable from generic helper), we might
> as well just make inode_init_owner() mandatory (for the first time ever) and
> get rid of grpid checks in filesystems themselves.  But then there's no point
> doing it in method callers.

This has ordering issues. In the vfs the umask is stripped and then we
call into the filesystem and inode_init_owner() is called. So if
S_IXGRP is removed by the umask then we inherit the S_ISGID bit.

But if the filesystem uses POSIX ACLs then the umask isn't stripped in
the vfs and instead may be done (I say "may" because it depends on
whether or not applicable POSIX ACLs are found) in posix_acl_create().

But in order to call posix_acl_create() the filesystem will have often
ran through inode_init_owner() first. For example, ext4 does
inode_init_owner() and afterwards calls ext4_init_acl() which in turn
ends up calling posix_acl_create() which _may_ strip the umask.

Iow, you end up with two possible setgid removal paths:
* strip setgid first, apply umask (no POSIX ACLs supported)
* apply umask, strip setgid (POSIX ACLs supported)
with possibly different results.

Mandating inode_init_owner() being used doesn't solve that and I think
it's still brittle overall.

If we can hoist all of this into vfs_*() before we call into the
filesystem we're better off in the long run. It's worth the imho
negible regression risk. 

> 
> Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> ext2 doesn't bother with either in such case...

Using inode_fs*id_set() is only relevant when the inode is initialized
based on the caller's fs*id. If you request to inherit the parent
directories gid then the caller's gid doesn't matter.

ext2 doesn't need to care at all about this because it doesn't raise
FS_ALLOW_IDMAP.

> 
> Let's try to separate the issues here.  Jann, could you explain what makes
> empty sgid files dangerous?

I see that's answered in a later mail.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  3:46           ` Al Viro
@ 2022-04-28  9:34             ` Christian Brauner
  2022-05-19  1:03               ` xuyang2018.jy
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2022-04-28  9:34 UTC (permalink / raw)
  To: Al Viro
  Cc: xuyang2018.jy, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 03:46:59AM +0000, Al Viro wrote:
> On Thu, Apr 28, 2022 at 03:12:30AM +0000, Al Viro wrote:
> 
> > > Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> > > path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> > > ext2 doesn't bother with either in such case...
> > > 
> > > Let's try to separate the issues here.  Jann, could you explain what makes
> > > empty sgid files dangerous?
> > 
> > Found the original thread in old mailbox, and the method of avoiding the
> > SGID removal on modification is usable.  Which answers the question above...
> 
> OK, what do we want for grpid mounts?  Aside of "don't forget inode_fsuid_set(),
> please", that is.  We don't want inode_fsgid_set() there (whatever went for
> the parent directory should be the right value for the child).  Same "strip

Exactly. You sounded puzzled why we don't call that in an earlier mail.

> SGID from non-directory child, unless we are in the resulting group"?

Honestly, I think we should try and see if we can't use the same setgid
inheritance enforcement of the new mode_strip_sgid() helper for the
grpid mount option as well. Iow, just don't give the grpid mount option
a separate setgid treatment and try it with the current approach.

It'll allow us to move things into vfs proper which I think is a robust
solution with clear semantics. It also gives us a uniform ordering wrt
to umask stripping and POSIX ACLs.

Yes, as we've pointed out in the thread this carries a non-zero
regression risk. But so does the whole patch series. But this might end
up being a big win security wise and makes maintenance way easier going
forward.

The current setgid situation is thoroughly messy though and we keep
plugging holes. Even writing tests for the current situation is an
almost herculean task let alone reviewing it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  8:44         ` Christian Brauner
@ 2022-04-28 11:55           ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2022-04-28 11:55 UTC (permalink / raw)
  To: Al Viro
  Cc: xuyang2018.jy, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, Apr 28, 2022 at 10:45:01AM +0200, Christian Brauner wrote:
> On Thu, Apr 28, 2022 at 02:49:11AM +0000, Al Viro wrote:
> > On Thu, Apr 28, 2022 at 02:23:25AM +0000, xuyang2018.jy@fujitsu.com wrote:
> > 
> > > > BTW, xfs has grpid option as well:
> > > > 	if (dir&&  !(dir->i_mode&  S_ISGID)&&  xfs_has_grpid(mp)) {
> > > > 		inode_fsuid_set(inode, mnt_userns);
> > > > 		inode->i_gid = dir->i_gid;
> > > > 		inode->i_mode = mode;
> > > > 	} else {
> > > > 		inode_init_owner(mnt_userns, inode, dir, mode);
> > > > 	}
> > > >
> > > > We could lift that stuff into VFS, but it would require lifting that flag
> > > > (BSD vs. SysV behaviour wrt GID - BSD *always* inherits GID from parent
> > > > and ignores SGID on directories) into generic superblock.  Otherwise we'd
> > > > be breaking existing behaviour for ext* and xfs...
> > > 
> > > I also mentioned it in my previous version(in the 3/4 patch)
> > > "This patch also changed grpid behaviour for ext4/xfs because the mode 
> > > passed to them may been changed by vfs_prepare_mode.
> > > "
> > > 
> > > I guess we can add a  grpid option check in vfs_prepare_mode or in 
> > > mode_strip_sgid, then it should not break the existing behaviour for 
> > > ext* and xfs.
> > 
> > I don't like it, TBH.  That way we have
> > 	1) caller mangles mode (after having looked at grpid, etc.)
> > 	2) filesystem checks grpid and either calls inode_init_owner(),
> > which might (or might not) modify the gid to be used or skips the
> > call and assigns gid directly.
> > 
> > It's asking for trouble.  We have two places where the predicate is
> > checked; the first mangles mode (and I'm still not convinced that we
> > need to bother with that at all), the second affects gid (and for
> > mkdir in sgid directory on non-grpid ones inherits sgid).
> > 
> > That kind of structure is asking for trouble.  *IF* we make inode_init_owner()
> > aware of grpid (and make that predicate usable from generic helper), we might
> > as well just make inode_init_owner() mandatory (for the first time ever) and
> > get rid of grpid checks in filesystems themselves.  But then there's no point
> > doing it in method callers.
> 
> This has ordering issues. In the vfs the umask is stripped and then we
> call into the filesystem and inode_init_owner() is called. So if
> S_IXGRP is removed by the umask then we inherit the S_ISGID bit.
> 
> But if the filesystem uses POSIX ACLs then the umask isn't stripped in
> the vfs and instead may be done (I say "may" because it depends on
> whether or not applicable POSIX ACLs are found) in posix_acl_create().
> 
> But in order to call posix_acl_create() the filesystem will have often
> ran through inode_init_owner() first. For example, ext4 does
> inode_init_owner() and afterwards calls ext4_init_acl() which in turn
> ends up calling posix_acl_create() which _may_ strip the umask.
> 
> Iow, you end up with two possible setgid removal paths:
> * strip setgid first, apply umask (no POSIX ACLs supported)
> * apply umask, strip setgid (POSIX ACLs supported)

Ugh, that's reversed:
* POSIX ACLs supported:
  1. strip setgid first
  2. (possibly) strip umask
* POSIX ACLS unsupported:
  1. apply umask
  2. strip setgid

> with possibly different results.
> 
> Mandating inode_init_owner() being used doesn't solve that and I think
> it's still brittle overall.
> 
> If we can hoist all of this into vfs_*() before we call into the
> filesystem we're better off in the long run. It's worth the imho
> negible regression risk. 
> 
> > 
> > Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> > path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> > ext2 doesn't bother with either in such case...
> 
> Using inode_fs*id_set() is only relevant when the inode is initialized
> based on the caller's fs*id. If you request to inherit the parent
> directories gid then the caller's gid doesn't matter.
> 
> ext2 doesn't need to care at all about this because it doesn't raise
> FS_ALLOW_IDMAP.
> 
> > 
> > Let's try to separate the issues here.  Jann, could you explain what makes
> > empty sgid files dangerous?
> 
> I see that's answered in a later mail.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-04-28  9:34             ` Christian Brauner
@ 2022-05-19  1:03               ` xuyang2018.jy
  2022-05-19  9:14                 ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: xuyang2018.jy @ 2022-05-19  1:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

on 2022/4/28 17:34, Christian Brauner wrote:
> On Thu, Apr 28, 2022 at 03:46:59AM +0000, Al Viro wrote:
>> On Thu, Apr 28, 2022 at 03:12:30AM +0000, Al Viro wrote:
>>
>>>> Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
>>>> path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
>>>> ext2 doesn't bother with either in such case...
>>>>
>>>> Let's try to separate the issues here.  Jann, could you explain what makes
>>>> empty sgid files dangerous?
>>>
>>> Found the original thread in old mailbox, and the method of avoiding the
>>> SGID removal on modification is usable.  Which answers the question above...
>>
>> OK, what do we want for grpid mounts?  Aside of "don't forget inode_fsuid_set(),
>> please", that is.  We don't want inode_fsgid_set() there (whatever went for
>> the parent directory should be the right value for the child).  Same "strip
>
> Exactly. You sounded puzzled why we don't call that in an earlier mail.
>
>> SGID from non-directory child, unless we are in the resulting group"?
>
> Honestly, I think we should try and see if we can't use the same setgid
> inheritance enforcement of the new mode_strip_sgid() helper for the
> grpid mount option as well. Iow, just don't give the grpid mount option
> a separate setgid treatment and try it with the current approach.
>
> It'll allow us to move things into vfs proper which I think is a robust
> solution with clear semantics. It also gives us a uniform ordering wrt
> to umask stripping and POSIX ACLs.
>
> Yes, as we've pointed out in the thread this carries a non-zero
> regression risk. But so does the whole patch series. But this might end
> up being a big win security wise and makes maintenance way easier going
> forward.
>
> The current setgid situation is thoroughly messy though and we keep
> plugging holes. Even writing tests for the current situation is an
> almost herculean task let alone reviewing it.

Sorry for the late reply.
I am agree with these. So what should I do in next step?

ps:I also think I may send test case to xfstests for posix acl and umask 
ASAP, then xfstests can merge these test and so more people will notice 
this problem.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/4] fs: add mode_strip_sgid() helper
  2022-05-19  1:03               ` xuyang2018.jy
@ 2022-05-19  9:14                 ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2022-05-19  9:14 UTC (permalink / raw)
  To: xuyang2018.jy
  Cc: Al Viro, linux-fsdevel, ceph-devel, david, djwong, willy,
	jlayton, Linus Torvalds, Jann Horn

On Thu, May 19, 2022 at 01:03:12AM +0000, xuyang2018.jy@fujitsu.com wrote:
> on 2022/4/28 17:34, Christian Brauner wrote:
> > On Thu, Apr 28, 2022 at 03:46:59AM +0000, Al Viro wrote:
> >> On Thu, Apr 28, 2022 at 03:12:30AM +0000, Al Viro wrote:
> >>
> >>>> Note, BTW, that while XFS has inode_fsuid_set() on the non-inode_init_owner()
> >>>> path, it doesn't have inode_fsgid_set() there.  Same goes for ext4, while
> >>>> ext2 doesn't bother with either in such case...
> >>>>
> >>>> Let's try to separate the issues here.  Jann, could you explain what makes
> >>>> empty sgid files dangerous?
> >>>
> >>> Found the original thread in old mailbox, and the method of avoiding the
> >>> SGID removal on modification is usable.  Which answers the question above...
> >>
> >> OK, what do we want for grpid mounts?  Aside of "don't forget inode_fsuid_set(),
> >> please", that is.  We don't want inode_fsgid_set() there (whatever went for
> >> the parent directory should be the right value for the child).  Same "strip
> >
> > Exactly. You sounded puzzled why we don't call that in an earlier mail.
> >
> >> SGID from non-directory child, unless we are in the resulting group"?
> >
> > Honestly, I think we should try and see if we can't use the same setgid
> > inheritance enforcement of the new mode_strip_sgid() helper for the
> > grpid mount option as well. Iow, just don't give the grpid mount option
> > a separate setgid treatment and try it with the current approach.
> >
> > It'll allow us to move things into vfs proper which I think is a robust
> > solution with clear semantics. It also gives us a uniform ordering wrt
> > to umask stripping and POSIX ACLs.
> >
> > Yes, as we've pointed out in the thread this carries a non-zero
> > regression risk. But so does the whole patch series. But this might end
> > up being a big win security wise and makes maintenance way easier going
> > forward.
> >
> > The current setgid situation is thoroughly messy though and we keep
> > plugging holes. Even writing tests for the current situation is an
> > almost herculean task let alone reviewing it.
> 
> Sorry for the late reply.
> I am agree with these. So what should I do in next step?

I'd say we try to go forward with the original approach and not
special-casing the grpid option.

> 
> ps:I also think I may send test case to xfstests for posix acl and umask 
> ASAP, then xfstests can merge these test and so more people will notice 
> this problem.

The big refactoring of the idmapped mounts testsuite into a generic
vfstest refactoring is sitting in for-next of xfstests so make sure to
base your patches on that because you really won't enjoy having to
rebase them...

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-05-19  9:14 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-26 11:11 [PATCH v8 1/4] fs: add mode_strip_sgid() helper Yang Xu
2022-04-26 11:11 ` [PATCH v8 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
2022-04-26 11:11 ` [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
2022-04-26 10:38   ` Christian Brauner
2022-04-26 11:20     ` Amir Goldstein
2022-04-26 11:52       ` Miklos Szeredi
2022-04-26 14:53         ` Christian Brauner
2022-04-27  9:22           ` Christian Brauner
2022-04-28  4:45             ` Al Viro
2022-04-28  8:07               ` Christian Brauner
2022-04-26 15:52   ` Christian Brauner
2022-04-27  1:21     ` xuyang2018.jy
2022-04-26 11:11 ` [PATCH v8 4/4] ceph: rely on vfs for setgid stripping Yang Xu
2022-04-26 14:52 ` [PATCH v8 1/4] fs: add mode_strip_sgid() helper Jeff Layton
2022-04-27  1:34   ` xuyang2018.jy
2022-04-28  1:59 ` Al Viro
2022-04-28  2:15   ` Al Viro
2022-04-28  2:23     ` xuyang2018.jy
2022-04-28  2:49       ` Al Viro
2022-04-28  3:12         ` Al Viro
2022-04-28  3:46           ` Al Viro
2022-04-28  9:34             ` Christian Brauner
2022-05-19  1:03               ` xuyang2018.jy
2022-05-19  9:14                 ` Christian Brauner
2022-04-28  8:06           ` Jann Horn
2022-04-28  8:44         ` Christian Brauner
2022-04-28 11:55           ` Christian Brauner
2022-04-28  8:25   ` Christian Brauner
2022-04-28  4:40 ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.