All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v9 1/4] fs: add mode_strip_sgid() helper
@ 2022-05-20 16:10 Yang Xu
  2022-05-20 16:10 ` [PATCH v9 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Yang Xu @ 2022-05-20 16:10 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Add a dedicated helper to handle the setgid bit when creating a new file
in a setgid directory. This is a preparatory patch for moving setgid
stripping into the vfs. The patch contains no functional changes.

Currently the setgid stripping logic is open-coded directly in
inode_init_owner() and the individual filesystems are responsible for
handling setgid inheritance. Since this has proven to be brittle as
evidenced by old issues we uncovered over the last months (see [1] to
[3] below) we will try to move this logic into the vfs.

Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
v8->v9: 
1.move if orders in mode_strip_sgid helper
2.xfstests testcase url
https://patchwork.kernel.org/project/fstests/list/?series=643643
 fs/inode.c         | 37 +++++++++++++++++++++++++++++++++----
 include/linux/fs.h |  2 ++
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..37bd85981d38 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(i_gid_into_mnt(mnt_userns, dir)) &&
-			 !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
-			mode &= ~S_ISGID;
+		else
+			mode = mode_strip_sgid(mnt_userns, dir, mode);
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
@@ -2405,3 +2403,34 @@ struct timespec64 current_time(struct inode *inode)
 	return timestamp_truncate(now, inode);
 }
 EXPORT_SYMBOL(current_time);
+
+/**
+ * mode_strip_sgid - handle the sgid bit for non-directories
+ * @mnt_userns: User namespace of the mount the inode was created from
+ * @dir: parent directory inode
+ * @mode: mode of the file to be created in @dir
+ *
+ * If the @mode of the new file has both the S_ISGID and S_IXGRP bit
+ * raised and @dir has the S_ISGID bit raised ensure that the caller is
+ * either in the group of the parent directory or they have CAP_FSETID
+ * in their user namespace and are privileged over the parent directory.
+ * In all other cases, strip the S_ISGID bit from @mode.
+ *
+ * Return: the new mode to use for the file
+ */
+umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
+			const struct inode *dir, umode_t mode)
+{
+	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
+		return mode;
+	if (S_ISDIR(mode) || !dir || !(dir->i_mode & S_ISGID))
+		return mode;
+	if (in_group_p(i_gid_into_mnt(mnt_userns, dir)))
+		return mode;
+	if (capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
+		return mode;
+
+	mode &= ~S_ISGID;
+	return mode;
+}
+EXPORT_SYMBOL(mode_strip_sgid);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..3bccb7ba44a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
 void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		      const struct inode *dir, umode_t mode);
 extern bool may_open_dev(const struct path *path);
+umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
+			const struct inode *dir, umode_t mode);
 
 /*
  * This is the "filldir" function type, used by readdir() to let
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v9 2/4] fs: Add missing umask strip in vfs_tmpfile
  2022-05-20 16:10 [PATCH v9 1/4] fs: add mode_strip_sgid() helper Yang Xu
@ 2022-05-20 16:10 ` Yang Xu
  2022-05-20 16:10 ` [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
  2022-05-20 16:10 ` [PATCH v9 4/4] ceph: rely on vfs for setgid stripping Yang Xu
  2 siblings, 0 replies; 6+ messages in thread
From: Yang Xu @ 2022-05-20 16:10 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu, stable

All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.

Fixes: 60545d0d4610 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/namei.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 509657fdf4f5..73646e28fae0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3521,6 +3521,8 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs
  2022-05-20 16:10 [PATCH v9 1/4] fs: add mode_strip_sgid() helper Yang Xu
  2022-05-20 16:10 ` [PATCH v9 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
@ 2022-05-20 16:10 ` Yang Xu
  2022-06-10 12:53   ` Christian Brauner
  2022-05-20 16:10 ` [PATCH v9 4/4] ceph: rely on vfs for setgid stripping Yang Xu
  2 siblings, 1 reply; 6+ messages in thread
From: Yang Xu @ 2022-05-20 16:10 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current state of things:

* The S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

* The first two points taken together mean that there's a
  non-standardized ordering between setgid stripping in
  inode_init_owner() and posix_acl_create() both on the vfs level and
  the filesystem level. The latter part is especially problematic since
  each filesystem is technically free to order inode_init_owner() and
  posix_acl_create() however it sees fit meaning that S_ISGID
  inheritance might or might not be applied.

* We do still have bugs in this areas years after the initial round of
  setgid bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

The S_ISGID bit is stripped before any umask is applied. This has the
advantage that the ordering is unaffected by whether umask stripping is
done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
the filesystem in posix_acl_create() (if POSIX ACLs are supported).

To this end a new helper vfs_prepare_mode() is added which calls the
previously added mode_strip_setgid() helper and strips the umask
afterwards.

All inode operations that create new filesystem objects have been
updated to call vfs_prepare_mode() before passing the mode into the
relevant inode operation of the filesystems. Care has been taken to
ensure that the mode passed to the security hooks is the mode that is
seen by the filesystem.

Moving S_ISGID stripping from filesystem callpaths into the vfs callpaths
means thatfilesystems that call vfs_*() helpers directly can't rely on
S_ISGID stripping being done in vfs_*() helpers anymore unless they pass the
mode on from a prior run through the vfs.

This mostly affects overlayfs which calls vfs_*() functions directly. So
a typical overlayfs callstack would be
sys_mknod()
-> do_mknodat(mode) // calls vfs_prepare_mode()
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

But it is safe as overlayfs passes on the mode on from its own run
through the vfs and then via vfs_*() to the underlying filesystem.

Following is an overview of the filesystem specific and inode operations
specific implications:

arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the following ->mkdir(),
->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() inode operation trivially inherit the mode from
the target and the ->rename() inode operation inherits the mode from the
source inode.

All other inode operations will have the S_ISGID bit stripped once in
vfs_prepare_mode() before.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs.

So we did our best to audit all filesystems in this regard:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_inode() but only creates directories and doesn't raise
  S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain tbh.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

This patch also changed grpid behaviour for ext4/xfs because the mode
passed to them may have been changed by vfs_prepare_mode.

While we did our best to audit everything there's a risk of regressions
in here. However, for the sake of maintenance and given that we've seen
a range of bugs years after S_ISGID inheritance issues were fixed (see
[1]-[3]) the risk seems worth taking. In the worst case we will have to
revert.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
v8->v9:
1.move vfs_prepare_mode info fs/namei.c
2. add grpid behaviour change in commit message
3. also mention the overflay in commit meessage because it will use call vfs_mknod directly
 fs/inode.c       |  2 --
 fs/namei.c       | 33 ++++++++++++++++++++-------------
 fs/ocfs2/namei.c |  1 +
 3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 37bd85981d38..42ecaf79aaaf 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else
-			mode = mode_strip_sgid(mnt_userns, dir, mode);
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
diff --git a/fs/namei.c b/fs/namei.c
index 73646e28fae0..8b60914861f5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2998,6 +2998,17 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(unlock_rename);
 
+static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
+				       const struct inode *dir, umode_t mode)
+{
+	mode = mode_strip_sgid(mnt_userns, dir, mode);
+
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
+
+	return mode;
+}
+
 /**
  * vfs_create - create new file
  * @mnt_userns:	user namespace of the mount the inode was found from
@@ -3287,8 +3298,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		if (!IS_POSIXACL(dir->d_inode))
-			mode &= ~current_umask();
+		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(mnt_userns, &nd->path,
 						    dentry, mode);
@@ -3521,8 +3531,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	if (!IS_POSIXACL(dir))
-		mode &= ~current_umask();
+	mode = vfs_prepare_mode(mnt_userns, dir, mode);
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3850,13 +3859,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	if (IS_ERR(dentry))
 		goto out1;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
 		goto out2;
 
-	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
@@ -3943,6 +3951,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct user_namespace *mnt_userns;
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
@@ -3950,15 +3959,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mkdir(&path, dentry, mode);
-	if (!error) {
-		struct user_namespace *mnt_userns;
-		mnt_userns = mnt_user_ns(path.mnt);
+	if (!error)
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
-	}
+
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c75fd54b9185..961d1cf54388 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
 	 * callers. */
 	if (S_ISDIR(mode))
 		set_nlink(inode, 2);
+	mode = mode_strip_sgid(&init_user_ns, dir, mode);
 	inode_init_owner(&init_user_ns, inode, dir, mode);
 	status = dquot_initialize(inode);
 	if (status)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v9 4/4] ceph: rely on vfs for setgid stripping
  2022-05-20 16:10 [PATCH v9 1/4] fs: add mode_strip_sgid() helper Yang Xu
  2022-05-20 16:10 ` [PATCH v9 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
  2022-05-20 16:10 ` [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
@ 2022-05-20 16:10 ` Yang Xu
  2 siblings, 0 replies; 6+ messages in thread
From: Yang Xu @ 2022-05-20 16:10 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel
  Cc: viro, david, djwong, brauner, willy, jlayton, Yang Xu

Now that we finished moving setgid stripping for regular files in setgid
directories into the vfs, individual filesystem don't need to manually
strip the setgid bit anymore. Drop the now unneeded code from ceph.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft)<brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
 fs/ceph/file.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8c8226c0feac..257ec53b1765 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -657,10 +657,6 @@ static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
 		/* Directories always inherit the setgid bit. */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(dir->i_gid) &&
-			 !capable_wrt_inode_uidgid(&init_user_ns, dir, CAP_FSETID))
-			mode &= ~S_ISGID;
 	} else {
 		in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
 	}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs
  2022-05-20 16:10 ` [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
@ 2022-06-10 12:53   ` Christian Brauner
  2022-07-11  9:58     ` xuyang2018.jy
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Brauner @ 2022-06-10 12:53 UTC (permalink / raw)
  To: Yang Xu; +Cc: linux-fsdevel, ceph-devel, viro, david, djwong, willy, jlayton

On Sat, May 21, 2022 at 12:10:36AM +0800, Yang Xu wrote:
> Creating files that have both the S_IXGRP and S_ISGID bit raised in
> directories that themselves have the S_ISGID bit set requires additional
> privileges to avoid security issues.
> 
> When a filesystem creates a new inode it needs to take care that the
> caller is either in the group of the newly created inode or they have
> CAP_FSETID in their current user namespace and are privileged over the
> parent directory of the new inode. If any of these two conditions is
> true then the S_ISGID bit can be raised for an S_IXGRP file and if not
> it needs to be stripped.
> 
> However, there are several key issues with the current state of things:
> 
> * The S_ISGID stripping logic is entangled with umask stripping.
> 
>   If a filesystem doesn't support or enable POSIX ACLs then umask
>   stripping is done directly in the vfs before calling into the
>   filesystem.
>   If the filesystem does support POSIX ACLs then unmask stripping may be
>   done in the filesystem itself when calling posix_acl_create().
> 
> * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
>   stripping logic.
> 
>   While that may be intentional (e.g. network filesystems might just
>   defer setgid stripping to a server) it is often just a security issue.
> 
> * The first two points taken together mean that there's a
>   non-standardized ordering between setgid stripping in
>   inode_init_owner() and posix_acl_create() both on the vfs level and
>   the filesystem level. The latter part is especially problematic since
>   each filesystem is technically free to order inode_init_owner() and
>   posix_acl_create() however it sees fit meaning that S_ISGID
>   inheritance might or might not be applied.
> 
> * We do still have bugs in this areas years after the initial round of
>   setgid bugfixes.
> 
> So the current state is quite messy and while we won't be able to make
> it completely clean as posix_acl_create() is still a filesystem specific
> call we can improve the S_SIGD stripping situation quite a bit by
> hoisting it out of inode_init_owner() and into the vfs creation
> operations. This means we alleviate the burden for filesystems to handle
> S_ISGID stripping correctly and can standardize the ordering between
> S_ISGID and umask stripping in the vfs.
> 
> The S_ISGID bit is stripped before any umask is applied. This has the
> advantage that the ordering is unaffected by whether umask stripping is
> done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
> the filesystem in posix_acl_create() (if POSIX ACLs are supported).
> 
> To this end a new helper vfs_prepare_mode() is added which calls the
> previously added mode_strip_setgid() helper and strips the umask
> afterwards.
> 
> All inode operations that create new filesystem objects have been
> updated to call vfs_prepare_mode() before passing the mode into the
> relevant inode operation of the filesystems. Care has been taken to
> ensure that the mode passed to the security hooks is the mode that is
> seen by the filesystem.
> 
> Moving S_ISGID stripping from filesystem callpaths into the vfs callpaths
> means thatfilesystems that call vfs_*() helpers directly can't rely on
> S_ISGID stripping being done in vfs_*() helpers anymore unless they pass the
> mode on from a prior run through the vfs.
> 
> This mostly affects overlayfs which calls vfs_*() functions directly. So
> a typical overlayfs callstack would be
> sys_mknod()
> -> do_mknodat(mode) // calls vfs_prepare_mode()
>    -> .mknod = ovl_mknod(mode)
>       -> ovl_create(mode)
>          -> vfs_mknod(mode)
> 
> But it is safe as overlayfs passes on the mode on from its own run
> through the vfs and then via vfs_*() to the underlying filesystem.
> 
> Following is an overview of the filesystem specific and inode operations
> specific implications:
> 
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> 
> All of the above filesystems end up calling inode_init_owner() when new
> filesystem objects are created through the following ->mkdir(),
> ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
> operations.
> 
> Since directories always inherit the S_ISGID bit with the exception of
> xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
> apply. The ->symlink() inode operation trivially inherit the mode from
> the target and the ->rename() inode operation inherits the mode from the
> source inode.
> 
> All other inode operations will have the S_ISGID bit stripped once in
> vfs_prepare_mode() before.
> 
> In addition to this there are filesystems which allow the creation of
> filesystem objects through ioctl()s or - in the case of spufs -
> circumventing the vfs in other ways. If filesystem objects are created
> through ioctl()s the vfs doesn't know about it and can't apply regular
> permission checking including S_ISGID logic. Therfore, a filesystem
> relying on S_ISGID stripping in inode_init_owner() in their ioctl()
> callpath will be affected by moving this logic into the vfs.
> 
> So we did our best to audit all filesystems in this regard:
> 
> * btrfs allows the creation of filesystem objects through various
>   ioctls(). Snapshot creation literally takes a snapshot and so the mode
>   is fully preserved and S_ISGID stripping doesn't apply.
> 
>   Creating a new subvolum relies on inode_init_owner() in
>   btrfs_new_inode() but only creates directories and doesn't raise
>   S_ISGID.
> 
> * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
>   xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
>   the actual extents ocfs2 uses a separate ioctl() that also creates the
>   target file.
> 
>   Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
>   inode_init_owner() to strip the S_ISGID bit. This is the only place
>   where a filesystem needs to call mode_strip_sgid() directly but this
>   is self-inflicted pain tbh.
> 
> * spufs doesn't go through the vfs at all and doesn't use ioctl()s
>   either. Instead it has a dedicated system call spufs_create() which
>   allows the creation of filesystem objects. But spufs only creates
>   directories and doesn't allo S_SIGID bits, i.e. it specifically only
>   allows 0777 bits.
> 
> * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
> 
> This patch also changed grpid behaviour for ext4/xfs because the mode
> passed to them may have been changed by vfs_prepare_mode.
> 
> While we did our best to audit everything there's a risk of regressions
> in here. However, for the sake of maintenance and given that we've seen
> a range of bugs years after S_ISGID inheritance issues were fixed (see
> [1]-[3]) the risk seems worth taking. In the worst case we will have to
> revert.
> 
> Associated with this change is a new set of fstests to enforce the
> semantics for all new filesystems.
> 
> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
> v8->v9:
> 1.move vfs_prepare_mode info fs/namei.c
> 2. add grpid behaviour change in commit message
> 3. also mention the overflay in commit meessage because it will use call vfs_mknod directly
>  fs/inode.c       |  2 --
>  fs/namei.c       | 33 ++++++++++++++++++++-------------
>  fs/ocfs2/namei.c |  1 +
>  3 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 37bd85981d38..42ecaf79aaaf 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else
> -			mode = mode_strip_sgid(mnt_userns, dir, mode);
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> diff --git a/fs/namei.c b/fs/namei.c
> index 73646e28fae0..8b60914861f5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2998,6 +2998,17 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
>  }
>  EXPORT_SYMBOL(unlock_rename);
>  
> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
> +				       const struct inode *dir, umode_t mode)
> +{
> +	mode = mode_strip_sgid(mnt_userns, dir, mode);
> +
> +	if (!IS_POSIXACL(dir))
> +		mode &= ~current_umask();
> +
> +	return mode;
> +}
> +
>  /**
>   * vfs_create - create new file
>   * @mnt_userns:	user namespace of the mount the inode was found from
> @@ -3287,8 +3298,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (open_flag & O_CREAT) {
>  		if (open_flag & O_EXCL)
>  			open_flag &= ~O_TRUNC;
> -		if (!IS_POSIXACL(dir->d_inode))
> -			mode &= ~current_umask();
> +		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
>  		if (likely(got_write))
>  			create_error = may_o_create(mnt_userns, &nd->path,
>  						    dentry, mode);
> @@ -3521,8 +3531,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>  	child = d_alloc(dentry, &slash_name);
>  	if (unlikely(!child))
>  		goto out_err;
> -	if (!IS_POSIXACL(dir))
> -		mode &= ~current_umask();
> +	mode = vfs_prepare_mode(mnt_userns, dir, mode);
>  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>  	if (error)
>  		goto out_err;
> @@ -3850,13 +3859,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>  	if (IS_ERR(dentry))
>  		goto out1;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);

This needs to move into vfs_mknod() itself to allow overlayfs to
continue to rely on setgid stripping.
See the relevant part in the _draft_ patch in [1].

>  	error = security_path_mknod(&path, dentry, mode, dev);
>  	if (error)
>  		goto out2;
>  
> -	mnt_userns = mnt_user_ns(path.mnt);
>  	switch (mode & S_IFMT) {
>  		case 0: case S_IFREG:
>  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> @@ -3943,6 +3951,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +	struct user_namespace *mnt_userns;
>  
>  retry:
>  	dentry = filename_create(dfd, name, &path, lookup_flags);
> @@ -3950,15 +3959,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	if (IS_ERR(dentry))
>  		goto out_putname;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);

This needs to move into vfs_mkdir() itself to allow overlayfs to
continue to rely on setgid stripping.
See the relevant part in the _draft_ patch in [1].

[1]: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs
  2022-06-10 12:53   ` Christian Brauner
@ 2022-07-11  9:58     ` xuyang2018.jy
  0 siblings, 0 replies; 6+ messages in thread
From: xuyang2018.jy @ 2022-07-11  9:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, ceph-devel, viro, david, djwong, willy, jlayton

on 2022/06/10 20:53, Christian Brauner wrote:
> On Sat, May 21, 2022 at 12:10:36AM +0800, Yang Xu wrote:
>> Creating files that have both the S_IXGRP and S_ISGID bit raised in
>> directories that themselves have the S_ISGID bit set requires additional
>> privileges to avoid security issues.
>>
>> When a filesystem creates a new inode it needs to take care that the
>> caller is either in the group of the newly created inode or they have
>> CAP_FSETID in their current user namespace and are privileged over the
>> parent directory of the new inode. If any of these two conditions is
>> true then the S_ISGID bit can be raised for an S_IXGRP file and if not
>> it needs to be stripped.
>>
>> However, there are several key issues with the current state of things:
>>
>> * The S_ISGID stripping logic is entangled with umask stripping.
>>
>>    If a filesystem doesn't support or enable POSIX ACLs then umask
>>    stripping is done directly in the vfs before calling into the
>>    filesystem.
>>    If the filesystem does support POSIX ACLs then unmask stripping may be
>>    done in the filesystem itself when calling posix_acl_create().
>>
>> * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
>>    stripping logic.
>>
>>    While that may be intentional (e.g. network filesystems might just
>>    defer setgid stripping to a server) it is often just a security issue.
>>
>> * The first two points taken together mean that there's a
>>    non-standardized ordering between setgid stripping in
>>    inode_init_owner() and posix_acl_create() both on the vfs level and
>>    the filesystem level. The latter part is especially problematic since
>>    each filesystem is technically free to order inode_init_owner() and
>>    posix_acl_create() however it sees fit meaning that S_ISGID
>>    inheritance might or might not be applied.
>>
>> * We do still have bugs in this areas years after the initial round of
>>    setgid bugfixes.
>>
>> So the current state is quite messy and while we won't be able to make
>> it completely clean as posix_acl_create() is still a filesystem specific
>> call we can improve the S_SIGD stripping situation quite a bit by
>> hoisting it out of inode_init_owner() and into the vfs creation
>> operations. This means we alleviate the burden for filesystems to handle
>> S_ISGID stripping correctly and can standardize the ordering between
>> S_ISGID and umask stripping in the vfs.
>>
>> The S_ISGID bit is stripped before any umask is applied. This has the
>> advantage that the ordering is unaffected by whether umask stripping is
>> done by the vfs itself (if no POSIX ACLs are supported or enabled) or in
>> the filesystem in posix_acl_create() (if POSIX ACLs are supported).
>>
>> To this end a new helper vfs_prepare_mode() is added which calls the
>> previously added mode_strip_setgid() helper and strips the umask
>> afterwards.
>>
>> All inode operations that create new filesystem objects have been
>> updated to call vfs_prepare_mode() before passing the mode into the
>> relevant inode operation of the filesystems. Care has been taken to
>> ensure that the mode passed to the security hooks is the mode that is
>> seen by the filesystem.
>>
>> Moving S_ISGID stripping from filesystem callpaths into the vfs callpaths
>> means thatfilesystems that call vfs_*() helpers directly can't rely on
>> S_ISGID stripping being done in vfs_*() helpers anymore unless they pass the
>> mode on from a prior run through the vfs.
>>
>> This mostly affects overlayfs which calls vfs_*() functions directly. So
>> a typical overlayfs callstack would be
>> sys_mknod()
>> -> do_mknodat(mode) // calls vfs_prepare_mode()
>>     -> .mknod = ovl_mknod(mode)
>>        -> ovl_create(mode)
>>           -> vfs_mknod(mode)
>>
>> But it is safe as overlayfs passes on the mode on from its own run
>> through the vfs and then via vfs_*() to the underlying filesystem.
>>
>> Following is an overview of the filesystem specific and inode operations
>> specific implications:
>>
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
>> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
>> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
>> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
>>
>> All of the above filesystems end up calling inode_init_owner() when new
>> filesystem objects are created through the following ->mkdir(),
>> ->symlink(), ->mknod(), ->create(), ->tmpfile(), ->rename() inode
>> operations.
>>
>> Since directories always inherit the S_ISGID bit with the exception of
>> xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
>> apply. The ->symlink() inode operation trivially inherit the mode from
>> the target and the ->rename() inode operation inherits the mode from the
>> source inode.
>>
>> All other inode operations will have the S_ISGID bit stripped once in
>> vfs_prepare_mode() before.
>>
>> In addition to this there are filesystems which allow the creation of
>> filesystem objects through ioctl()s or - in the case of spufs -
>> circumventing the vfs in other ways. If filesystem objects are created
>> through ioctl()s the vfs doesn't know about it and can't apply regular
>> permission checking including S_ISGID logic. Therfore, a filesystem
>> relying on S_ISGID stripping in inode_init_owner() in their ioctl()
>> callpath will be affected by moving this logic into the vfs.
>>
>> So we did our best to audit all filesystems in this regard:
>>
>> * btrfs allows the creation of filesystem objects through various
>>    ioctls(). Snapshot creation literally takes a snapshot and so the mode
>>    is fully preserved and S_ISGID stripping doesn't apply.
>>
>>    Creating a new subvolum relies on inode_init_owner() in
>>    btrfs_new_inode() but only creates directories and doesn't raise
>>    S_ISGID.
>>
>> * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
>>    xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
>>    the actual extents ocfs2 uses a separate ioctl() that also creates the
>>    target file.
>>
>>    Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
>>    inode_init_owner() to strip the S_ISGID bit. This is the only place
>>    where a filesystem needs to call mode_strip_sgid() directly but this
>>    is self-inflicted pain tbh.
>>
>> * spufs doesn't go through the vfs at all and doesn't use ioctl()s
>>    either. Instead it has a dedicated system call spufs_create() which
>>    allows the creation of filesystem objects. But spufs only creates
>>    directories and doesn't allo S_SIGID bits, i.e. it specifically only
>>    allows 0777 bits.
>>
>> * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
>>
>> This patch also changed grpid behaviour for ext4/xfs because the mode
>> passed to them may have been changed by vfs_prepare_mode.
>>
>> While we did our best to audit everything there's a risk of regressions
>> in here. However, for the sake of maintenance and given that we've seen
>> a range of bugs years after S_ISGID inheritance issues were fixed (see
>> [1]-[3]) the risk seems worth taking. In the worst case we will have to
>> revert.
>>
>> Associated with this change is a new set of fstests to enforce the
>> semantics for all new filesystems.
>>
>> Link: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") [1]
>> Link: 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") [2]
>> Link: fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") [3]
>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
>> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
>> ---
>> v8->v9:
>> 1.move vfs_prepare_mode info fs/namei.c
>> 2. add grpid behaviour change in commit message
>> 3. also mention the overflay in commit meessage because it will use call vfs_mknod directly
>>   fs/inode.c       |  2 --
>>   fs/namei.c       | 33 ++++++++++++++++++++-------------
>>   fs/ocfs2/namei.c |  1 +
>>   3 files changed, 21 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 37bd85981d38..42ecaf79aaaf 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		/* Directories are special, and always inherit S_ISGID */
>>   		if (S_ISDIR(mode))
>>   			mode |= S_ISGID;
>> -		else
>> -			mode = mode_strip_sgid(mnt_userns, dir, mode);
>>   	} else
>>   		inode_fsgid_set(inode, mnt_userns);
>>   	inode->i_mode = mode;
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 73646e28fae0..8b60914861f5 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -2998,6 +2998,17 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
>>   }
>>   EXPORT_SYMBOL(unlock_rename);
>>   
>> +static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
>> +				       const struct inode *dir, umode_t mode)
>> +{
>> +	mode = mode_strip_sgid(mnt_userns, dir, mode);
>> +
>> +	if (!IS_POSIXACL(dir))
>> +		mode &= ~current_umask();
>> +
>> +	return mode;
>> +}
>> +
>>   /**
>>    * vfs_create - create new file
>>    * @mnt_userns:	user namespace of the mount the inode was found from
>> @@ -3287,8 +3298,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>   	if (open_flag & O_CREAT) {
>>   		if (open_flag & O_EXCL)
>>   			open_flag &= ~O_TRUNC;
>> -		if (!IS_POSIXACL(dir->d_inode))
>> -			mode &= ~current_umask();
>> +		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
>>   		if (likely(got_write))
>>   			create_error = may_o_create(mnt_userns, &nd->path,
>>   						    dentry, mode);
>> @@ -3521,8 +3531,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>   	child = d_alloc(dentry, &slash_name);
>>   	if (unlikely(!child))
>>   		goto out_err;
>> -	if (!IS_POSIXACL(dir))
>> -		mode &= ~current_umask();
>> +	mode = vfs_prepare_mode(mnt_userns, dir, mode);
>>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>   	if (error)
>>   		goto out_err;
>> @@ -3850,13 +3859,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>   	if (IS_ERR(dentry))
>>   		goto out1;
>>   
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode &= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> 
> This needs to move into vfs_mknod() itself to allow overlayfs to
> continue to rely on setgid stripping.
> See the relevant part in the _draft_ patch in [1].
> 
>>   	error = security_path_mknod(&path, dentry, mode, dev);
>>   	if (error)
>>   		goto out2;
>>   
>> -	mnt_userns = mnt_user_ns(path.mnt);
>>   	switch (mode & S_IFMT) {
>>   		case 0: case S_IFREG:
>>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
>> @@ -3943,6 +3951,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	struct path path;
>>   	int error;
>>   	unsigned int lookup_flags = LOOKUP_DIRECTORY;
>> +	struct user_namespace *mnt_userns;
>>   
>>   retry:
>>   	dentry = filename_create(dfd, name, &path, lookup_flags);
>> @@ -3950,15 +3959,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	if (IS_ERR(dentry))
>>   		goto out_putname;
>>   
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode &= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> 
> This needs to move into vfs_mkdir() itself to allow overlayfs to
> continue to rely on setgid stripping.
> See the relevant part in the _draft_ patch in [1].
> 
> [1]: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein

Sorry for the late reply. I see your draft patch, but only one thing 
need to ensure,

"WARN_ON_ONCE((mode & S_IFMT) == 0);" seems should be deleted because 
mknod think it is a reguler file when mode is 0 in mknod man-page

Zero file type is equivalent to type S_IFREG

On v10, I will use your vfs_prepare_mode in vfs_tmpfile, vfs_create, 
vfs_mknod, vfs_mkdir, so filesystems does call vfs_* paths can
use setgid strip logic.

Best Regards
Yang Xu

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-07-11 10:53 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-20 16:10 [PATCH v9 1/4] fs: add mode_strip_sgid() helper Yang Xu
2022-05-20 16:10 ` [PATCH v9 2/4] fs: Add missing umask strip in vfs_tmpfile Yang Xu
2022-05-20 16:10 ` [PATCH v9 3/4] fs: move S_ISGID stripping into the vfs Yang Xu
2022-06-10 12:53   ` Christian Brauner
2022-07-11  9:58     ` xuyang2018.jy
2022-05-20 16:10 ` [PATCH v9 4/4] ceph: rely on vfs for setgid stripping Yang Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.