Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs

From: Christian Brauner <brauner@kernel.org>
To: Yang Xu <xuyang2018.jy@fujitsu.com>,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jeff Layton <jlayton@kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	ceph-devel <ceph-devel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH v8 3/4] fs: move S_ISGID stripping into the vfs
Date: Wed, 27 Apr 2022 11:22:01 +0200	[thread overview]
Message-ID: <20220427092201.wvsdjbnc7b4dttaw@wittgenstein> (raw)
In-Reply-To: <20220426145349.zxmahoq2app2lhip@wittgenstein>

On Tue, Apr 26, 2022 at 04:53:49PM +0200, Christian Brauner wrote:
> On Tue, Apr 26, 2022 at 01:52:11PM +0200, Miklos Szeredi wrote:
> > On Tue, 26 Apr 2022 at 13:21, Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Apr 26, 2022 at 1:38 PM Christian Brauner <brauner@kernel.org> wrote:
> > 
> > > > One thing that I just remembered and which I think I haven't mentioned
> > > > so far is that moving S_ISGID stripping from filesystem callpaths into
> > > > the vfs callpaths means that we're hoisting this logic out of vfs_*()
> > > > helpers implicitly.
> > > >
> > > > So filesystems that call vfs_*() helpers directly can't rely on S_ISGID
> > > > stripping being done in vfs_*() helpers anymore unless they pass the
> > > > mode on from a prior run through the vfs.
> > > >
> > > > This mostly affects overlayfs which calls vfs_*() functions directly. So
> > > > a typical overlayfs callstack would be (roughly - I'm omw to lunch):
> > > >
> > > > sys_mknod()
> > > > -> do_mknodat(mode) // calls vfs_prepare_mode()
> > > >    -> .mknod = ovl_mknod(mode)
> > > >       -> ovl_create(mode)
> > > >          -> vfs_mknod(mode)
> > > >
> > > > I think we are safe as overlayfs passes on the mode on from its own run
> > > > through the vfs and then via vfs_*() to the underlying filesystem but it
> > > > is worth point that out.
> > > >
> > > > Ccing Amir just for confirmation.
> > >
> > > Looks fine to me, but CC Miklos ...
> > 
> > Looks fine to me as well.  Overlayfs should share the mode (including
> > the suid and sgid bits), owner, group and ACL's with the underlying
> > filesystem, so clearing sgid based on overlay parent directory should
> > result in the same mode as if it was done based on the parent
> > directory on the underlying layer.
> 
> Ah yes, good point.
> 
> > 
> > AFAIU this logic is not affected by userns or mnt_userns, but
> > Christian would be best to confirm that.
> 
> It does depend on it as S_ISGID stripping requires knowledge about
> whether the caller has CAP_FSETID and is capable over the parent
> directory or if they are in the group the file is owned by.
> 
> I think ultimately it might just come down to moving vfs_prepare_mode()
> into vfs_*() helpers and not into the do_*at() helpers.
> 
> That would be cleaner anyway as right now we have this weird disconnect
> between vfs_tmpfile() and vfs_{create,mknod,mkdir}(). IOW, vfs_tmpfile()
> doesn't even have an associated do_*() wrapper where we could call
> vfs_prepare_mode() from.
> 
> So ultimately it might be nicer if we do it in vfs_*() helpers anyway.
> 
> The less pretty thing about it will be that the security_path_*() hooks
> also want a mode.
> 
> Right now these hooks receive the mode as it's passed in from userspace
> minus umask but before S_ISGID stripping happens.
> 
> Whereas I think they should really see what the filesystem sees and
> currently it's a bug that they see something else.
> 
> I need to think about this a bit.

So on top of that series (though it should just be folded in), does that
look reasonable?

From e993f81caae60fee4f77b40d46ad3863ea383493 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Wed, 27 Apr 2022 10:53:35 +0200
Subject: [PATCH] UNTESTED UNTESTED UNTESTED

As I realized yesterday we need to move mode preparation into the vfs_*()
instead of do_*() helpers as filesystems like overlayfs have the following
callstacks:

sys_mknod(ovl_path, mode)
-> do_mknodat(ovl_path, mode)
   -> .mknod = ovl_mknod(ovl_path, mode)
      -> vfs_mknod(xfs_path, mode)
	 -> .mknod = xfs_vn_mknod(xfs_path, mode)

and the requirement that this will yield the same mode as:

sys_mknod(xfs_path, mode)
-> do_mknodat(xfs_path, mode)
   -> .mknod = xfs_vn_mknod(xfs_path, mode)

By moving setgid stripping into vfs_*() helpers we achieve:

- Moving setgid stripping out of the individual filesystem's responsibility.
- Ensure that callers of vfs_*() helpers continue to get correct setgid
  stripping.

Another thing I realized while looking at this yesterday was the entanglement
with security hooks. Security hooks currently see a different mode than the
actual filesystem sees when it calls into inode_init_owner(). This patch
doesn't change that!

I originally thought that we might be able to make the security hooks see the
same mode that the filesystem will see. However, I have doubts. First, I don't
think that is achievable without more restructuring. Second, I don't think it's
required as the hooks have clearly been placed before any vfs_*() calls and
thereby have committed themselves to see the mode as passed in from userspace
(minus the umask). We will simply continue doing just exactly that
side-stepping the issue for now.

Sketched-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---
 fs/namei.c         | 103 +++++++++++++++++++++++++++++++++++++++------
 include/linux/fs.h |  11 -----
 2 files changed, 90 insertions(+), 24 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5dbf00704ae8..8b83db15ae5f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2998,6 +2998,71 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(unlock_rename);
 
+/**
+ * mode_strip_umask - handle vfs umask stripping
+ * @dir:	parent directory of the new inode
+ * @mode:	mode of the new inode to be created in @dir
+ *
+ * Umask stripping depends on whether or not the filesystem supports POSIX
+ * ACLs. If the filesystem doesn't support it umask stripping is done directly
+ * in here. If the filesystem does support POSIX ACLs umask stripping is
+ * deferred until the filesystem calls posix_acl_create().
+ *
+ * Returns: mode
+ */
+static inline umode_t mode_strip_umask(const struct inode *dir, umode_t mode)
+{
+	if (!IS_POSIXACL(dir))
+		mode &= ~current_umask();
+	return mode;
+}
+
+/**
+ * vfs_prepare_mode - prepare the mode to be used for a new inode
+ * @mnt_userns:	user namespace of the mount the inode was found from
+ * @dir:	parent directory of the new inode
+ * @mode:	mode of the new inode
+ * @mask_perms:	allowed permission by the vfs
+ * @type:	type of file to be created
+ *
+ * This helper consolidates and enforces vfs restrictions on the @mode of a new
+ * object to be created.
+ *
+ * Umask stripping depends on whether the filesystem supports POSIX ACLs (see
+ * the kernel documentation for mode_strip_umask()). Moving umask stripping
+ * after setgid stripping allows the same ordering for both non-POSIX ACL and
+ * POSIX ACL supporting filesystems.
+ *
+ * Note that it's currently valid for @type to be 0 if a directory is created.
+ * Filesystems raise that flag individually and we need to check whether each
+ * filesystem can deal with receiving S_IFDIR from the vfs before we enforce a
+ * non-zero type.
+ *
+ * Returns: mode to be passed to the filesystem
+ */
+static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
+				       const struct inode *dir, umode_t mode,
+				       umode_t mask_perms, umode_t type)
+{
+	/*
+	 * S_ISGID stripping depends on the mode of the new file so make sure
+	 * that the caller gives us this information and splat if we miss it.
+	 */
+	WARN_ON_ONCE((mode & S_IFMT) == 0);
+
+	mode = mode_strip_sgid(mnt_userns, dir, mode);
+	mode = mode_strip_umask(dir, mode);
+
+	/*
+	 * Apply the vfs mandated allowed permission mask and set the type of
+	 * file to be created before we call into the filesystem.
+	 */
+	mode &= (mask_perms & ~S_IFMT);
+	mode |= (type & S_IFMT);
+
+	return mode;
+}
+
 /**
  * vfs_create - create new file
  * @mnt_userns:	user namespace of the mount the inode was found from
@@ -3023,8 +3088,9 @@ int vfs_create(struct user_namespace *mnt_userns, struct inode *dir,
 
 	if (!dir->i_op->create)
 		return -EACCES;	/* shouldn't it be ENOSYS? */
-	mode &= S_IALLUGO;
-	mode |= S_IFREG;
+
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry), mode,
+				S_IALLUGO, S_IFREG);
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
@@ -3287,7 +3353,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode);
+		mode = vfs_prepare_mode(mnt_userns, dir->d_inode, mode, mode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(mnt_userns, &nd->path,
 						    dentry, mode);
@@ -3520,7 +3586,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	mode = vfs_prepare_mode(mnt_userns, dir, mode);
+	mode = vfs_prepare_mode(mnt_userns, dir, mode, mode, mode);
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3798,6 +3864,8 @@ int vfs_mknod(struct user_namespace *mnt_userns, struct inode *dir,
 	if (!dir->i_op->mknod)
 		return -EPERM;
 
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry),
+				mode, mode, mode);
 	error = devcgroup_inode_mknod(mode, dev);
 	if (error)
 		return error;
@@ -3848,12 +3916,13 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	if (IS_ERR(dentry))
 		goto out1;
 
-	mnt_userns = mnt_user_ns(path.mnt);
-	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
-	error = security_path_mknod(&path, dentry, mode, dev);
+	error = security_path_mknod(&path, dentry,
+				    mode_strip_umask(d_inode(path.dentry), mode),
+				    dev);
 	if (error)
 		goto out2;
 
+	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
@@ -3919,7 +3988,13 @@ int vfs_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	if (!dir->i_op->mkdir)
 		return -EPERM;
 
-	mode &= (S_IRWXUGO|S_ISVTX);
+	/*
+	 * Filesystems currently raise S_IFDIR individually. We should try and
+	 * fix that going forward passing it in from the vfs as we do for all
+	 * other files going forward.
+	 */
+	mode = vfs_prepare_mode(mnt_userns, d_inode(path.dentry),
+				mode, S_IRWXUGO | S_ISVTX, 0);
 	error = security_inode_mkdir(dir, dentry, mode);
 	if (error)
 		return error;
@@ -3940,7 +4015,6 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
-	struct user_namespace *mnt_userns;
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
@@ -3948,12 +4022,15 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	mnt_userns = mnt_user_ns(path.mnt);
-	mode = vfs_prepare_mode(mnt_userns, path.dentry->d_inode, mode);
-	error = security_path_mkdir(&path, dentry, mode);
-	if (!error)
+	error = security_path_mkdir(&path, dentry,
+				    mode_strip_umask(d_inode(path.dentry), mode));
+	if (!error) {
+		struct user_namespace *mnt_userns;
+
+		mnt_userns = mnt_user_ns(path.mnt);
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
+	}
 
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 914c8f28bb02..98b44a2732f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3459,17 +3459,6 @@ static inline bool dir_relax_shared(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
-static inline umode_t vfs_prepare_mode(struct user_namespace *mnt_userns,
-				   const struct inode *dir, umode_t mode)
-{
-	mode = mode_strip_sgid(mnt_userns, dir, mode);
-
-	if (!IS_POSIXACL(dir))
-		mode &= ~current_umask();
-
-	return mode;
-}
-
 extern bool path_noexec(const struct path *path);
 extern void inode_nohighmem(struct inode *inode);
 
-- 
2.32.0