Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 00/10] allow unprivileged overlay mounts
@ 2020-12-07 16:32 Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module,
	linux-kernel, John Johansen, Tetsuo Handa

I've done some more work to verify that unprivileged mount of overlayfs is
safe.

One thing I did is to basically audit all function calls made by overlayfs
to see if it's normally called with any checks and whether overlayfs calls
it with the same (permission and other) checks.

Some of this work has already made it into 5.8 and this series contains
more fixes.

A general observation is that overlayfs does not call security_path_*()
hooks on the underlying fs.  I don't see this as a problem, because a
simple bind mount done inside a private mount namespace also defeats the
path based security checks.  Maybe I'm missing something here, so I'm
interested in comments from AppArmor and Tomoyo developers.

Eric, do you have thought about what to look for with respect to
unprivileged mount safety and whether you think this is ready for upstream?

Git tree:
  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git#ovl-unpriv-v2

Thanks,
Miklos


Miklos Szeredi (10):
  vfs: move cap_convert_nscap() call into vfs_setxattr()
  vfs: verify source area in vfs_dedupe_file_range_one()
  ovl: check privs before decoding file handle
  ovl: make ioctl() safe
  ovl: simplify file splice
  ovl: user xattr
  ovl: do not fail when setting origin xattr
  ovl: do not fail because of O_NOATIME
  ovl: do not get metacopy for userxattr
  ovl: unprivieged mounts

 fs/overlayfs/copy_up.c     |   3 +-
 fs/overlayfs/file.c        | 126 +++----------------------------------
 fs/overlayfs/inode.c       |  10 ++-
 fs/overlayfs/namei.c       |   3 +
 fs/overlayfs/overlayfs.h   |   8 ++-
 fs/overlayfs/ovl_entry.h   |   1 +
 fs/overlayfs/super.c       |  56 +++++++++++++++--
 fs/overlayfs/util.c        |  12 +++-
 fs/remap_range.c           |  10 ++-
 fs/xattr.c                 |  17 +++--
 include/linux/capability.h |   2 +-
 security/commoncap.c       |   3 +-
 12 files changed, 110 insertions(+), 141 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-09  1:53   ` James Morris
  2021-01-01 17:35   ` Eric W. Biederman
  2020-12-07 16:32 ` [PATCH v2 02/10] vfs: verify source area in vfs_dedupe_file_range_one() Miklos Szeredi
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

cap_convert_nscap() does permission checking as well as conversion of the
xattr value conditionally based on fs's user-ns.

This is needed by overlayfs and probably other layered fs (ecryptfs) and is
what vfs_foo() is supposed to do anyway.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/xattr.c                 | 17 +++++++++++------
 include/linux/capability.h |  2 +-
 security/commoncap.c       |  3 +--
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index cd7a563e8bcd..fd57153b1f61 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -276,8 +276,16 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 {
 	struct inode *inode = dentry->d_inode;
 	struct inode *delegated_inode = NULL;
+	const void  *orig_value = value;
 	int error;
 
+	if (size && strcmp(name, XATTR_NAME_CAPS) == 0) {
+		error = cap_convert_nscap(dentry, &value, size);
+		if (error < 0)
+			return error;
+		size = error;
+	}
+
 retry_deleg:
 	inode_lock(inode);
 	error = __vfs_setxattr_locked(dentry, name, value, size, flags,
@@ -289,6 +297,9 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 		if (!error)
 			goto retry_deleg;
 	}
+	if (value != orig_value)
+		kfree(value);
+
 	return error;
 }
 EXPORT_SYMBOL_GPL(vfs_setxattr);
@@ -537,12 +548,6 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
 			posix_acl_fix_xattr_from_user(kvalue, size);
-		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
-			error = cap_convert_nscap(d, &kvalue, size);
-			if (error < 0)
-				goto out;
-			size = error;
-		}
 	}
 
 	error = vfs_setxattr(d, kname, kvalue, size, flags);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 1e7fe311cabe..b2f698915c0f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -270,6 +270,6 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
-extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
+extern int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size);
 
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/security/commoncap.c b/security/commoncap.c
index 59bf3c1674c8..bacc1111d871 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -473,7 +473,7 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
  *
  * If all is ok, we return the new size, on error return < 0.
  */
-int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
+int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size)
 {
 	struct vfs_ns_cap_data *nscap;
 	uid_t nsrootid;
@@ -516,7 +516,6 @@ int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
 	nscap->magic_etc = cpu_to_le32(nsmagic);
 	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
 
-	kvfree(*ivalue);
 	*ivalue = nscap;
 	return newsize;
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 02/10] vfs: verify source area in vfs_dedupe_file_range_one()
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 03/10] ovl: check privs before decoding file handle Miklos Szeredi
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

Call remap_verify_area() on the source file as well as the destination.

When called from vfs_dedupe_file_range() the check as already been
performed, but not so if called from layered fs (overlayfs, etc...)

Could ommit the redundant check in vfs_dedupe_file_range(), but leave for
now to get error early (for fear of breaking backward compatibility).

This call shouldn't be performance sensitive.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/remap_range.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/remap_range.c b/fs/remap_range.c
index e6099beefa97..77dba3a49e65 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -456,8 +456,16 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	if (ret)
 		return ret;
 
+	/*
+	 * This is redundant if called from vfs_dedupe_file_range(), but other
+	 * callers need it and it's not performance sesitive...
+	 */
+	ret = remap_verify_area(src_file, src_pos, len, false);
+	if (ret)
+		goto out_drop_write;
+
 	ret = remap_verify_area(dst_file, dst_pos, len, true);
-	if (ret < 0)
+	if (ret)
 		goto out_drop_write;
 
 	ret = -EPERM;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 03/10] ovl: check privs before decoding file handle
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 02/10] vfs: verify source area in vfs_dedupe_file_range_one() Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-08 13:49   ` Amir Goldstein
  2020-12-07 16:32 ` [PATCH v2 04/10] ovl: make ioctl() safe Miklos Szeredi
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in
ovl_decode_real_fh() as well to prevent privilege escalation for
unprivileged overlay mounts.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/namei.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index a6162c4076db..82a55fdb1e7a 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -156,6 +156,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
 	struct dentry *real;
 	int bytes;
 
+	if (!capable(CAP_DAC_READ_SEARCH))
+		return NULL;
+
 	/*
 	 * Make sure that the stored uuid matches the uuid of the lower
 	 * layer where file handle will be decoded.
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (2 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 03/10] ovl: check privs before decoding file handle Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-08 11:11   ` Amir Goldstein
  2020-12-09  1:57   ` James Morris
  2020-12-07 16:32 ` [PATCH v2 05/10] ovl: simplify file splice Miklos Szeredi
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module,
	linux-kernel, Dmitry Vyukov

ovl_ioctl_set_flags() does a capability check using flags, but then the
real ioctl double-fetches flags and uses potentially different value.

The "Check the capability before cred override" comment misleading: user
can skip this check by presenting benign flags first and then overwriting
them to non-benign flags.

Just remove the cred override for now, hoping this doesn't cause a
regression.

The proper solution is to create a new setxflags i_op (patches are in the
works).

Xfstests don't show a regression.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/file.c | 75 ++-------------------------------------------
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..3cd1590f2030 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -541,46 +541,26 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
 			   unsigned long arg)
 {
 	struct fd real;
-	const struct cred *old_cred;
 	long ret;
 
 	ret = ovl_real_fdget(file, &real);
 	if (ret)
 		return ret;
 
-	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = security_file_ioctl(real.file, cmd, arg);
 	if (!ret)
 		ret = vfs_ioctl(real.file, cmd, arg);
-	revert_creds(old_cred);
 
 	fdput(real);
 
 	return ret;
 }
 
-static unsigned int ovl_iflags_to_fsflags(unsigned int iflags)
-{
-	unsigned int flags = 0;
-
-	if (iflags & S_SYNC)
-		flags |= FS_SYNC_FL;
-	if (iflags & S_APPEND)
-		flags |= FS_APPEND_FL;
-	if (iflags & S_IMMUTABLE)
-		flags |= FS_IMMUTABLE_FL;
-	if (iflags & S_NOATIME)
-		flags |= FS_NOATIME_FL;
-
-	return flags;
-}
-
 static long ovl_ioctl_set_flags(struct file *file, unsigned int cmd,
-				unsigned long arg, unsigned int flags)
+				unsigned long arg)
 {
 	long ret;
 	struct inode *inode = file_inode(file);
-	unsigned int oldflags;
 
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
@@ -591,12 +571,6 @@ static long ovl_ioctl_set_flags(struct file *file, unsigned int cmd,
 
 	inode_lock(inode);
 
-	/* Check the capability before cred override */
-	oldflags = ovl_iflags_to_fsflags(READ_ONCE(inode->i_flags));
-	ret = vfs_ioc_setflags_prepare(inode, oldflags, flags);
-	if (ret)
-		goto unlock;
-
 	ret = ovl_maybe_copy_up(file_dentry(file), O_WRONLY);
 	if (ret)
 		goto unlock;
@@ -613,46 +587,6 @@ static long ovl_ioctl_set_flags(struct file *file, unsigned int cmd,
 
 }
 
-static long ovl_ioctl_set_fsflags(struct file *file, unsigned int cmd,
-				  unsigned long arg)
-{
-	unsigned int flags;
-
-	if (get_user(flags, (int __user *) arg))
-		return -EFAULT;
-
-	return ovl_ioctl_set_flags(file, cmd, arg, flags);
-}
-
-static unsigned int ovl_fsxflags_to_fsflags(unsigned int xflags)
-{
-	unsigned int flags = 0;
-
-	if (xflags & FS_XFLAG_SYNC)
-		flags |= FS_SYNC_FL;
-	if (xflags & FS_XFLAG_APPEND)
-		flags |= FS_APPEND_FL;
-	if (xflags & FS_XFLAG_IMMUTABLE)
-		flags |= FS_IMMUTABLE_FL;
-	if (xflags & FS_XFLAG_NOATIME)
-		flags |= FS_NOATIME_FL;
-
-	return flags;
-}
-
-static long ovl_ioctl_set_fsxflags(struct file *file, unsigned int cmd,
-				   unsigned long arg)
-{
-	struct fsxattr fa;
-
-	memset(&fa, 0, sizeof(fa));
-	if (copy_from_user(&fa, (void __user *) arg, sizeof(fa)))
-		return -EFAULT;
-
-	return ovl_ioctl_set_flags(file, cmd, arg,
-				   ovl_fsxflags_to_fsflags(fa.fsx_xflags));
-}
-
 long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	long ret;
@@ -663,12 +597,9 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		ret = ovl_real_ioctl(file, cmd, arg);
 		break;
 
-	case FS_IOC_SETFLAGS:
-		ret = ovl_ioctl_set_fsflags(file, cmd, arg);
-		break;
-
 	case FS_IOC_FSSETXATTR:
-		ret = ovl_ioctl_set_fsxflags(file, cmd, arg);
+	case FS_IOC_SETFLAGS:
+		ret = ovl_ioctl_set_flags(file, cmd, arg);
 		break;
 
 	default:
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 05/10] ovl: simplify file splice
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (3 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 04/10] ovl: make ioctl() safe Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 06/10] ovl: user xattr Miklos Szeredi
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

generic_file_splice_read() and iter_file_splice_write() will call back into
f_op->iter_read() and f_op->iter_write() respectively.  These already do
the real file lookup and cred override.  So the code in ovl_splice_read()
and ovl_splice_write() is redundant.

In addition the ovl_file_accessed() call in ovl_splice_write() is
incorrect, though probably harmless.

Fix by calling generic_file_splice_read() and iter_file_splice_write()
directly.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/file.c | 46 ++-------------------------------------------
 1 file changed, 2 insertions(+), 44 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 3cd1590f2030..dc767034d37b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -397,48 +397,6 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
-static ssize_t ovl_splice_read(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
-{
-	ssize_t ret;
-	struct fd real;
-	const struct cred *old_cred;
-
-	ret = ovl_real_fdget(in, &real);
-	if (ret)
-		return ret;
-
-	old_cred = ovl_override_creds(file_inode(in)->i_sb);
-	ret = generic_file_splice_read(real.file, ppos, pipe, len, flags);
-	revert_creds(old_cred);
-
-	ovl_file_accessed(in);
-	fdput(real);
-	return ret;
-}
-
-static ssize_t
-ovl_splice_write(struct pipe_inode_info *pipe, struct file *out,
-			  loff_t *ppos, size_t len, unsigned int flags)
-{
-	struct fd real;
-	const struct cred *old_cred;
-	ssize_t ret;
-
-	ret = ovl_real_fdget(out, &real);
-	if (ret)
-		return ret;
-
-	old_cred = ovl_override_creds(file_inode(out)->i_sb);
-	ret = iter_file_splice_write(pipe, real.file, ppos, len, flags);
-	revert_creds(old_cred);
-
-	ovl_file_accessed(out);
-	fdput(real);
-	return ret;
-}
-
 static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct fd real;
@@ -732,8 +690,8 @@ const struct file_operations ovl_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ovl_compat_ioctl,
 #endif
-	.splice_read    = ovl_splice_read,
-	.splice_write   = ovl_splice_write,
+	.splice_read    = generic_file_splice_read,
+	.splice_write   = iter_file_splice_write,
 
 	.copy_file_range	= ovl_copy_file_range,
 	.remap_file_range	= ovl_remap_file_range,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 06/10] ovl: user xattr
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (4 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 05/10] ovl: simplify file splice Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-08 13:10   ` Amir Goldstein
  2020-12-07 16:32 ` [PATCH v2 07/10] ovl: do not fail when setting origin xattr Miklos Szeredi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

Optionally allow using "user.overlay." namespace instead of
"trusted.overlay."

This is necessary for overlayfs to be able to be mounted in an unprivileged
namepsace.

Make the option explicit, since it makes the filesystem format be
incompatible.

Disable redirect_dir and metacopy options, because these would allow
privilege escalation through direct manipulation of the
"user.overlay.redirect" or "user.overlay.metacopy" xattrs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/inode.c     | 10 ++++++--
 fs/overlayfs/overlayfs.h |  8 +++---
 fs/overlayfs/ovl_entry.h |  1 +
 fs/overlayfs/super.c     | 55 ++++++++++++++++++++++++++++++++++++----
 fs/overlayfs/util.c      |  5 ++--
 5 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index b584dca845ba..8ec3062999a9 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -329,8 +329,14 @@ static const char *ovl_get_link(struct dentry *dentry,
 
 bool ovl_is_private_xattr(struct super_block *sb, const char *name)
 {
-	return strncmp(name, OVL_XATTR_PREFIX,
-		       sizeof(OVL_XATTR_PREFIX) - 1) == 0;
+	struct ovl_fs *ofs = sb->s_fs_info;
+
+	if (ofs->config.userxattr)
+		return strncmp(name, OVL_XATTR_USER_PREFIX,
+			       sizeof(OVL_XATTR_USER_PREFIX) - 1) == 0;
+	else
+		return strncmp(name, OVL_XATTR_TRUSTED_PREFIX,
+			       sizeof(OVL_XATTR_TRUSTED_PREFIX) - 1) == 0;
 }
 
 int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..46282111d6e6 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -22,7 +22,9 @@ enum ovl_path_type {
 #define OVL_TYPE_MERGE(type)	((type) & __OVL_PATH_MERGE)
 #define OVL_TYPE_ORIGIN(type)	((type) & __OVL_PATH_ORIGIN)
 
-#define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
+#define OVL_XATTR_NAMESPACE "overlay."
+#define OVL_XATTR_TRUSTED_PREFIX XATTR_TRUSTED_PREFIX OVL_XATTR_NAMESPACE
+#define OVL_XATTR_USER_PREFIX XATTR_USER_PREFIX OVL_XATTR_NAMESPACE
 
 enum ovl_xattr {
 	OVL_XATTR_OPAQUE,
@@ -113,10 +115,10 @@ struct ovl_fh {
 #define OVL_FH_FID_OFFSET	(OVL_FH_WIRE_OFFSET + \
 				 offsetof(struct ovl_fb, fid))
 
-extern const char *ovl_xattr_table[];
+extern const char *ovl_xattr_table[][2];
 static inline const char *ovl_xattr(struct ovl_fs *ofs, enum ovl_xattr ox)
 {
-	return ovl_xattr_table[ox];
+	return ovl_xattr_table[ox][ofs->config.userxattr];
 }
 
 static inline int ovl_do_rmdir(struct inode *dir, struct dentry *dentry)
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..d634c7ba3b9c 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -17,6 +17,7 @@ struct ovl_config {
 	bool nfs_export;
 	int xino;
 	bool metacopy;
+	bool userxattr;
 	bool ovl_volatile;
 };
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..189380b946be 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -411,6 +411,7 @@ enum {
 	OPT_INDEX_ON,
 	OPT_INDEX_OFF,
 	OPT_NFS_EXPORT_ON,
+	OPT_USERXATTR,
 	OPT_NFS_EXPORT_OFF,
 	OPT_XINO_ON,
 	OPT_XINO_OFF,
@@ -429,6 +430,7 @@ static const match_table_t ovl_tokens = {
 	{OPT_REDIRECT_DIR,		"redirect_dir=%s"},
 	{OPT_INDEX_ON,			"index=on"},
 	{OPT_INDEX_OFF,			"index=off"},
+	{OPT_USERXATTR,			"userxattr"},
 	{OPT_NFS_EXPORT_ON,		"nfs_export=on"},
 	{OPT_NFS_EXPORT_OFF,		"nfs_export=off"},
 	{OPT_XINO_ON,			"xino=on"},
@@ -585,6 +587,10 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 			config->ovl_volatile = true;
 			break;
 
+		case OPT_USERXATTR:
+			config->userxattr = true;
+			break;
+
 		default:
 			pr_err("unrecognized mount option \"%s\" or missing value\n",
 					p);
@@ -688,6 +694,28 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
 		}
 	}
 
+
+	/* Resolve userxattr -> !redirect && !metacopy dependency */
+	if (config->userxattr) {
+		if (config->redirect_follow && redirect_opt) {
+			pr_err("conflicting options: userxattr,redirect_dir=%s\n",
+			       config->redirect_mode);
+			return -EINVAL;
+		}
+		if (config->metacopy && metacopy_opt) {
+			pr_err("conflicting options: userxattr,metacopy=on\n");
+			return -EINVAL;
+		}
+		/*
+		 * Silently disable default setting of redirect and metacopy.
+		 * This shall be the default in the future as well: these
+		 * options must be explicitly enabled if used together with
+		 * userxattr.
+		 */
+		config->redirect_dir = config->redirect_follow = false;
+		config->metacopy = false;
+	}
+
 	return 0;
 }
 
@@ -1037,8 +1065,14 @@ ovl_posix_acl_default_xattr_handler = {
 	.set = ovl_posix_acl_xattr_set,
 };
 
-static const struct xattr_handler ovl_own_xattr_handler = {
-	.prefix	= OVL_XATTR_PREFIX,
+static const struct xattr_handler ovl_own_trusted_xattr_handler = {
+	.prefix	= OVL_XATTR_TRUSTED_PREFIX,
+	.get = ovl_own_xattr_get,
+	.set = ovl_own_xattr_set,
+};
+
+static const struct xattr_handler ovl_own_user_xattr_handler = {
+	.prefix	= OVL_XATTR_USER_PREFIX,
 	.get = ovl_own_xattr_get,
 	.set = ovl_own_xattr_set,
 };
@@ -1049,12 +1083,22 @@ static const struct xattr_handler ovl_other_xattr_handler = {
 	.set = ovl_other_xattr_set,
 };
 
-static const struct xattr_handler *ovl_xattr_handlers[] = {
+static const struct xattr_handler *ovl_trusted_xattr_handlers[] = {
+#ifdef CONFIG_FS_POSIX_ACL
+	&ovl_posix_acl_access_xattr_handler,
+	&ovl_posix_acl_default_xattr_handler,
+#endif
+	&ovl_own_trusted_xattr_handler,
+	&ovl_other_xattr_handler,
+	NULL
+};
+
+static const struct xattr_handler *ovl_user_xattr_handlers[] = {
 #ifdef CONFIG_FS_POSIX_ACL
 	&ovl_posix_acl_access_xattr_handler,
 	&ovl_posix_acl_default_xattr_handler,
 #endif
-	&ovl_own_xattr_handler,
+	&ovl_own_user_xattr_handler,
 	&ovl_other_xattr_handler,
 	NULL
 };
@@ -1991,7 +2035,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	cap_lower(cred->cap_effective, CAP_SYS_RESOURCE);
 
 	sb->s_magic = OVERLAYFS_SUPER_MAGIC;
-	sb->s_xattr = ovl_xattr_handlers;
+	sb->s_xattr = ofs->config.userxattr ? ovl_user_xattr_handlers :
+		ovl_trusted_xattr_handlers;
 	sb->s_fs_info = ofs;
 	sb->s_flags |= SB_POSIXACL;
 	sb->s_iflags |= SB_I_SKIP_SYNC;
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..66eaf4db027f 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -582,9 +582,10 @@ bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
 #define OVL_XATTR_METACOPY_POSTFIX	"metacopy"
 
 #define OVL_XATTR_TAB_ENTRY(x) \
-	[x] = OVL_XATTR_PREFIX x ## _POSTFIX
+	[x] = { [false] = OVL_XATTR_TRUSTED_PREFIX x ## _POSTFIX, \
+		[true] = OVL_XATTR_USER_PREFIX x ## _POSTFIX }
 
-const char *ovl_xattr_table[] = {
+const char *ovl_xattr_table[][2] = {
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_OPAQUE),
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_REDIRECT),
 	OVL_XATTR_TAB_ENTRY(OVL_XATTR_ORIGIN),
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 07/10] ovl: do not fail when setting origin xattr
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (5 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 06/10] ovl: user xattr Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 08/10] ovl: do not fail because of O_NOATIME Miklos Szeredi
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

Comment above call already says this, but only EOPNOTSUPP is ignored, other
failures are not.

For example setting "user.*" will fail with EPERM on symlink/special.

Ignore this error as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/copy_up.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 955ecd4030f0..8a7ef40d98f8 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -352,7 +352,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 				 fh ? fh->fb.len : 0, 0);
 	kfree(fh);
 
-	return err;
+	/* Ignore -EPERM from setting "user.*" on symlink/special */
+	return err == -EPERM ? 0 : err;
 }
 
 /* Store file handle of @upper dir in @index dir entry */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 08/10] ovl: do not fail because of O_NOATIME
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (6 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 07/10] ovl: do not fail when setting origin xattr Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-08 11:29   ` Amir Goldstein
  2020-12-07 16:32 ` [PATCH v2 09/10] ovl: do not get metacopy for userxattr Miklos Szeredi
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

In case the file cannot be opened with O_NOATIME because of lack of
capabilities, then clear O_NOATIME instead of failing.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index dc767034d37b..d6ac7ac66410 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -53,9 +53,10 @@ static struct file *ovl_open_realfile(const struct file *file,
 	err = inode_permission(realinode, MAY_OPEN | acc_mode);
 	if (err) {
 		realfile = ERR_PTR(err);
-	} else if (!inode_owner_or_capable(realinode)) {
-		realfile = ERR_PTR(-EPERM);
 	} else {
+		if (!inode_owner_or_capable(realinode))
+			flags &= ~O_NOATIME;
+
 		realfile = open_with_fake_path(&file->f_path, flags, realinode,
 					       current_cred());
 	}
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 09/10] ovl: do not get metacopy for userxattr
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (7 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 08/10] ovl: do not fail because of O_NOATIME Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-07 16:32 ` [PATCH v2 10/10] ovl: unprivieged mounts Miklos Szeredi
  2020-12-08 10:27 ` [PATCH v2 00/10] allow unprivileged overlay mounts Tetsuo Handa
  10 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

When looking up an inode on the lower layer for which the mounter lacks
read permisison the metacopy check will fail.  This causes the lookup to
fail as well, even though the directory is readable.

So ignore EACCES for the "userxattr" case and assume no metacopy for the
unreadable file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/util.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 66eaf4db027f..703c6e529f39 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -880,6 +880,13 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry)
 	if (res < 0) {
 		if (res == -ENODATA || res == -EOPNOTSUPP)
 			return 0;
+		/*
+		 * getxattr on user.* may fail with EACCES in case there's no
+		 * read permission on the inode.  Not much we can do, other than
+		 * tell the caller that this is not a metacopy inode.
+		 */
+		if (ofs->config.userxattr && res == -EACCES)
+			return 0;
 		goto out;
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 10/10] ovl: unprivieged mounts
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (8 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 09/10] ovl: do not get metacopy for userxattr Miklos Szeredi
@ 2020-12-07 16:32 ` Miklos Szeredi
  2020-12-08 10:27 ` [PATCH v2 00/10] allow unprivileged overlay mounts Tetsuo Handa
  10 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

Enable unprivileged user namespace mounts of overlayfs.  Overlayfs's
permission model (*) ensures that the mounter itself cannot gain additional
privileges by the act of creating an overlayfs mount.

This feature request is coming from the "rootless" container crowd.

(*) Documentation/filesystems/overlayfs.txt#Permission model

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/overlayfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 189380b946be..019e6f1834b0 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -2073,6 +2073,7 @@ static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
 static struct file_system_type ovl_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "overlay",
+	.fs_flags	= FS_USERNS_MOUNT,
 	.mount		= ovl_mount,
 	.kill_sb	= kill_anon_super,
 };
-- 
2.26.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 00/10] allow unprivileged overlay mounts
  2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
                   ` (9 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v2 10/10] ovl: unprivieged mounts Miklos Szeredi
@ 2020-12-08 10:27 ` Tetsuo Handa
  2020-12-10  8:56   ` John Johansen
  10 siblings, 1 reply; 39+ messages in thread
From: Tetsuo Handa @ 2020-12-08 10:27 UTC (permalink / raw)
  To: Miklos Szeredi, Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module,
	linux-kernel, John Johansen

On 2020/12/08 1:32, Miklos Szeredi wrote:
> A general observation is that overlayfs does not call security_path_*()
> hooks on the underlying fs.  I don't see this as a problem, because a
> simple bind mount done inside a private mount namespace also defeats the
> path based security checks.  Maybe I'm missing something here, so I'm
> interested in comments from AppArmor and Tomoyo developers.

Regarding TOMOYO, I don't want overlayfs to call security_path_*() hooks on the
underlying fs, but the reason is different. It is not because a simple bind mount
done inside a private mount namespace defeats the path based security checks.
TOMOYO does want to check what device/filesystem is mounted on which location. But
currently TOMOYO is failing to check it due to fsopen()/fsmount()/move_mount() API.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-07 16:32 ` [PATCH v2 04/10] ovl: make ioctl() safe Miklos Szeredi
@ 2020-12-08 11:11   ` Amir Goldstein
  2020-12-10 15:18     ` Miklos Szeredi
  2020-12-09  1:57   ` James Morris
  1 sibling, 1 reply; 39+ messages in thread
From: Amir Goldstein @ 2020-12-08 11:11 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, overlayfs, LSM List,
	linux-kernel, Dmitry Vyukov

On Mon, Dec 7, 2020 at 6:36 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> ovl_ioctl_set_flags() does a capability check using flags, but then the
> real ioctl double-fetches flags and uses potentially different value.
>
> The "Check the capability before cred override" comment misleading: user
> can skip this check by presenting benign flags first and then overwriting
> them to non-benign flags.
>
> Just remove the cred override for now, hoping this doesn't cause a
> regression.
>
> The proper solution is to create a new setxflags i_op (patches are in the
> works).
>
> Xfstests don't show a regression.
>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

Looks reasonable

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/overlayfs/file.c | 75 ++-------------------------------------------
>  1 file changed, 3 insertions(+), 72 deletions(-)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..3cd1590f2030 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -541,46 +541,26 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
>                            unsigned long arg)
>  {
>         struct fd real;
> -       const struct cred *old_cred;
>         long ret;
>
>         ret = ovl_real_fdget(file, &real);
>         if (ret)
>                 return ret;
>
> -       old_cred = ovl_override_creds(file_inode(file)->i_sb);
>         ret = security_file_ioctl(real.file, cmd, arg);
>         if (!ret)
>                 ret = vfs_ioctl(real.file, cmd, arg);
> -       revert_creds(old_cred);
>
>         fdput(real);
>
>         return ret;
>  }
>


I wonder if we shouldn't leave a comment behind to explain
that no override is intentional.

I also wonder if "Permission model" sections shouldn't be saying
something about ioctl() (current task checks only)? or we just treat
this is a breakage of the permission model that needs to be fixed?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 08/10] ovl: do not fail because of O_NOATIME
  2020-12-07 16:32 ` [PATCH v2 08/10] ovl: do not fail because of O_NOATIME Miklos Szeredi
@ 2020-12-08 11:29   ` Amir Goldstein
  2020-12-11 14:44     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Amir Goldstein @ 2020-12-08 11:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, overlayfs, LSM List, linux-kernel

On Mon, Dec 7, 2020 at 6:37 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> In case the file cannot be opened with O_NOATIME because of lack of
> capabilities, then clear O_NOATIME instead of failing.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/overlayfs/file.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index dc767034d37b..d6ac7ac66410 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -53,9 +53,10 @@ static struct file *ovl_open_realfile(const struct file *file,
>         err = inode_permission(realinode, MAY_OPEN | acc_mode);
>         if (err) {
>                 realfile = ERR_PTR(err);
> -       } else if (!inode_owner_or_capable(realinode)) {
> -               realfile = ERR_PTR(-EPERM);
>         } else {
> +               if (!inode_owner_or_capable(realinode))
> +                       flags &= ~O_NOATIME;
> +

Isn't that going to break:

        flags |= OVL_OPEN_FLAGS;

        /* If some flag changed that cannot be changed then something's amiss */
        if (WARN_ON((file->f_flags ^ flags) & ~OVL_SETFL_MASK))

IOW setting a flag that is allowed to change will fail because of
missing O_ATIME in file->f_flags.

I guess we need test coverage for SETFL.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 06/10] ovl: user xattr
  2020-12-07 16:32 ` [PATCH v2 06/10] ovl: user xattr Miklos Szeredi
@ 2020-12-08 13:10   ` Amir Goldstein
  2020-12-11 14:55     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Amir Goldstein @ 2020-12-08 13:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, overlayfs, LSM List, linux-kernel

On Mon, Dec 7, 2020 at 6:37 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> Optionally allow using "user.overlay." namespace instead of
> "trusted.overlay."

There are several occurrences of "trusted.overlay" string in code and
Documentation, which is fine. But maybe only adjust the comment for
testing xattr support:

         * Check if upper/work fs supports trusted.overlay.* xattr

>
> This is necessary for overlayfs to be able to be mounted in an unprivileged
> namepsace.
>
> Make the option explicit, since it makes the filesystem format be
> incompatible.
>
> Disable redirect_dir and metacopy options, because these would allow
> privilege escalation through direct manipulation of the
> "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -582,9 +582,10 @@ bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry,
>  #define OVL_XATTR_METACOPY_POSTFIX     "metacopy"
>
>  #define OVL_XATTR_TAB_ENTRY(x) \
> -       [x] = OVL_XATTR_PREFIX x ## _POSTFIX
> +       [x] = { [false] = OVL_XATTR_TRUSTED_PREFIX x ## _POSTFIX, \
> +               [true] = OVL_XATTR_USER_PREFIX x ## _POSTFIX }
>
> -const char *ovl_xattr_table[] = {
> +const char *ovl_xattr_table[][2] = {
>         OVL_XATTR_TAB_ENTRY(OVL_XATTR_OPAQUE),
>         OVL_XATTR_TAB_ENTRY(OVL_XATTR_REDIRECT),
>         OVL_XATTR_TAB_ENTRY(OVL_XATTR_ORIGIN),
> --

Can you constify this 2D array? I don't even know the syntax for that...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 03/10] ovl: check privs before decoding file handle
  2020-12-07 16:32 ` [PATCH v2 03/10] ovl: check privs before decoding file handle Miklos Szeredi
@ 2020-12-08 13:49   ` Amir Goldstein
  2020-12-09 10:13     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Amir Goldstein @ 2020-12-08 13:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, overlayfs, LSM List, linux-kernel

On Mon, Dec 7, 2020 at 6:36 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in
> ovl_decode_real_fh() as well to prevent privilege escalation for
> unprivileged overlay mounts.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/overlayfs/namei.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index a6162c4076db..82a55fdb1e7a 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -156,6 +156,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
>         struct dentry *real;
>         int bytes;
>
> +       if (!capable(CAP_DAC_READ_SEARCH))
> +               return NULL;
> +

If the mounter is not capable in init ns, ovl_check_origin() and
ovl_verify_index()
will not function as expected and this will break index and nfs export features.
So I think we need to also check capability in ovl_can_decode_fh(), to auto
disable those features.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
@ 2020-12-09  1:53   ` James Morris
  2021-01-01 17:35   ` Eric W. Biederman
  1 sibling, 0 replies; 39+ messages in thread
From: James Morris @ 2020-12-09  1:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, linux-unionfs,
	linux-security-module, linux-kernel

On Mon, 7 Dec 2020, Miklos Szeredi wrote:

> cap_convert_nscap() does permission checking as well as conversion of the
> xattr value conditionally based on fs's user-ns.
> 
> This is needed by overlayfs and probably other layered fs (ecryptfs) and is
> what vfs_foo() is supposed to do anyway.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


Acked-by: James Morris <jamorris@linux.microsoft.com>


-- 
James Morris
<jmorris@namei.org>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-07 16:32 ` [PATCH v2 04/10] ovl: make ioctl() safe Miklos Szeredi
  2020-12-08 11:11   ` Amir Goldstein
@ 2020-12-09  1:57   ` James Morris
  2020-12-10 15:19     ` Miklos Szeredi
  1 sibling, 1 reply; 39+ messages in thread
From: James Morris @ 2020-12-09  1:57 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W . Biederman, linux-fsdevel, linux-unionfs,
	linux-security-module, linux-kernel, Dmitry Vyukov

On Mon, 7 Dec 2020, Miklos Szeredi wrote:

> ovl_ioctl_set_flags() does a capability check using flags, but then the
> real ioctl double-fetches flags and uses potentially different value.
> 
> The "Check the capability before cred override" comment misleading: user
> can skip this check by presenting benign flags first and then overwriting
> them to non-benign flags.

Is this a security bug which should be fixed in stable?

-- 
James Morris
<jmorris@namei.org>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 03/10] ovl: check privs before decoding file handle
  2020-12-08 13:49   ` Amir Goldstein
@ 2020-12-09 10:13     ` Miklos Szeredi
  2020-12-09 16:20       ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-09 10:13 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Tue, Dec 8, 2020 at 2:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 6:36 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in
> > ovl_decode_real_fh() as well to prevent privilege escalation for
> > unprivileged overlay mounts.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >  fs/overlayfs/namei.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> > index a6162c4076db..82a55fdb1e7a 100644
> > --- a/fs/overlayfs/namei.c
> > +++ b/fs/overlayfs/namei.c
> > @@ -156,6 +156,9 @@ struct dentry *ovl_decode_real_fh(struct ovl_fh *fh, struct vfsmount *mnt,
> >         struct dentry *real;
> >         int bytes;
> >
> > +       if (!capable(CAP_DAC_READ_SEARCH))
> > +               return NULL;
> > +
>
> If the mounter is not capable in init ns, ovl_check_origin() and
> ovl_verify_index()
> will not function as expected and this will break index and nfs export features.

NFS export is clear-cut.

Hard link indexing should work without fh decoding, since it is only
encoding the file handle to search for the index entry, and encoding
is not privileged.

Not sure how ovl_verify_index will choke on that, will have to try...
but worse case we just need to disable verification.

And yeah, using .overlay.origin attribute for inode number consistency
won't work either, but it should fail silently (which is probably a
good thing).  Haven't tested this yet, though.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 03/10] ovl: check privs before decoding file handle
  2020-12-09 10:13     ` Miklos Szeredi
@ 2020-12-09 16:20       ` Miklos Szeredi
  2020-12-09 18:16         ` Amir Goldstein
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-09 16:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Wed, Dec 9, 2020 at 11:13 AM Miklos Szeredi <miklos@szeredi.hu> wrote:

> Hard link indexing should work without fh decoding, since it is only
> encoding the file handle to search for the index entry, and encoding
> is not privileged.

Tested this a bit and while hard link indexing does work,  inode
lookup is broken since it uses the origin inode as a key (which is not
available) instead of using the origin value directly.  This is
fixable, but needs a fair amount of restructuring, so let's just
postpone this and disable index for now, as you suggested.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 03/10] ovl: check privs before decoding file handle
  2020-12-09 16:20       ` Miklos Szeredi
@ 2020-12-09 18:16         ` Amir Goldstein
  0 siblings, 0 replies; 39+ messages in thread
From: Amir Goldstein @ 2020-12-09 18:16 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Wed, Dec 9, 2020 at 6:20 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, Dec 9, 2020 at 11:13 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Hard link indexing should work without fh decoding, since it is only
> > encoding the file handle to search for the index entry, and encoding
> > is not privileged.
>
> Tested this a bit and while hard link indexing does work,  inode
> lookup is broken since it uses the origin inode as a key (which is not

Yes, that is what I meant by ovl_check_origin() is broken.

> available) instead of using the origin value directly.  This is
> fixable, but needs a fair amount of restructuring, so let's just

Maybe it also requires on-disk changes.
We should be able to use the origin fh as the key for lower inode,
but we need the lower st_inode for initializing the ovl inode with
correct ino. If we cannot decode lower inode from origin fh, I think
we would need to store the ino in user.overlay.ino on copy up or
maintain redirect, but redirect is not supported either with user ns mount...

> postpone this and disable index for now, as you suggested.
>

Nobody seems to be enabling it anyway :-/

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 00/10] allow unprivileged overlay mounts
  2020-12-08 10:27 ` [PATCH v2 00/10] allow unprivileged overlay mounts Tetsuo Handa
@ 2020-12-10  8:56   ` John Johansen
  2020-12-10  9:39     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: John Johansen @ 2020-12-10  8:56 UTC (permalink / raw)
  To: Tetsuo Handa, Miklos Szeredi, Eric W . Biederman
  Cc: linux-fsdevel, linux-unionfs, linux-security-module, linux-kernel

On 12/8/20 2:27 AM, Tetsuo Handa wrote:
> On 2020/12/08 1:32, Miklos Szeredi wrote:
>> A general observation is that overlayfs does not call security_path_*()
>> hooks on the underlying fs.  I don't see this as a problem, because a
>> simple bind mount done inside a private mount namespace also defeats the
>> path based security checks.  Maybe I'm missing something here, so I'm
>> interested in comments from AppArmor and Tomoyo developers.
> 
> Regarding TOMOYO, I don't want overlayfs to call security_path_*() hooks on the
> underlying fs, but the reason is different. It is not because a simple bind mount
> done inside a private mount namespace defeats the path based security checks.
> TOMOYO does want to check what device/filesystem is mounted on which location. But
> currently TOMOYO is failing to check it due to fsopen()/fsmount()/move_mount() API.
> 

Regardless of TOMOYO's approach I would say that overlays should call the
security_path_*() hooks, making it possible for an LSM to do something based off of
them when needed.

The current state of private mounts with regard to path based mediation is broken.
I just haven't had time to try and come up with an acceptable fix for it. overlayfs
is actually broken under apparmor mediation, and accesses to the lower layer end up
getting denied but there is no way to properly allow them. So policy that hits this
needs a flag set that allows for it in a very hacky way (its on the list of things
to fix).

Path based mediation has to carefully control mounts otherwise policy can be
circumvented as Miklos rightly points out. Ideally path based LSM wouldn't allow
you to do the simple bind mount inside a private mount namespace (at least not
unless policy allowed for it). AppArmor does mediate the mount hooks and bind
mounts in a private mount namespace (if they go through the LSM mount hooks) will
be denied. Again the problem is how to allow them, and this is broken.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 00/10] allow unprivileged overlay mounts
  2020-12-10  8:56   ` John Johansen
@ 2020-12-10  9:39     ` Miklos Szeredi
  2020-12-15 11:03       ` John Johansen
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-10  9:39 UTC (permalink / raw)
  To: John Johansen
  Cc: Tetsuo Handa, Miklos Szeredi, Eric W . Biederman, linux-fsdevel,
	overlayfs, LSM, linux-kernel

On Thu, Dec 10, 2020 at 10:00 AM John Johansen
<john.johansen@canonical.com> wrote:
>
> On 12/8/20 2:27 AM, Tetsuo Handa wrote:
> > On 2020/12/08 1:32, Miklos Szeredi wrote:
> >> A general observation is that overlayfs does not call security_path_*()
> >> hooks on the underlying fs.  I don't see this as a problem, because a
> >> simple bind mount done inside a private mount namespace also defeats the
> >> path based security checks.  Maybe I'm missing something here, so I'm
> >> interested in comments from AppArmor and Tomoyo developers.
> >
> > Regarding TOMOYO, I don't want overlayfs to call security_path_*() hooks on the
> > underlying fs, but the reason is different. It is not because a simple bind mount
> > done inside a private mount namespace defeats the path based security checks.
> > TOMOYO does want to check what device/filesystem is mounted on which location. But
> > currently TOMOYO is failing to check it due to fsopen()/fsmount()/move_mount() API.
> >
>
> Regardless of TOMOYO's approach I would say that overlays should call the
> security_path_*() hooks, making it possible for an LSM to do something based off of
> them when needed.
>
> The current state of private mounts with regard to path based mediation is broken.
> I just haven't had time to try and come up with an acceptable fix for it. overlayfs
> is actually broken under apparmor mediation, and accesses to the lower layer end up
> getting denied but there is no way to properly allow them. So policy that hits this
> needs a flag set that allows for it in a very hacky way (its on the list of things
> to fix).
>
> Path based mediation has to carefully control mounts otherwise policy can be
> circumvented as Miklos rightly points out. Ideally path based LSM wouldn't allow
> you to do the simple bind mount inside a private mount namespace (at least not
> unless policy allowed for it). AppArmor does mediate the mount hooks and bind
> mounts in a private mount namespace (if they go through the LSM mount hooks) will
> be denied. Again the problem is how to allow them, and this is broken.

Okay, so what does that mean for overlayfs?

AA can deny the overlay mount just as well as the bind mount, and it
can allow it just as well as the bind mount.  Policy could be the
same.

Also all the security_path_ hooks will still get called for each
access on overlayfs itself.  They won't be called for the accesses
which overlayfs does on underlying layers, but is that needed?

Overlay could call those hooks itself (since the vfs_ helpers don't)
but the big question is whether that makes any sense.  AFAICS it might
make sense, but only if AA would correctly handle bind mounts, and
especially detached bind mounts (which is what overlayfs technically
uses).

Thanks,
Miklos

Tja

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-08 11:11   ` Amir Goldstein
@ 2020-12-10 15:18     ` Miklos Szeredi
  2020-12-14  5:44       ` Amir Goldstein
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-10 15:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel, Dmitry Vyukov

On Tue, Dec 8, 2020 at 12:15 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 6:36 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > ovl_ioctl_set_flags() does a capability check using flags, but then the
> > real ioctl double-fetches flags and uses potentially different value.
> >
> > The "Check the capability before cred override" comment misleading: user
> > can skip this check by presenting benign flags first and then overwriting
> > them to non-benign flags.
> >
> > Just remove the cred override for now, hoping this doesn't cause a
> > regression.
> >
> > The proper solution is to create a new setxflags i_op (patches are in the
> > works).
> >
> > Xfstests don't show a regression.
> >
> > Reported-by: Dmitry Vyukov <dvyukov@google.com>
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
>
> Looks reasonable
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> > ---
> >  fs/overlayfs/file.c | 75 ++-------------------------------------------
> >  1 file changed, 3 insertions(+), 72 deletions(-)
> >
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..3cd1590f2030 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -541,46 +541,26 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
> >                            unsigned long arg)
> >  {
> >         struct fd real;
> > -       const struct cred *old_cred;
> >         long ret;
> >
> >         ret = ovl_real_fdget(file, &real);
> >         if (ret)
> >                 return ret;
> >
> > -       old_cred = ovl_override_creds(file_inode(file)->i_sb);
> >         ret = security_file_ioctl(real.file, cmd, arg);
> >         if (!ret)
> >                 ret = vfs_ioctl(real.file, cmd, arg);
> > -       revert_creds(old_cred);
> >
> >         fdput(real);
> >
> >         return ret;
> >  }
> >
>
>
> I wonder if we shouldn't leave a comment behind to explain
> that no override is intentional.

Comment added.

> I also wonder if "Permission model" sections shouldn't be saying
> something about ioctl() (current task checks only)? or we just treat
> this is a breakage of the permission model that needs to be fixed?

This is a breakage of the permission model.  But I don't think this is
a serious breakage, or one that actually matters.

Not sure which is better: adding exceptions to the model or applying
the model in situations where it's unnecessary.  I'd rather go with
the latter, but clearly in this case that was the wrong decision.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-09  1:57   ` James Morris
@ 2020-12-10 15:19     ` Miklos Szeredi
  0 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-10 15:19 UTC (permalink / raw)
  To: James Morris
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM, linux-kernel, Dmitry Vyukov

On Wed, Dec 9, 2020 at 3:01 AM James Morris <jmorris@namei.org> wrote:
>
> On Mon, 7 Dec 2020, Miklos Szeredi wrote:
>
> > ovl_ioctl_set_flags() does a capability check using flags, but then the
> > real ioctl double-fetches flags and uses potentially different value.
> >
> > The "Check the capability before cred override" comment misleading: user
> > can skip this check by presenting benign flags first and then overwriting
> > them to non-benign flags.
>
> Is this a security bug which should be fixed in stable?

Yes, good point.  Added Cc: stable@...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 08/10] ovl: do not fail because of O_NOATIME
  2020-12-08 11:29   ` Amir Goldstein
@ 2020-12-11 14:44     ` Miklos Szeredi
  2020-12-14  5:49       ` Amir Goldstein
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-11 14:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Tue, Dec 8, 2020 at 12:32 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 6:37 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > In case the file cannot be opened with O_NOATIME because of lack of
> > capabilities, then clear O_NOATIME instead of failing.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > ---
> >  fs/overlayfs/file.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index dc767034d37b..d6ac7ac66410 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -53,9 +53,10 @@ static struct file *ovl_open_realfile(const struct file *file,
> >         err = inode_permission(realinode, MAY_OPEN | acc_mode);
> >         if (err) {
> >                 realfile = ERR_PTR(err);
> > -       } else if (!inode_owner_or_capable(realinode)) {
> > -               realfile = ERR_PTR(-EPERM);
> >         } else {
> > +               if (!inode_owner_or_capable(realinode))
> > +                       flags &= ~O_NOATIME;
> > +
>
> Isn't that going to break:
>
>         flags |= OVL_OPEN_FLAGS;
>
>         /* If some flag changed that cannot be changed then something's amiss */
>         if (WARN_ON((file->f_flags ^ flags) & ~OVL_SETFL_MASK))
>
> IOW setting a flag that is allowed to change will fail because of
> missing O_ATIME in file->f_flags.

Well spotted.  I just removed those lines as a fix.   The check never
triggered since its introduction in 4.19, so I guess it isn't worth
adding more complexity for.

>
> I guess we need test coverage for SETFL.

There might be some in ltp, haven't checked.  Would be nice if the fs
related ltp tests could be integrated into xfstests.


Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 06/10] ovl: user xattr
  2020-12-08 13:10   ` Amir Goldstein
@ 2020-12-11 14:55     ` Miklos Szeredi
  0 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-11 14:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Tue, Dec 8, 2020 at 2:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 6:37 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > Optionally allow using "user.overlay." namespace instead of
> > "trusted.overlay."
>
> There are several occurrences of "trusted.overlay" string in code and
> Documentation, which is fine. But maybe only adjust the comment for
> testing xattr support:
>
>          * Check if upper/work fs supports trusted.overlay.* xattr


Updated documentation and comments.

Pushed new series to:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git#ovl-unpriv-v3

Based on the feedback, I feel it's ready for v5.11, so merged this
into #overlayfs-next as well.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-10 15:18     ` Miklos Szeredi
@ 2020-12-14  5:44       ` Amir Goldstein
  2020-12-14 13:23         ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Amir Goldstein @ 2020-12-14  5:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel, Dmitry Vyukov

On Thu, Dec 10, 2020 at 5:18 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Dec 8, 2020 at 12:15 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Dec 7, 2020 at 6:36 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > > ovl_ioctl_set_flags() does a capability check using flags, but then the
> > > real ioctl double-fetches flags and uses potentially different value.
> > >
> > > The "Check the capability before cred override" comment misleading: user
> > > can skip this check by presenting benign flags first and then overwriting
> > > them to non-benign flags.
> > >
> > > Just remove the cred override for now, hoping this doesn't cause a
> > > regression.
> > >
> > > The proper solution is to create a new setxflags i_op (patches are in the
> > > works).
> > >
> > > Xfstests don't show a regression.
> > >
> > > Reported-by: Dmitry Vyukov <dvyukov@google.com>
> > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> >
> > Looks reasonable
> >
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > > ---
> > >  fs/overlayfs/file.c | 75 ++-------------------------------------------
> > >  1 file changed, 3 insertions(+), 72 deletions(-)
> > >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index efccb7c1f9bc..3cd1590f2030 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -541,46 +541,26 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd,
> > >                            unsigned long arg)
> > >  {
> > >         struct fd real;
> > > -       const struct cred *old_cred;
> > >         long ret;
> > >
> > >         ret = ovl_real_fdget(file, &real);
> > >         if (ret)
> > >                 return ret;
> > >
> > > -       old_cred = ovl_override_creds(file_inode(file)->i_sb);
> > >         ret = security_file_ioctl(real.file, cmd, arg);
> > >         if (!ret)
> > >                 ret = vfs_ioctl(real.file, cmd, arg);
> > > -       revert_creds(old_cred);
> > >
> > >         fdput(real);
> > >
> > >         return ret;
> > >  }
> > >
> >
> >
> > I wonder if we shouldn't leave a comment behind to explain
> > that no override is intentional.
>
> Comment added.
>
> > I also wonder if "Permission model" sections shouldn't be saying
> > something about ioctl() (current task checks only)? or we just treat
> > this is a breakage of the permission model that needs to be fixed?
>
> This is a breakage of the permission model.  But I don't think this is
> a serious breakage, or one that actually matters.
>

Perhaps, but there is a much bigger issue with this change IMO.
Not because of dropping rule (b) of the permission model, but because
of relaxing rule (a).

Should overlayfs respect the conservative interpretation as it partly did
until this commit, a lower file must not lose IMMUTABLE/APPEND_ONLY
after copy up, but that is exactly what is going to happen if we first
copy up and then fail permission check on setting the flags.

It's true that before this change, file could still be copied up and system
crash would leave it without the flags, but after the change it is much
worse as the flags completely lose their meaning on lower files when
any unprivileged process can remove them.

So I suggest that you undo all the changes except for the no override.

And this calls for a fork of generic/545 to overlay test with lower files.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 08/10] ovl: do not fail because of O_NOATIME
  2020-12-11 14:44     ` Miklos Szeredi
@ 2020-12-14  5:49       ` Amir Goldstein
  0 siblings, 0 replies; 39+ messages in thread
From: Amir Goldstein @ 2020-12-14  5:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel

On Fri, Dec 11, 2020 at 4:44 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Dec 8, 2020 at 12:32 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Dec 7, 2020 at 6:37 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > > In case the file cannot be opened with O_NOATIME because of lack of
> > > capabilities, then clear O_NOATIME instead of failing.
> > >
> > > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> > > ---
> > >  fs/overlayfs/file.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index dc767034d37b..d6ac7ac66410 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -53,9 +53,10 @@ static struct file *ovl_open_realfile(const struct file *file,
> > >         err = inode_permission(realinode, MAY_OPEN | acc_mode);
> > >         if (err) {
> > >                 realfile = ERR_PTR(err);
> > > -       } else if (!inode_owner_or_capable(realinode)) {
> > > -               realfile = ERR_PTR(-EPERM);
> > >         } else {
> > > +               if (!inode_owner_or_capable(realinode))
> > > +                       flags &= ~O_NOATIME;
> > > +
> >
> > Isn't that going to break:
> >
> >         flags |= OVL_OPEN_FLAGS;
> >
> >         /* If some flag changed that cannot be changed then something's amiss */
> >         if (WARN_ON((file->f_flags ^ flags) & ~OVL_SETFL_MASK))
> >
> > IOW setting a flag that is allowed to change will fail because of
> > missing O_ATIME in file->f_flags.
>
> Well spotted.  I just removed those lines as a fix.   The check never
> triggered since its introduction in 4.19, so I guess it isn't worth
> adding more complexity for.
>
> >
> > I guess we need test coverage for SETFL.
>
> There might be some in ltp, haven't checked.  Would be nice if the fs
> related ltp tests could be integrated into xfstests.
>

There is some test coverage for SETFL in xfstests.

The t_immutable tests for one, but those would not run if the mounter
has no CAP_LINUX_IMMUTABLE, so would not have been useful to
detect the problem above.

fsstress also seems to have support for SETFL ops, but I am not sure
in how many tests it is exercises and perhaps the relevant problem
would have been covered by some stress test that is not in the 'quick'
tests group.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-14  5:44       ` Amir Goldstein
@ 2020-12-14 13:23         ` Miklos Szeredi
  2020-12-14 14:47           ` Amir Goldstein
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2020-12-14 13:23 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel, Dmitry Vyukov

On Mon, Dec 14, 2020 at 6:44 AM Amir Goldstein <amir73il@gmail.com> wrote:

> Perhaps, but there is a much bigger issue with this change IMO.
> Not because of dropping rule (b) of the permission model, but because
> of relaxing rule (a).
>
> Should overlayfs respect the conservative interpretation as it partly did
> until this commit, a lower file must not lose IMMUTABLE/APPEND_ONLY
> after copy up, but that is exactly what is going to happen if we first
> copy up and then fail permission check on setting the flags.

Yeah, it's a mess.   This will hopefully sort it out, as it will allow
easier copy up of flags:

https://lore.kernel.org/linux-fsdevel/20201123141207.GC327006@miu.piliscsaba.redhat.com/

In actual fact losing S_APPEND is not currently prevented by copy-up
triggered by anything other than FS_IOC_SETX*, and even that is prone
to races as indicated by the bug report that resulted in this patch.

Let's just fix the IMMUTABLE case:

  - if the file is already copied up with data (since the overlay
ioctl implementation currently uses the realdata), then we're fine to
copy up

  - if the file is not IMMUTABLE to start with, then also fine to copy
up; even if the op will fail after copy up we haven't done anything
that wouldn't be possible without this particular codepath

  - if task has CAP_LINUX_IMMUTABLE (can add/remove immutable) then
it's also fine to copy up since we can be fairly sure that the actual
setflags will succeed as well.  If not, that can be a problem, but as
I've said copying up IMMUTABLE and other flags should really be done
by the copy up code, otherwise it won't work properly.

Something like this incremental should be good,  I think:

@@ -576,6 +576,15 @@ static long ovl_ioctl_set_flags(struct f

  inode_lock(inode);

+ /*
+ * Prevent copy up if immutable and has no CAP_LINUX_IMMUTABLE
+ * capability.
+ */
+ ret = -EPERM;
+ if (!ovl_has_upperdata(inode) && IS_IMMUTABLE(inode) &&
+     !capable(CAP_LINUX_IMMUTABLE))
+ goto unlock;
+
  ret = ovl_maybe_copy_up(file_dentry(file), O_WRONLY);
  if (ret)
  goto unlock;

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 04/10] ovl: make ioctl() safe
  2020-12-14 13:23         ` Miklos Szeredi
@ 2020-12-14 14:47           ` Amir Goldstein
  0 siblings, 0 replies; 39+ messages in thread
From: Amir Goldstein @ 2020-12-14 14:47 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, Eric W . Biederman, linux-fsdevel, overlayfs,
	LSM List, linux-kernel, Dmitry Vyukov

On Mon, Dec 14, 2020 at 3:24 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Dec 14, 2020 at 6:44 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > Perhaps, but there is a much bigger issue with this change IMO.
> > Not because of dropping rule (b) of the permission model, but because
> > of relaxing rule (a).
> >
> > Should overlayfs respect the conservative interpretation as it partly did
> > until this commit, a lower file must not lose IMMUTABLE/APPEND_ONLY
> > after copy up, but that is exactly what is going to happen if we first
> > copy up and then fail permission check on setting the flags.
>
> Yeah, it's a mess.   This will hopefully sort it out, as it will allow
> easier copy up of flags:
>
> https://lore.kernel.org/linux-fsdevel/20201123141207.GC327006@miu.piliscsaba.redhat.com/
>
> In actual fact losing S_APPEND is not currently prevented by copy-up
> triggered by anything other than FS_IOC_SETX*, and even that is prone
> to races as indicated by the bug report that resulted in this patch.
>
> Let's just fix the IMMUTABLE case:
>
>   - if the file is already copied up with data (since the overlay
> ioctl implementation currently uses the realdata), then we're fine to
> copy up
>
>   - if the file is not IMMUTABLE to start with, then also fine to copy
> up; even if the op will fail after copy up we haven't done anything
> that wouldn't be possible without this particular codepath
>
>   - if task has CAP_LINUX_IMMUTABLE (can add/remove immutable) then
> it's also fine to copy up since we can be fairly sure that the actual
> setflags will succeed as well.  If not, that can be a problem, but as
> I've said copying up IMMUTABLE and other flags should really be done
> by the copy up code, otherwise it won't work properly.
>
> Something like this incremental should be good,  I think:
>
> @@ -576,6 +576,15 @@ static long ovl_ioctl_set_flags(struct f
>
>   inode_lock(inode);
>
> + /*
> + * Prevent copy up if immutable and has no CAP_LINUX_IMMUTABLE
> + * capability.
> + */
> + ret = -EPERM;
> + if (!ovl_has_upperdata(inode) && IS_IMMUTABLE(inode) &&
> +     !capable(CAP_LINUX_IMMUTABLE))
> + goto unlock;
> +
>   ret = ovl_maybe_copy_up(file_dentry(file), O_WRONLY);
>   if (ret)
>   goto unlock;
>

I guess that looks ok for a band aid.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 00/10] allow unprivileged overlay mounts
  2020-12-10  9:39     ` Miklos Szeredi
@ 2020-12-15 11:03       ` John Johansen
  0 siblings, 0 replies; 39+ messages in thread
From: John Johansen @ 2020-12-15 11:03 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Tetsuo Handa, Miklos Szeredi, Eric W . Biederman, linux-fsdevel,
	overlayfs, LSM, linux-kernel

On 12/10/20 1:39 AM, Miklos Szeredi wrote:
> On Thu, Dec 10, 2020 at 10:00 AM John Johansen
> <john.johansen@canonical.com> wrote:
>>
>> On 12/8/20 2:27 AM, Tetsuo Handa wrote:
>>> On 2020/12/08 1:32, Miklos Szeredi wrote:
>>>> A general observation is that overlayfs does not call security_path_*()
>>>> hooks on the underlying fs.  I don't see this as a problem, because a
>>>> simple bind mount done inside a private mount namespace also defeats the
>>>> path based security checks.  Maybe I'm missing something here, so I'm
>>>> interested in comments from AppArmor and Tomoyo developers.
>>>
>>> Regarding TOMOYO, I don't want overlayfs to call security_path_*() hooks on the
>>> underlying fs, but the reason is different. It is not because a simple bind mount
>>> done inside a private mount namespace defeats the path based security checks.
>>> TOMOYO does want to check what device/filesystem is mounted on which location. But
>>> currently TOMOYO is failing to check it due to fsopen()/fsmount()/move_mount() API.
>>>
>>
>> Regardless of TOMOYO's approach I would say that overlays should call the
>> security_path_*() hooks, making it possible for an LSM to do something based off of
>> them when needed.
>>
>> The current state of private mounts with regard to path based mediation is broken.
>> I just haven't had time to try and come up with an acceptable fix for it. overlayfs
>> is actually broken under apparmor mediation, and accesses to the lower layer end up
>> getting denied but there is no way to properly allow them. So policy that hits this
>> needs a flag set that allows for it in a very hacky way (its on the list of things
>> to fix).
>>
>> Path based mediation has to carefully control mounts otherwise policy can be
>> circumvented as Miklos rightly points out. Ideally path based LSM wouldn't allow
>> you to do the simple bind mount inside a private mount namespace (at least not
>> unless policy allowed for it). AppArmor does mediate the mount hooks and bind
>> mounts in a private mount namespace (if they go through the LSM mount hooks) will
>> be denied. Again the problem is how to allow them, and this is broken.
> 
> Okay, so what does that mean for overlayfs?
> 
> AA can deny the overlay mount just as well as the bind mount, and it
> can allow it just as well as the bind mount.  Policy could be the
> same.
> 
not entirely the private mount is always detached from the namespace and we have
no way to correlate the private mount to the overlay so the only safe thing is
to deny operations on the private mount.

Ideally we would have a way to provide some kind of correlation so we could
make an informed decision.

> Also all the security_path_ hooks will still get called for each
> access on overlayfs itself.  They won't be called for the accesses
> which overlayfs does on underlying layers, but is that needed?
> 
maybe but maybe not. The thing is apparmor doesn't just used security_path_
hooks which means we get some operation on leaking through that are using
the private mount. Its this inconsistent partial view that is problematic.

> Overlay could call those hooks itself (since the vfs_ helpers don't)
> but the big question is whether that makes any sense.  AFAICS it might
> make sense, but only if AA would correctly handle bind mounts, and
> especially detached bind mounts (which is what overlayfs technically
> uses).
> 

I haven't investigated enough to say for sure whether AA needs the path
hooks called, but I think it probably doesn't. What AA does need is a
way to determine what to do with private mounts when it encounters them
in the none path hooks.

That could be a simple as a hook when the private mount is setup so it
can setup some state.

> Thanks,
> Miklos
> 
> Tja
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
  2020-12-09  1:53   ` James Morris
@ 2021-01-01 17:35   ` Eric W. Biederman
  2021-01-11 13:49     ` Miklos Szeredi
  1 sibling, 1 reply; 39+ messages in thread
From: Eric W. Biederman @ 2021-01-01 17:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-unionfs, linux-security-module,
	linux-kernel, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> cap_convert_nscap() does permission checking as well as conversion of the
> xattr value conditionally based on fs's user-ns.
>
> This is needed by overlayfs and probably other layered fs (ecryptfs) and is
> what vfs_foo() is supposed to do anyway.

Well crap.

I just noticed this and it turns out this change is wrong.

The problem is that it reads the rootid from the v3 fscap, using
current_user_ns() and then writes it using the sb->s_user_ns.

So any time the stacked filesystems sb->s_user_ns do not match or
current_user_ns does not match sb->s_user_ns this could be a problem.

In a stacked filesystem a second pass through vfs_setxattr will result
in the rootid being translated a second time (with potentially the wrong
namespaces).  I think because of the security checks a we won't write
something we shouldn't be able to write to the filesystem.  Still we
will be writing the wrong v3 fscap which can go quite badly.

This doesn't look terribly difficult to fix.

Probably convert this into a fs independent form using uids in
init_user_ns at input and have cap_convert_nscap convert the v3 fscap
into the filesystem dependent form.  With some way for stackable
filesystems to just skip converting it from the filesystem independent
format.

Uids in xattrs that are expected to go directly to disk, but aren't
always suitable for going directly to disk are tricky.

Eric

> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/xattr.c                 | 17 +++++++++++------
>  include/linux/capability.h |  2 +-
>  security/commoncap.c       |  3 +--
>  3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index cd7a563e8bcd..fd57153b1f61 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -276,8 +276,16 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
>  {
>  	struct inode *inode = dentry->d_inode;
>  	struct inode *delegated_inode = NULL;
> +	const void  *orig_value = value;
>  	int error;
>  
> +	if (size && strcmp(name, XATTR_NAME_CAPS) == 0) {
> +		error = cap_convert_nscap(dentry, &value, size);
> +		if (error < 0)
> +			return error;
> +		size = error;
> +	}
> +
>  retry_deleg:
>  	inode_lock(inode);
>  	error = __vfs_setxattr_locked(dentry, name, value, size, flags,
> @@ -289,6 +297,9 @@ vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
>  		if (!error)
>  			goto retry_deleg;
>  	}
> +	if (value != orig_value)
> +		kfree(value);
> +
>  	return error;
>  }
>  EXPORT_SYMBOL_GPL(vfs_setxattr);
> @@ -537,12 +548,6 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
>  		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
>  		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
>  			posix_acl_fix_xattr_from_user(kvalue, size);
> -		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
> -			error = cap_convert_nscap(d, &kvalue, size);
> -			if (error < 0)
> -				goto out;
> -			size = error;
> -		}
>  	}
>  
>  	error = vfs_setxattr(d, kname, kvalue, size, flags);
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 1e7fe311cabe..b2f698915c0f 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -270,6 +270,6 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> -extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
> +extern int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size);
>  
>  #endif /* !_LINUX_CAPABILITY_H */
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 59bf3c1674c8..bacc1111d871 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -473,7 +473,7 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
>   *
>   * If all is ok, we return the new size, on error return < 0.
>   */
> -int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
> +int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size)
>  {
>  	struct vfs_ns_cap_data *nscap;
>  	uid_t nsrootid;
> @@ -516,7 +516,6 @@ int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
>  	nscap->magic_etc = cpu_to_le32(nsmagic);
>  	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
>  
> -	kvfree(*ivalue);
>  	*ivalue = nscap;
>  	return newsize;
>  }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-01 17:35   ` Eric W. Biederman
@ 2021-01-11 13:49     ` Miklos Szeredi
  2021-01-12  0:14       ` Eric W. Biederman
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2021-01-11 13:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, linux-fsdevel, linux-unionfs,
	linux-security-module, linux-kernel, Serge E. Hallyn

On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:
> Miklos Szeredi <mszeredi@redhat.com> writes:
> 
> > cap_convert_nscap() does permission checking as well as conversion of the
> > xattr value conditionally based on fs's user-ns.
> >
> > This is needed by overlayfs and probably other layered fs (ecryptfs) and is
> > what vfs_foo() is supposed to do anyway.
> 
> Well crap.
> 
> I just noticed this and it turns out this change is wrong.
> 
> The problem is that it reads the rootid from the v3 fscap, using
> current_user_ns() and then writes it using the sb->s_user_ns.
> 
> So any time the stacked filesystems sb->s_user_ns do not match or
> current_user_ns does not match sb->s_user_ns this could be a problem.
> 
> In a stacked filesystem a second pass through vfs_setxattr will result
> in the rootid being translated a second time (with potentially the wrong
> namespaces).  I think because of the security checks a we won't write
> something we shouldn't be able to write to the filesystem.  Still we
> will be writing the wrong v3 fscap which can go quite badly.
> 
> This doesn't look terribly difficult to fix.
> 
> Probably convert this into a fs independent form using uids in
> init_user_ns at input and have cap_convert_nscap convert the v3 fscap
> into the filesystem dependent form.  With some way for stackable
> filesystems to just skip converting it from the filesystem independent
> format.
> 
> Uids in xattrs that are expected to go directly to disk, but aren't
> always suitable for going directly to disk are tricky.

I've been looking at this for a couple of days and can't say I clearly
understand everything yet.

For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid of
zero, right?

If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
succeeding unconditionally while v3 one being either converted to v2, rejected
or left as v3 depending on current_user_ns())?

Anyway, here's a patch that I think fixes getxattr() layering for
security.capability.  Does basically what you suggested.  Slight change of
semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
hasn't worked for these since the introduction of v3 in 4.14.  Untested.

I still need to wrap my head around the permission requirements for the
setxattr() case...

Thanks,
Miklos

---
 fs/overlayfs/super.c       |   15 +++
 include/linux/capability.h |    2 
 include/linux/fs.h         |    1 
 security/commoncap.c       |  210 ++++++++++++++++++++++++---------------------
 4 files changed, 132 insertions(+), 96 deletions(-)

--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -395,6 +395,20 @@ static int ovl_remount(struct super_bloc
 	return ret;
 }
 
+static int ovl_cap_get(struct dentry *dentry,
+		       struct vfs_ns_cap_data *nscap)
+{
+	int res;
+	const struct cred *old_cred;
+	struct dentry *realdentry = ovl_dentry_real(dentry);
+
+	old_cred = ovl_override_creds(dentry->d_sb);
+	res = vfs_cap_get(realdentry, nscap);
+	revert_creds(old_cred);
+
+	return res;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
@@ -405,6 +419,7 @@ static const struct super_operations ovl
 	.statfs		= ovl_statfs,
 	.show_options	= ovl_show_options,
 	.remount_fs	= ovl_remount,
+	.cap_get	= ovl_cap_get,
 };
 
 enum {
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -272,4 +272,6 @@ extern int get_vfs_caps_from_disk(const
 
 extern int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size);
 
+int vfs_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap);
+
 #endif /* !_LINUX_CAPABILITY_H */
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1963,6 +1963,7 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	int (*cap_get)(struct dentry *dentry, struct vfs_ns_cap_data *nscap);
 };
 
 /*
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -341,6 +341,13 @@ static __u32 sansflags(__u32 m)
 	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
 }
 
+static bool is_v1header(size_t size, const struct vfs_cap_data *cap)
+{
+	if (size != XATTR_CAPS_SZ_1)
+		return false;
+	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_1;
+}
+
 static bool is_v2header(size_t size, const struct vfs_cap_data *cap)
 {
 	if (size != XATTR_CAPS_SZ_2)
@@ -355,6 +362,72 @@ static bool is_v3header(size_t size, con
 	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_3;
 }
 
+static bool validheader(size_t size, const struct vfs_cap_data *cap)
+{
+	return is_v1header(size, cap) || is_v2header(size, cap) || is_v3header(size, cap);
+}
+
+static void cap_to_v3(const struct vfs_cap_data *cap,
+			 struct vfs_ns_cap_data *nscap)
+{
+	u32 magic, nsmagic;
+
+	nsmagic = VFS_CAP_REVISION_3;
+	magic = le32_to_cpu(cap->magic_etc);
+	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
+		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
+	nscap->magic_etc = cpu_to_le32(nsmagic);
+	nscap->rootid = cpu_to_le32(0);
+}
+
+static int default_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap)
+{
+	int err;
+	ssize_t size;
+	kuid_t kroot;
+	uid_t root, mappedroot;
+	char *tmpbuf = NULL;
+	struct vfs_cap_data *cap;
+	struct user_namespace *fs_ns = dentry->d_sb->s_user_ns;
+
+	size = vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS, &tmpbuf,
+				  sizeof(struct vfs_ns_cap_data), GFP_NOFS);
+	if (size < 0)
+		return size;
+
+	cap = (struct vfs_cap_data *) tmpbuf;
+	err = -EINVAL;
+	if (!validheader(size, cap))
+		goto out;
+
+	memset(nscap, 0, sizeof(*nscap));
+	memcpy(nscap, tmpbuf, size);
+	if (!is_v3header(size, cap))
+		cap_to_v3(cap, nscap);
+
+	/* Convert rootid from fs user namespace to init user namespace */
+	root = le32_to_cpu(nscap->rootid);
+	kroot = make_kuid(fs_ns, root);
+	mappedroot = from_kuid(&init_user_ns, kroot);
+	nscap->rootid = cpu_to_le32(mappedroot);
+
+	err = 0;
+out:
+	kfree(tmpbuf);
+	return err;
+}
+
+int vfs_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	if (sb->s_op->cap_get)
+		return sb->s_op->cap_get(dentry, nscap);
+	else
+		return default_cap_get(dentry, nscap);
+}
+EXPORT_SYMBOL(vfs_cap_get);
+
 /*
  * getsecurity: We are called for security.* before any attempt to read the
  * xattr from the inode itself.
@@ -369,14 +442,15 @@ static bool is_v3header(size_t size, con
 int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 			  bool alloc)
 {
-	int size, ret;
+	int ret;
+	ssize_t size;
 	kuid_t kroot;
+	__le32 nsmagic, magic;
 	uid_t root, mappedroot;
-	char *tmpbuf = NULL;
+	void *tmpbuf = NULL;
 	struct vfs_cap_data *cap;
-	struct vfs_ns_cap_data *nscap;
+	struct vfs_ns_cap_data nscap;
 	struct dentry *dentry;
-	struct user_namespace *fs_ns;
 
 	if (strcmp(name, "capability") != 0)
 		return -EOPNOTSUPP;
@@ -385,67 +459,50 @@ int cap_inode_getsecurity(struct inode *
 	if (!dentry)
 		return -EINVAL;
 
-	size = sizeof(struct vfs_ns_cap_data);
-	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
-				 &tmpbuf, size, GFP_NOFS);
+	ret = vfs_cap_get(dentry, &nscap);
 	dput(dentry);
 
 	if (ret < 0)
 		return ret;
 
-	fs_ns = inode->i_sb->s_user_ns;
-	cap = (struct vfs_cap_data *) tmpbuf;
-	if (is_v2header((size_t) ret, cap)) {
-		/* If this is sizeof(vfs_cap_data) then we're ok with the
-		 * on-disk value, so return that.  */
-		if (alloc)
-			*buffer = tmpbuf;
-		else
-			kfree(tmpbuf);
-		return ret;
-	} else if (!is_v3header((size_t) ret, cap)) {
-		kfree(tmpbuf);
-		return -EINVAL;
-	}
+	tmpbuf = kmalloc(sizeof(struct vfs_ns_cap_data), GFP_NOFS);
+	if (!tmpbuf)
+		return -ENOMEM;
 
-	nscap = (struct vfs_ns_cap_data *) tmpbuf;
-	root = le32_to_cpu(nscap->rootid);
-	kroot = make_kuid(fs_ns, root);
+	root = le32_to_cpu(nscap.rootid);
+	kroot = make_kuid(&init_user_ns, root);
 
 	/* If the root kuid maps to a valid uid in current ns, then return
 	 * this as a nscap. */
 	mappedroot = from_kuid(current_user_ns(), kroot);
 	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
+		size = sizeof(struct vfs_cap_data);
 		if (alloc) {
 			*buffer = tmpbuf;
-			nscap->rootid = cpu_to_le32(mappedroot);
-		} else
-			kfree(tmpbuf);
-		return size;
+			tmpbuf = NULL;
+			nscap.rootid = cpu_to_le32(mappedroot);
+			memcpy(*buffer, &nscap, size);
+		}
+		goto out;
 	}
 
-	if (!rootid_owns_currentns(kroot)) {
-		kfree(tmpbuf);
-		return -EOPNOTSUPP;
-	}
+	size = -EOPNOTSUPP;
+	if (!rootid_owns_currentns(kroot))
+		goto out;
 
 	/* This comes from a parent namespace.  Return as a v2 capability */
 	size = sizeof(struct vfs_cap_data);
 	if (alloc) {
-		*buffer = kmalloc(size, GFP_ATOMIC);
-		if (*buffer) {
-			struct vfs_cap_data *cap = *buffer;
-			__le32 nsmagic, magic;
-			magic = VFS_CAP_REVISION_2;
-			nsmagic = le32_to_cpu(nscap->magic_etc);
-			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
-				magic |= VFS_CAP_FLAGS_EFFECTIVE;
-			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
-			cap->magic_etc = cpu_to_le32(magic);
-		} else {
-			size = -ENOMEM;
-		}
+		cap = *buffer = tmpbuf;
+		tmpbuf = NULL;
+		magic = VFS_CAP_REVISION_2;
+		nsmagic = le32_to_cpu(nscap.magic_etc);
+		if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
+			magic |= VFS_CAP_FLAGS_EFFECTIVE;
+		memcpy(&cap->data, &nscap.data, sizeof(__le32) * 2 * VFS_CAP_U32);
+		cap->magic_etc = cpu_to_le32(magic);
 	}
+out:
 	kfree(tmpbuf);
 	return size;
 }
@@ -462,11 +519,6 @@ static kuid_t rootid_from_xattr(const vo
 	return make_kuid(task_ns, rootid);
 }
 
-static bool validheader(size_t size, const struct vfs_cap_data *cap)
-{
-	return is_v2header(size, cap) || is_v3header(size, cap);
-}
-
 /*
  * User requested a write of security.capability.  If needed, update the
  * xattr to change from v2 to v3, or to fixup the v3 rootid.
@@ -570,74 +622,40 @@ static inline int bprm_caps_from_vfs_cap
 int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
 {
 	struct inode *inode = d_backing_inode(dentry);
-	__u32 magic_etc;
-	unsigned tocopy, i;
-	int size;
-	struct vfs_ns_cap_data data, *nscaps = &data;
-	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
-	kuid_t rootkuid;
-	struct user_namespace *fs_ns;
+	unsigned int i;
+	int ret;
+	struct vfs_ns_cap_data nscaps;
 
 	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
 
 	if (!inode)
 		return -ENODATA;
 
-	fs_ns = inode->i_sb->s_user_ns;
-	size = __vfs_getxattr((struct dentry *)dentry, inode,
-			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
-	if (size == -ENODATA || size == -EOPNOTSUPP)
+	ret = vfs_cap_get((struct dentry *) dentry, &nscaps);
+	if (ret == -ENODATA || ret == -EOPNOTSUPP)
 		/* no data, that's ok */
 		return -ENODATA;
 
-	if (size < 0)
-		return size;
-
-	if (size < sizeof(magic_etc))
-		return -EINVAL;
-
-	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
+	if (ret < 0)
+		return ret;
 
-	rootkuid = make_kuid(fs_ns, 0);
-	switch (magic_etc & VFS_CAP_REVISION_MASK) {
-	case VFS_CAP_REVISION_1:
-		if (size != XATTR_CAPS_SZ_1)
-			return -EINVAL;
-		tocopy = VFS_CAP_U32_1;
-		break;
-	case VFS_CAP_REVISION_2:
-		if (size != XATTR_CAPS_SZ_2)
-			return -EINVAL;
-		tocopy = VFS_CAP_U32_2;
-		break;
-	case VFS_CAP_REVISION_3:
-		if (size != XATTR_CAPS_SZ_3)
-			return -EINVAL;
-		tocopy = VFS_CAP_U32_3;
-		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
-		break;
+	cpu_caps->magic_etc = le32_to_cpu(nscaps.magic_etc);
+	cpu_caps->rootid = make_kuid(&init_user_ns, le32_to_cpu(nscaps.rootid));
 
-	default:
-		return -EINVAL;
-	}
 	/* Limit the caps to the mounter of the filesystem
 	 * or the more limited uid specified in the xattr.
 	 */
-	if (!rootid_owns_currentns(rootkuid))
+	if (!rootid_owns_currentns(cpu_caps->rootid))
 		return -ENODATA;
 
 	CAP_FOR_EACH_U32(i) {
-		if (i >= tocopy)
-			break;
-		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
-		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscaps.data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscaps.data[i].inheritable);
 	}
 
 	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
 	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
 
-	cpu_caps->rootid = rootkuid;
-
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-11 13:49     ` Miklos Szeredi
@ 2021-01-12  0:14       ` Eric W. Biederman
  2021-01-12  9:43         ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Eric W. Biederman @ 2021-01-12  0:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, linux-unionfs,
	linux-security-module, linux-kernel, Serge E. Hallyn

Miklos Szeredi <miklos@szeredi.hu> writes:

> On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:
>> Miklos Szeredi <mszeredi@redhat.com> writes:
>> 
>> > cap_convert_nscap() does permission checking as well as conversion of the
>> > xattr value conditionally based on fs's user-ns.
>> >
>> > This is needed by overlayfs and probably other layered fs (ecryptfs) and is
>> > what vfs_foo() is supposed to do anyway.
>> 
>> Well crap.
>> 
>> I just noticed this and it turns out this change is wrong.
>> 
>> The problem is that it reads the rootid from the v3 fscap, using
>> current_user_ns() and then writes it using the sb->s_user_ns.
>> 
>> So any time the stacked filesystems sb->s_user_ns do not match or
>> current_user_ns does not match sb->s_user_ns this could be a problem.
>> 
>> In a stacked filesystem a second pass through vfs_setxattr will result
>> in the rootid being translated a second time (with potentially the wrong
>> namespaces).  I think because of the security checks a we won't write
>> something we shouldn't be able to write to the filesystem.  Still we
>> will be writing the wrong v3 fscap which can go quite badly.
>> 
>> This doesn't look terribly difficult to fix.
>> 
>> Probably convert this into a fs independent form using uids in
>> init_user_ns at input and have cap_convert_nscap convert the v3 fscap
>> into the filesystem dependent form.  With some way for stackable
>> filesystems to just skip converting it from the filesystem independent
>> format.
>> 
>> Uids in xattrs that are expected to go directly to disk, but aren't
>> always suitable for going directly to disk are tricky.
>
> I've been looking at this for a couple of days and can't say I clearly
> understand everything yet.
>
> For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid of
> zero, right?

Yes.  This assumes that everything is translated into the uids of the
target filesystem.

> If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
> succeeding unconditionally while v3 one being either converted to v2, rejected
> or left as v3 depending on current_user_ns())?

As I understand it v2 fscaps have always succeeded unconditionally.  The
only case I can see for a v2 fscap might not succeed when read is if the
filesystem is outside of the initial user namespace.


> Anyway, here's a patch that I think fixes getxattr() layering for
> security.capability.  Does basically what you suggested.  Slight change of
> semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
> hasn't worked for these since the introduction of v3 in 4.14.
> Untested.

Taking a look.  The goal of change how these operate is to make it so
that layered filesystems can just pass through the data if they don't
want to change anything (even with the user namespaces of the
filesystems in question are different).

Feedback on the code below:
- cap_get should be in inode_operations like get_acl and set_acl.

- cap_get should return a cpu_vfs_cap_data.

  Which means that only make_kuid is needed when reading the cap from
  disk.

  Which means that except for the rootid_owns_currentns check (which
  needs to happen elsewhere) default_cap_get should be today's
  get_vfs_cap_from_disk.

- With the introduction of cap_get I believe commoncap should stop
  implementing the security_inode_getsecurity hook, and rather have
  getxattr observe is the file capability xatter and call the new
  vfs_cap_get then translate to a v2 or v3 cap as appropriate when
  returning the cap to userspace.

I think that would put the code on a solid comprehensible foundation.

Eric


> I still need to wrap my head around the permission requirements for the
> setxattr() case...
>
> Thanks,
> Miklos
>
> ---
>  fs/overlayfs/super.c       |   15 +++
>  include/linux/capability.h |    2 
>  include/linux/fs.h         |    1 
>  security/commoncap.c       |  210 ++++++++++++++++++++++++---------------------
>  4 files changed, 132 insertions(+), 96 deletions(-)
>
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -395,6 +395,20 @@ static int ovl_remount(struct super_bloc
>  	return ret;
>  }
>  
> +static int ovl_cap_get(struct dentry *dentry,
> +		       struct vfs_ns_cap_data *nscap)
> +{
> +	int res;
> +	const struct cred *old_cred;
> +	struct dentry *realdentry = ovl_dentry_real(dentry);
> +
> +	old_cred = ovl_override_creds(dentry->d_sb);
> +	res = vfs_cap_get(realdentry, nscap);
> +	revert_creds(old_cred);
> +
> +	return res;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>  	.alloc_inode	= ovl_alloc_inode,
>  	.free_inode	= ovl_free_inode,
> @@ -405,6 +419,7 @@ static const struct super_operations ovl
>  	.statfs		= ovl_statfs,
>  	.show_options	= ovl_show_options,
>  	.remount_fs	= ovl_remount,
> +	.cap_get	= ovl_cap_get,
>  };
>  
>  enum {
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -272,4 +272,6 @@ extern int get_vfs_caps_from_disk(const
>  
>  extern int cap_convert_nscap(struct dentry *dentry, const void **ivalue, size_t size);
>  
> +int vfs_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap);
> +
>  #endif /* !_LINUX_CAPABILITY_H */
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1963,6 +1963,7 @@ struct super_operations {
>  				  struct shrink_control *);
>  	long (*free_cached_objects)(struct super_block *,
>  				    struct shrink_control *);
> +	int (*cap_get)(struct dentry *dentry, struct vfs_ns_cap_data *nscap);
>  };
>  
>  /*
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -341,6 +341,13 @@ static __u32 sansflags(__u32 m)
>  	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
>  }
>  
> +static bool is_v1header(size_t size, const struct vfs_cap_data *cap)
> +{
> +	if (size != XATTR_CAPS_SZ_1)
> +		return false;
> +	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_1;
> +}
> +
>  static bool is_v2header(size_t size, const struct vfs_cap_data *cap)
>  {
>  	if (size != XATTR_CAPS_SZ_2)
> @@ -355,6 +362,72 @@ static bool is_v3header(size_t size, con
>  	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_3;
>  }
>  
> +static bool validheader(size_t size, const struct vfs_cap_data *cap)
> +{
> +	return is_v1header(size, cap) || is_v2header(size, cap) || is_v3header(size, cap);
> +}
> +
> +static void cap_to_v3(const struct vfs_cap_data *cap,
> +			 struct vfs_ns_cap_data *nscap)
> +{
> +	u32 magic, nsmagic;
> +
> +	nsmagic = VFS_CAP_REVISION_3;
> +	magic = le32_to_cpu(cap->magic_etc);
> +	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> +		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> +	nscap->magic_etc = cpu_to_le32(nsmagic);
> +	nscap->rootid = cpu_to_le32(0);
> +}
> +
> +static int default_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap)
> +{
> +	int err;
> +	ssize_t size;
> +	kuid_t kroot;
> +	uid_t root, mappedroot;
> +	char *tmpbuf = NULL;
> +	struct vfs_cap_data *cap;
> +	struct user_namespace *fs_ns = dentry->d_sb->s_user_ns;
> +
> +	size = vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS, &tmpbuf,
> +				  sizeof(struct vfs_ns_cap_data), GFP_NOFS);
> +	if (size < 0)
> +		return size;
> +
> +	cap = (struct vfs_cap_data *) tmpbuf;
> +	err = -EINVAL;
> +	if (!validheader(size, cap))
> +		goto out;
> +
> +	memset(nscap, 0, sizeof(*nscap));
> +	memcpy(nscap, tmpbuf, size);
> +	if (!is_v3header(size, cap))
> +		cap_to_v3(cap, nscap);
> +
> +	/* Convert rootid from fs user namespace to init user namespace */
> +	root = le32_to_cpu(nscap->rootid);
> +	kroot = make_kuid(fs_ns, root);
> +	mappedroot = from_kuid(&init_user_ns, kroot);
> +	nscap->rootid = cpu_to_le32(mappedroot);
> +
> +	err = 0;
> +out:
> +	kfree(tmpbuf);
> +	return err;
> +}
> +
> +int vfs_cap_get(struct dentry *dentry, struct vfs_ns_cap_data *nscap)
> +{
> +	struct super_block *sb = dentry->d_sb;
> +
> +	if (sb->s_op->cap_get)
> +		return sb->s_op->cap_get(dentry, nscap);
> +	else
> +		return default_cap_get(dentry, nscap);
> +}
> +EXPORT_SYMBOL(vfs_cap_get);
> +
>  /*
>   * getsecurity: We are called for security.* before any attempt to read the
>   * xattr from the inode itself.
> @@ -369,14 +442,15 @@ static bool is_v3header(size_t size, con
>  int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>  			  bool alloc)
>  {
> -	int size, ret;
> +	int ret;
> +	ssize_t size;
>  	kuid_t kroot;
> +	__le32 nsmagic, magic;
>  	uid_t root, mappedroot;
> -	char *tmpbuf = NULL;
> +	void *tmpbuf = NULL;
>  	struct vfs_cap_data *cap;
> -	struct vfs_ns_cap_data *nscap;
> +	struct vfs_ns_cap_data nscap;
>  	struct dentry *dentry;
> -	struct user_namespace *fs_ns;
>  
>  	if (strcmp(name, "capability") != 0)
>  		return -EOPNOTSUPP;
> @@ -385,67 +459,50 @@ int cap_inode_getsecurity(struct inode *
>  	if (!dentry)
>  		return -EINVAL;
>  
> -	size = sizeof(struct vfs_ns_cap_data);
> -	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
> -				 &tmpbuf, size, GFP_NOFS);
> +	ret = vfs_cap_get(dentry, &nscap);
>  	dput(dentry);
>  
>  	if (ret < 0)
>  		return ret;
>  
> -	fs_ns = inode->i_sb->s_user_ns;
> -	cap = (struct vfs_cap_data *) tmpbuf;
> -	if (is_v2header((size_t) ret, cap)) {
> -		/* If this is sizeof(vfs_cap_data) then we're ok with the
> -		 * on-disk value, so return that.  */
> -		if (alloc)
> -			*buffer = tmpbuf;
> -		else
> -			kfree(tmpbuf);
> -		return ret;
> -	} else if (!is_v3header((size_t) ret, cap)) {
> -		kfree(tmpbuf);
> -		return -EINVAL;
> -	}
> +	tmpbuf = kmalloc(sizeof(struct vfs_ns_cap_data), GFP_NOFS);
> +	if (!tmpbuf)
> +		return -ENOMEM;
>  
> -	nscap = (struct vfs_ns_cap_data *) tmpbuf;
> -	root = le32_to_cpu(nscap->rootid);
> -	kroot = make_kuid(fs_ns, root);
> +	root = le32_to_cpu(nscap.rootid);
> +	kroot = make_kuid(&init_user_ns, root);
>  
>  	/* If the root kuid maps to a valid uid in current ns, then return
>  	 * this as a nscap. */
>  	mappedroot = from_kuid(current_user_ns(), kroot);
>  	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> +		size = sizeof(struct vfs_cap_data);
>  		if (alloc) {
>  			*buffer = tmpbuf;
> -			nscap->rootid = cpu_to_le32(mappedroot);
> -		} else
> -			kfree(tmpbuf);
> -		return size;
> +			tmpbuf = NULL;
> +			nscap.rootid = cpu_to_le32(mappedroot);
> +			memcpy(*buffer, &nscap, size);
> +		}
> +		goto out;
>  	}
>  
> -	if (!rootid_owns_currentns(kroot)) {
> -		kfree(tmpbuf);
> -		return -EOPNOTSUPP;
> -	}
> +	size = -EOPNOTSUPP;
> +	if (!rootid_owns_currentns(kroot))
> +		goto out;
>  
>  	/* This comes from a parent namespace.  Return as a v2 capability */
>  	size = sizeof(struct vfs_cap_data);
>  	if (alloc) {
> -		*buffer = kmalloc(size, GFP_ATOMIC);
> -		if (*buffer) {
> -			struct vfs_cap_data *cap = *buffer;
> -			__le32 nsmagic, magic;
> -			magic = VFS_CAP_REVISION_2;
> -			nsmagic = le32_to_cpu(nscap->magic_etc);
> -			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> -				magic |= VFS_CAP_FLAGS_EFFECTIVE;
> -			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> -			cap->magic_etc = cpu_to_le32(magic);
> -		} else {
> -			size = -ENOMEM;
> -		}
> +		cap = *buffer = tmpbuf;
> +		tmpbuf = NULL;
> +		magic = VFS_CAP_REVISION_2;
> +		nsmagic = le32_to_cpu(nscap.magic_etc);
> +		if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> +			magic |= VFS_CAP_FLAGS_EFFECTIVE;
> +		memcpy(&cap->data, &nscap.data, sizeof(__le32) * 2 * VFS_CAP_U32);
> +		cap->magic_etc = cpu_to_le32(magic);
>  	}
> +out:
>  	kfree(tmpbuf);
>  	return size;
>  }
> @@ -462,11 +519,6 @@ static kuid_t rootid_from_xattr(const vo
>  	return make_kuid(task_ns, rootid);
>  }
>  
> -static bool validheader(size_t size, const struct vfs_cap_data *cap)
> -{
> -	return is_v2header(size, cap) || is_v3header(size, cap);
> -}
> -
>  /*
>   * User requested a write of security.capability.  If needed, update the
>   * xattr to change from v2 to v3, or to fixup the v3 rootid.
> @@ -570,74 +622,40 @@ static inline int bprm_caps_from_vfs_cap
>  int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
>  {
>  	struct inode *inode = d_backing_inode(dentry);
> -	__u32 magic_etc;
> -	unsigned tocopy, i;
> -	int size;
> -	struct vfs_ns_cap_data data, *nscaps = &data;
> -	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> -	kuid_t rootkuid;
> -	struct user_namespace *fs_ns;
> +	unsigned int i;
> +	int ret;
> +	struct vfs_ns_cap_data nscaps;
>  
>  	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>  
>  	if (!inode)
>  		return -ENODATA;
>  
> -	fs_ns = inode->i_sb->s_user_ns;
> -	size = __vfs_getxattr((struct dentry *)dentry, inode,
> -			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
> -	if (size == -ENODATA || size == -EOPNOTSUPP)
> +	ret = vfs_cap_get((struct dentry *) dentry, &nscaps);
> +	if (ret == -ENODATA || ret == -EOPNOTSUPP)
>  		/* no data, that's ok */
>  		return -ENODATA;
>  
> -	if (size < 0)
> -		return size;
> -
> -	if (size < sizeof(magic_etc))
> -		return -EINVAL;
> -
> -	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
> +	if (ret < 0)
> +		return ret;
>  
> -	rootkuid = make_kuid(fs_ns, 0);
> -	switch (magic_etc & VFS_CAP_REVISION_MASK) {
> -	case VFS_CAP_REVISION_1:
> -		if (size != XATTR_CAPS_SZ_1)
> -			return -EINVAL;
> -		tocopy = VFS_CAP_U32_1;
> -		break;
> -	case VFS_CAP_REVISION_2:
> -		if (size != XATTR_CAPS_SZ_2)
> -			return -EINVAL;
> -		tocopy = VFS_CAP_U32_2;
> -		break;
> -	case VFS_CAP_REVISION_3:
> -		if (size != XATTR_CAPS_SZ_3)
> -			return -EINVAL;
> -		tocopy = VFS_CAP_U32_3;
> -		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> -		break;
> +	cpu_caps->magic_etc = le32_to_cpu(nscaps.magic_etc);
> +	cpu_caps->rootid = make_kuid(&init_user_ns, le32_to_cpu(nscaps.rootid));
>  
> -	default:
> -		return -EINVAL;
> -	}
>  	/* Limit the caps to the mounter of the filesystem
>  	 * or the more limited uid specified in the xattr.
>  	 */
> -	if (!rootid_owns_currentns(rootkuid))
> +	if (!rootid_owns_currentns(cpu_caps->rootid))
>  		return -ENODATA;
>  
>  	CAP_FOR_EACH_U32(i) {
> -		if (i >= tocopy)
> -			break;
> -		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
> -		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
> +		cpu_caps->permitted.cap[i] = le32_to_cpu(nscaps.data[i].permitted);
> +		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscaps.data[i].inheritable);
>  	}
>  
>  	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>  	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>  
> -	cpu_caps->rootid = rootkuid;
> -
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-12  0:14       ` Eric W. Biederman
@ 2021-01-12  9:43         ` Miklos Szeredi
  2021-01-12 10:04           ` Miklos Szeredi
  2021-01-12 18:36           ` Eric W. Biederman
  0 siblings, 2 replies; 39+ messages in thread
From: Miklos Szeredi @ 2021-01-12  9:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, LSM, linux-kernel,
	Serge E. Hallyn

On Tue, Jan 12, 2021 at 1:15 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> writes:
>
> > On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:

> > For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid of
> > zero, right?
>
> Yes.  This assumes that everything is translated into the uids of the
> target filesystem.
>
> > If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
> > succeeding unconditionally while v3 one being either converted to v2, rejected
> > or left as v3 depending on current_user_ns())?
>
> As I understand it v2 fscaps have always succeeded unconditionally.  The
> only case I can see for a v2 fscap might not succeed when read is if the
> filesystem is outside of the initial user namespace.

Looking again, it's rather confusing.  cap_inode_getsecurity()
currently handles the following cases:

v1: -> fails with -EINVAL

v2: -> returns unconverted xattr

v3:
 a) rootid is mapped in the current namespace to non-zero:
     -> convert rootid

 b) rootid owns the current or ancerstor namespace:
     -> convert to v2

 c) rootid is not mapped and is not owner:
     -> return -EOPNOTSUPP -> falls back to unconverted v3

So lets take the example, where a tmpfs is created in a private user
namespace and one file has a v2 cap and the other an equivalent v3 cap
with a zero rootid.  This is the result when looking at it from

1) the namespace of the fs:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip

2) the initial namespace:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip [rootid=1000]

3) an unrelated namespace:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip

Note: in this last case getxattr will actually return a v3 cap with
zero rootid for "tt" which getcap does not display due to being zero.
I could do a setup with a nested namespaces that better demonstrate
the confusing nature of this, but I think this also proves the point.

At this point userspace simply cannot determine whether the returned
cap is in any way valid or not.

The following semantics would make a ton more sense, since getting a
v2 would indicate that rootid is unknown:

- if cap is v2 convert to v3 with zero rootid
- after this, check if rootid needs to be translated, if not return v3
- if yes, try to translate to current ns, if succeeds return translated v3
- if not mappable, return v2

Hmm?

> > Anyway, here's a patch that I think fixes getxattr() layering for
> > security.capability.  Does basically what you suggested.  Slight change of
> > semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
> > hasn't worked for these since the introduction of v3 in 4.14.
> > Untested.
>
> Taking a look.  The goal of change how these operate is to make it so
> that layered filesystems can just pass through the data if they don't
> want to change anything (even with the user namespaces of the
> filesystems in question are different).
>
> Feedback on the code below:
> - cap_get should be in inode_operations like get_acl and set_acl.

So it's not clear to me why xattr ops are per-sb and acl ops are per-inode.

> - cap_get should return a cpu_vfs_cap_data.
>
>   Which means that only make_kuid is needed when reading the cap from
>   disk.

It also means translating the cap bits back and forth between disk and
cpu endian.  Not a big deal, but...

>   Which means that except for the rootid_owns_currentns check (which
>   needs to happen elsewhere) default_cap_get should be today's
>   get_vfs_cap_from_disk.

That's true.   So what's the deal with v1 caps?  Support was silently
dropped for getxattr/setxattr but remained in get_vfs_caps_from_disk()
(I guess to not break legacy disk images), but maybe it's time to
deprecate v1 caps completely?

> - With the introduction of cap_get I believe commoncap should stop
>   implementing the security_inode_getsecurity hook, and rather have
>   getxattr observe is the file capability xatter and call the new
>   vfs_cap_get then translate to a v2 or v3 cap as appropriate when
>   returning the cap to userspace.

Confused.  vfs_cap_get() is the one the layered filesystem will
recurse with, so it must not translate the cap.   The one to do that
would be __vfs_getxattr(), right?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-12  9:43         ` Miklos Szeredi
@ 2021-01-12 10:04           ` Miklos Szeredi
  2021-01-12 18:36           ` Eric W. Biederman
  1 sibling, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2021-01-12 10:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, LSM, linux-kernel,
	Serge E. Hallyn

On Tue, Jan 12, 2021 at 10:43 AM Miklos Szeredi <miklos@szeredi.hu> wrote:

> The following semantics would make a ton more sense, since getting a
> v2 would indicate that rootid is unknown:
>
> - if cap is v2 convert to v3 with zero rootid
> - after this, check if rootid needs to be translated, if not return v3
> - if yes, try to translate to current ns, if succeeds return translated v3
> - if not mappable, return v2
>
> Hmm?

Actually, it would make even more sense to simply skip unconvertible
caps and return -ENODATA.  In fact that's what I thought would happen
until I looked at the -EOPNOTSUPP fallback code in vfs_getxattr().

Serge, do you remember what was the reason for the unconverted fallback?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-12  9:43         ` Miklos Szeredi
  2021-01-12 10:04           ` Miklos Szeredi
@ 2021-01-12 18:36           ` Eric W. Biederman
  2021-01-12 18:49             ` Eric W. Biederman
  1 sibling, 1 reply; 39+ messages in thread
From: Eric W. Biederman @ 2021-01-12 18:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, LSM, linux-kernel,
	Serge E. Hallyn, linux-api

Miklos Szeredi <miklos@szeredi.hu> writes:

> On Tue, Jan 12, 2021 at 1:15 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Miklos Szeredi <miklos@szeredi.hu> writes:
>>
>> > On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:
>
>> > For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid of
>> > zero, right?
>>
>> Yes.  This assumes that everything is translated into the uids of the
>> target filesystem.
>>
>> > If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
>> > succeeding unconditionally while v3 one being either converted to v2, rejected
>> > or left as v3 depending on current_user_ns())?
>>
>> As I understand it v2 fscaps have always succeeded unconditionally.  The
>> only case I can see for a v2 fscap might not succeed when read is if the
>> filesystem is outside of the initial user namespace.
>
> Looking again, it's rather confusing.  cap_inode_getsecurity()
> currently handles the following cases:
>
> v1: -> fails with -EINVAL
>
> v2: -> returns unconverted xattr
>
> v3:
>  a) rootid is mapped in the current namespace to non-zero:
>      -> convert rootid
>
>  b) rootid owns the current or ancerstor namespace:
>      -> convert to v2
>
>  c) rootid is not mapped and is not owner:
>      -> return -EOPNOTSUPP -> falls back to unconverted v3
>
> So lets take the example, where a tmpfs is created in a private user
> namespace and one file has a v2 cap and the other an equivalent v3 cap
> with a zero rootid.  This is the result when looking at it from
>
> 1) the namespace of the fs:
> ---------------------------------------
> t = cap_dac_override+eip
> tt = cap_dac_override+eip
>
> 2) the initial namespace:
> ---------------------------------------
> t = cap_dac_override+eip
> tt = cap_dac_override+eip [rootid=1000]
>
> 3) an unrelated namespace:
> ---------------------------------------
> t = cap_dac_override+eip
> tt = cap_dac_override+eip
>
> Note: in this last case getxattr will actually return a v3 cap with
> zero rootid for "tt" which getcap does not display due to being zero.
> I could do a setup with a nested namespaces that better demonstrate
> the confusing nature of this, but I think this also proves the point.

Yes.  There is real confusion on the reading case when the namespaces
do not match.

> At this point userspace simply cannot determine whether the returned
> cap is in any way valid or not.
>
> The following semantics would make a ton more sense, since getting a
> v2 would indicate that rootid is unknown:

> - if cap is v2 convert to v3 with zero rootid
> - after this, check if rootid needs to be translated, if not return v3
> - if yes, try to translate to current ns, if succeeds return translated v3
> - if not mappable, return v2
>
> Hmm?

So there is the basic question do we want to read the raw bytes on disk
or do we want to return something meaningful to the reader.  As the
existing tools use the xattr interface to set/clear fscaps returning
data to user space rather than raw bytes seems the perfered interface.

My ideal semantics would be:

- If current_user_ns() == sb->s_user_ns return the raw data.

  I don't know how to implement this first scenario while permitting
  stacked filesystems.
  
- Calculate the cpu_vfs_cap_data as get_vfs_caps_from_disk does.
  That gives the meaning of the xattr.

- If "from_kuid(current_userns(), krootid) == 0" return a v2 cap.

- If "rootid_owns_currentns()" return a v2 cap.

- Else return an error.  Probably a permission error.

  The fscap simply can not make sense to the user if the rootid does not
  map.  Return a v2 cap would imply that the caps are present on the
  executable (in the current context) which they are not.


>> > Anyway, here's a patch that I think fixes getxattr() layering for
>> > security.capability.  Does basically what you suggested.  Slight change of
>> > semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
>> > hasn't worked for these since the introduction of v3 in 4.14.
>> > Untested.
>>
>> Taking a look.  The goal of change how these operate is to make it so
>> that layered filesystems can just pass through the data if they don't
>> want to change anything (even with the user namespaces of the
>> filesystems in question are different).
>>
>> Feedback on the code below:
>> - cap_get should be in inode_operations like get_acl and set_acl.
>
> So it's not clear to me why xattr ops are per-sb and acl ops are per-inode.

I don't know why either.  What I do see is everything except
inode->i_sb->s_xattr (the list of xattr handlers) is in
inode_operations.

Especially permission.  So just for consistency I would keep everything
in the inode_operations.

>> - cap_get should return a cpu_vfs_cap_data.
>>
>>   Which means that only make_kuid is needed when reading the cap from
>>   disk.
>
> It also means translating the cap bits back and forth between disk and
> cpu endian.  Not a big deal, but...

For the very rare case of userspace reading and writing them.

For the common case of actually using them it should be optimal, and
it allows for non-standard implementations.  Anything where someone gets
clever we want the bits to be in cpu format to make mistakes harder.

>>   Which means that except for the rootid_owns_currentns check (which
>>   needs to happen elsewhere) default_cap_get should be today's
>>   get_vfs_cap_from_disk.
>
> That's true.   So what's the deal with v1 caps?  Support was silently
> dropped for getxattr/setxattr but remained in get_vfs_caps_from_disk()
> (I guess to not break legacy disk images), but maybe it's time to
> deprecate v1 caps completely?

I really don't remember.  When I look I see they appear to be a subset
of v2 caps.  I would have to look to see if they bits might have a
different meaning.

>> - With the introduction of cap_get I believe commoncap should stop
>>   implementing the security_inode_getsecurity hook, and rather have
>>   getxattr observe is the file capability xatter and call the new
>>   vfs_cap_get then translate to a v2 or v3 cap as appropriate when
>>   returning the cap to userspace.
>
> Confused.  vfs_cap_get() is the one the layered filesystem will
> recurse with, so it must not translate the cap.   The one to do that
> would be __vfs_getxattr(), right?

So there are two layers that I am worrying about.

The layer dealing with fscaps knowing they are fscaps.  That is
vfs_cap_get and friends.

Then there is the layer dealing with xattrs which should also handle
fscaps.  So that the xattr layer stacks properly I believe we need to
completely bypass vfs_getxattr and friends and at the edge of userspace
before there is any possibility of interception call vfs_cap_get and
translate the result to a form userspace can consume.

With xattrs and caps we are in part of the kernel that hasn't seen a lot
of attention and so is not the best factored.  So I am proposing we fix
up the abstractions to make it easier to implement stacked fscap support.

Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
  2021-01-12 18:36           ` Eric W. Biederman
@ 2021-01-12 18:49             ` Eric W. Biederman
  0 siblings, 0 replies; 39+ messages in thread
From: Eric W. Biederman @ 2021-01-12 18:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, LSM, linux-kernel,
	Serge E. Hallyn, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> So there is the basic question do we want to read the raw bytes on disk
> or do we want to return something meaningful to the reader.  As the
> existing tools use the xattr interface to set/clear fscaps returning
> data to user space rather than raw bytes seems the perfered interface.
>
> My ideal semantics would be:
>
> - If current_user_ns() == sb->s_user_ns return the raw data.
>
>   I don't know how to implement this first scenario while permitting
>   stacked filesystems.

    After a little more thought I do.

    In getxattr if the get_cap method is not implemented by the
    filesystem if current_user_ns() == sb->s_user_ns simply treat it as
    an ordinary xattr read/write.

    Otherwise call vfs_get_cap and translate the result as described
    below.
    
    The key point of this is it allows for seeing what is actually on
    disk (when it is not confusing).

> - Calculate the cpu_vfs_cap_data as get_vfs_caps_from_disk does.
>   That gives the meaning of the xattr.
>
> - If "from_kuid(current_userns(), krootid) == 0" return a v2 cap.
>
> - If "rootid_owns_currentns()" return a v2 cap.
>
> - Else return an error.  Probably a permission error.
>
>   The fscap simply can not make sense to the user if the rootid does not
>   map.  Return a v2 cap would imply that the caps are present on the
>   executable (in the current context) which they are not.


Eric

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, back to index

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-07 16:32 [PATCH v2 00/10] allow unprivileged overlay mounts Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr() Miklos Szeredi
2020-12-09  1:53   ` James Morris
2021-01-01 17:35   ` Eric W. Biederman
2021-01-11 13:49     ` Miklos Szeredi
2021-01-12  0:14       ` Eric W. Biederman
2021-01-12  9:43         ` Miklos Szeredi
2021-01-12 10:04           ` Miklos Szeredi
2021-01-12 18:36           ` Eric W. Biederman
2021-01-12 18:49             ` Eric W. Biederman
2020-12-07 16:32 ` [PATCH v2 02/10] vfs: verify source area in vfs_dedupe_file_range_one() Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 03/10] ovl: check privs before decoding file handle Miklos Szeredi
2020-12-08 13:49   ` Amir Goldstein
2020-12-09 10:13     ` Miklos Szeredi
2020-12-09 16:20       ` Miklos Szeredi
2020-12-09 18:16         ` Amir Goldstein
2020-12-07 16:32 ` [PATCH v2 04/10] ovl: make ioctl() safe Miklos Szeredi
2020-12-08 11:11   ` Amir Goldstein
2020-12-10 15:18     ` Miklos Szeredi
2020-12-14  5:44       ` Amir Goldstein
2020-12-14 13:23         ` Miklos Szeredi
2020-12-14 14:47           ` Amir Goldstein
2020-12-09  1:57   ` James Morris
2020-12-10 15:19     ` Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 05/10] ovl: simplify file splice Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 06/10] ovl: user xattr Miklos Szeredi
2020-12-08 13:10   ` Amir Goldstein
2020-12-11 14:55     ` Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 07/10] ovl: do not fail when setting origin xattr Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 08/10] ovl: do not fail because of O_NOATIME Miklos Szeredi
2020-12-08 11:29   ` Amir Goldstein
2020-12-11 14:44     ` Miklos Szeredi
2020-12-14  5:49       ` Amir Goldstein
2020-12-07 16:32 ` [PATCH v2 09/10] ovl: do not get metacopy for userxattr Miklos Szeredi
2020-12-07 16:32 ` [PATCH v2 10/10] ovl: unprivieged mounts Miklos Szeredi
2020-12-08 10:27 ` [PATCH v2 00/10] allow unprivileged overlay mounts Tetsuo Handa
2020-12-10  8:56   ` John Johansen
2020-12-10  9:39     ` Miklos Szeredi
2020-12-15 11:03       ` John Johansen

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git