linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Introduce v3 namespaced file capabilities
@ 2017-04-19 16:48 Serge E. Hallyn
  2017-04-21 21:37 ` Eric W. Biederman
  2017-04-21 23:58 ` Eric W. Biederman
  0 siblings, 2 replies; 10+ messages in thread
From: Serge E. Hallyn @ 2017-04-19 16:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan

Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr.  If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable.  Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege.  For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise.  The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr.  It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change.  Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid.  Subsequently, any task
executing the file which has the noted kuid as its root uid, or which is
in a descendent user_ns of such a user_ns, will run the file with
capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace.  The kernel will
translate that rootid to an absolute uid, and write that to disk.  After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file.  A task may overwrite an existing xattr so long as it is
privileged over the inode.  Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr.  This check can be re-added if deemed useful.

This allows a simple setcap/setxattr to work, should allow tar to work,
and should allow us to support tar in one namespace and untar in another
while preserving the capability, without risking leaking privilege into
a parent namespace.

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
   Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
   Nov 07 2016: convert rootid from and to fs user_ns
   (From ebiederm: mar 28 2017)
     commoncap.c: fix typos - s/v4/v3
     get_vfs_caps_from_disk: clarify the fs_ns root access check
     nsfscaps: change the code split for cap_inode_setxattr()
   Apr 09 2017:
       don't return v3 cap for caps owned by current root.
      return a v2 cap for a true v2 cap in non-init ns
   Apr 18 2017:
      . Change the flow of fscap writing to support s_user_ns writing.
      . Remove refuse_fcap_overwrite().  The value of the previous
        xattr doesn't matter.
---
 fs/xattr.c                      |  30 ++++-
 include/linux/capability.h      |   5 +-
 include/linux/security.h        |   2 +
 include/uapi/linux/capability.h |  22 +++-
 security/commoncap.c            | 237 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 268 insertions(+), 28 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 7e3317c..75cc65a 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 		const void *value, size_t size, int flags)
 {
 	struct inode *inode = dentry->d_inode;
-	int error = -EAGAIN;
+	int error;
+	void *wvalue = NULL;
+	size_t wsize = 0;
 	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
 				   XATTR_SECURITY_PREFIX_LEN);
 
-	if (issec)
+	if (issec) {
 		inode->i_flags &= ~S_NOSEC;
+
+		if (!strcmp(name, "security.capability")) {
+			error = cap_setxattr_convert_nscap(dentry, value, size,
+					&wvalue, &wsize);
+			if (error < 0)
+				return error;
+			if (wvalue) {
+				value = wvalue;
+				size = wsize;
+			}
+		}
+	}
+
+	error = -EAGAIN;
+
 	if (inode->i_opflags & IOP_XATTR) {
 		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
 		if (!error) {
@@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 						     size, flags);
 		}
 	} else {
-		if (unlikely(is_bad_inode(inode)))
-			return -EIO;
+		if (unlikely(is_bad_inode(inode))) {
+			error = -EIO;
+			goto out;
+		}
 	}
 	if (error == -EAGAIN) {
 		error = -EOPNOTSUPP;
@@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 		}
 	}
 
+out:
+	kfree(wvalue);
 	return error;
 }
 
-
 int
 vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
 		size_t size, int flags)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6ffb67e..b973433 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -248,4 +248,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
+extern int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value,
+		size_t size, void **wvalue, size_t *wsize);
+
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index 96899fa..bd49cc1 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
 extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
 extern int cap_inode_need_killpriv(struct dentry *dentry);
 extern int cap_inode_killpriv(struct dentry *dentry);
+extern int cap_inode_getsecurity(struct inode *inode, const char *name,
+				 void **buffer, bool alloc);
 extern int cap_mmap_addr(unsigned long addr);
 extern int cap_mmap_file(struct file *file, unsigned long reqprot,
 			 unsigned long prot, unsigned long flags);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 49bc062..fd4f87d 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
-#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
-#define VFS_CAP_U32             VFS_CAP_U32_2
-#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
+#define VFS_CAP_REVISION_3	0x03000000
+#define VFS_CAP_U32_3           2
+#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
+
+#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
+#define VFS_CAP_U32             VFS_CAP_U32_3
+#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
 
 struct vfs_cap_data {
 	__le32 magic_etc;            /* Little endian */
@@ -72,6 +76,18 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+/*
+ * same as vfs_cap_data but with a rootid at the end
+ */
+struct vfs_ns_cap_data {
+	__le32 magic_etc;
+	struct {
+		__le32 permitted;    /* Little endian */
+		__le32 inheritable;  /* Little endian */
+	} data[VFS_CAP_U32];
+	__le32 rootid;
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/security/commoncap.c b/security/commoncap.c
index 78b3783..8abb9bf 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -332,6 +332,179 @@ int cap_inode_killpriv(struct dentry *dentry)
 	return error;
 }
 
+static bool rootid_owns_currentns(kuid_t kroot)
+{
+	struct user_namespace *ns;
+
+	if (!uid_valid(kroot))
+		return false;
+
+	for (ns = current_user_ns(); ; ns = ns->parent) {
+		if (from_kuid(ns, kroot) == 0)
+			return true;
+		if (ns == &init_user_ns)
+			break;
+	}
+
+	return false;
+}
+
+/*
+ * getsecurity: We are called for security.* before any attempt to read the
+ * xattr from the inode itself.
+ *
+ * This gives us a chance to read the on-disk value and convert it.  If we
+ * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
+ *
+ * Note we are not called by vfs_getxattr_alloc(), but that is only called
+ * by the integrity subsystem, which really wants the unconverted values -
+ * so that's good.
+ */
+int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
+			  bool alloc)
+{
+	int size, ret;
+	kuid_t kroot;
+	uid_t root, mappedroot;
+	char *tmpbuf = NULL;
+	struct vfs_ns_cap_data *nscap;
+	struct dentry *dentry;
+	struct user_namespace *fs_ns;
+
+	if (strcmp(name, "capability") != 0)
+		return -EOPNOTSUPP;
+
+	dentry = d_find_alias(inode);
+	if (!dentry)
+		return -EINVAL;
+
+	size = sizeof(struct vfs_ns_cap_data);
+	ret = vfs_getxattr_alloc(dentry, "security.capability",
+				 &tmpbuf, size, GFP_NOFS);
+
+	if (ret < 0)
+		return ret;
+
+	fs_ns = inode->i_sb->s_user_ns;
+	if (ret == sizeof(struct vfs_cap_data)) {
+		/* If this is sizeof(vfs_cap_data) then we're ok with the
+		 * on-disk value, so return that.  */
+		if (alloc)
+			*buffer = tmpbuf;
+		else
+			kfree(tmpbuf);
+		return ret;
+	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
+		kfree(tmpbuf);
+		return -EINVAL;
+	}
+
+	nscap = (struct vfs_ns_cap_data *) tmpbuf;
+	root = le32_to_cpu(nscap->rootid);
+	kroot = make_kuid(fs_ns, root);
+
+	/* If the root kuid maps to a valid uid in current ns, then return
+	 * this as a nscap. */
+	mappedroot = from_kuid(current_user_ns(), kroot);
+	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
+		if (alloc) {
+			*buffer = tmpbuf;
+			nscap->rootid = cpu_to_le32(mappedroot);
+		} else
+			kfree(tmpbuf);
+		return size;
+	}
+
+	if (!rootid_owns_currentns(kroot)) {
+		kfree(tmpbuf);
+		return -EOPNOTSUPP;
+	}
+
+	/* This comes from a parent namespace.  Return as a v2 capability */
+	size = sizeof(struct vfs_cap_data);
+	if (alloc) {
+		*buffer = kmalloc(size, GFP_ATOMIC);
+		if (*buffer) {
+			struct vfs_cap_data *cap = *buffer;
+			__le32 nsmagic, magic;
+			magic = VFS_CAP_REVISION_2;
+			nsmagic = le32_to_cpu(nscap->magic_etc);
+			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
+				magic |= VFS_CAP_FLAGS_EFFECTIVE;
+			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+			cap->magic_etc = cpu_to_le32(magic);
+		}
+	}
+	kfree(tmpbuf);
+	return size;
+}
+
+static kuid_t rootid_from_xattr(const void *value, size_t size,
+				struct user_namespace *task_ns)
+{
+	const struct vfs_ns_cap_data *nscap = value;
+	uid_t rootid = 0;
+
+	if (size == XATTR_CAPS_SZ_3)
+		rootid = le32_to_cpu(nscap->rootid);
+
+	return make_kuid(task_ns, rootid);
+}
+
+/*
+ * User requested a write of security.capability.
+ *
+ * If all is ok, we return 0.  If the capability needs to be converted,
+ * wvalue will be allocated (and needs to be freed) with the new value.
+ * On error, return < 0.
+ */
+int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t size,
+				    void **wvalue, size_t *wsize)
+{
+	struct vfs_ns_cap_data *nscap;
+	uid_t nsrootid;
+	const struct vfs_cap_data *cap = value;
+	__u32 magic, nsmagic;
+	struct inode *inode = d_backing_inode(dentry);
+	struct user_namespace *task_ns = current_user_ns(),
+		*fs_ns = inode->i_sb->s_user_ns;
+	kuid_t rootid;
+
+	if (!value)
+		return -EINVAL;
+	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
+		return -EINVAL;
+	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+		return -EPERM;
+	if (size == XATTR_CAPS_SZ_2)
+		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
+			// user is privileged, just write the v2
+			return 0;
+
+	rootid = rootid_from_xattr(value, size, task_ns);
+	if (!uid_valid(rootid))
+		return -EINVAL;
+
+	nsrootid = from_kuid(fs_ns, rootid);
+	if (nsrootid == -1)
+		return -EINVAL;
+
+	*wsize = sizeof(struct vfs_ns_cap_data);
+	nscap = kmalloc(*wsize, GFP_ATOMIC);
+	if (!nscap)
+		return -ENOMEM;
+	nscap->rootid = cpu_to_le32(nsrootid);
+	nsmagic = VFS_CAP_REVISION_3;
+	magic = le32_to_cpu(cap->magic_etc);
+	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
+		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
+	nscap->magic_etc = cpu_to_le32(nsmagic);
+	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+
+	*wvalue = nscap;
+	return 0;
+}
+
 /*
  * Calculate the new process capability sets from the capability sets attached
  * to a file.
@@ -385,7 +558,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	__u32 magic_etc;
 	unsigned tocopy, i;
 	int size;
-	struct vfs_cap_data caps;
+	struct vfs_ns_cap_data data, *nscaps = &data;
+	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
+	kuid_t rootkuid;
+	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
 
 	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
 
@@ -393,18 +569,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 		return -ENODATA;
 
 	size = __vfs_getxattr((struct dentry *)dentry, inode,
-			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
+			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
 	if (size == -ENODATA || size == -EOPNOTSUPP)
 		/* no data, that's ok */
 		return -ENODATA;
+
 	if (size < 0)
 		return size;
 
 	if (size < sizeof(magic_etc))
 		return -EINVAL;
 
-	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
+	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
 
+	rootkuid = make_kuid(fs_ns, 0);
 	switch (magic_etc & VFS_CAP_REVISION_MASK) {
 	case VFS_CAP_REVISION_1:
 		if (size != XATTR_CAPS_SZ_1)
@@ -416,15 +594,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 			return -EINVAL;
 		tocopy = VFS_CAP_U32_2;
 		break;
+	case VFS_CAP_REVISION_3:
+		if (size != XATTR_CAPS_SZ_3)
+			return -EINVAL;
+		tocopy = VFS_CAP_U32_3;
+		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
+		break;
+
 	default:
 		return -EINVAL;
 	}
+	/* Limit the caps to the mounter of the filesystem
+	 * or the more limited uid specified in the xattr.
+	 */
+	if (!rootid_owns_currentns(rootkuid))
+		return -ENODATA;
 
 	CAP_FOR_EACH_U32(i) {
 		if (i >= tocopy)
 			break;
-		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
-		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
+		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
 	}
 
 	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
@@ -462,8 +652,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -660,15 +850,16 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
-		if (!capable(CAP_SETFCAP))
-			return -EPERM;
+	/* Ignore non-security xattrs */
+	if (strncmp(name, XATTR_SECURITY_PREFIX,
+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+		return 0;
+
+	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
+	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
-	}
 
-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-	    !capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -686,15 +877,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
-		if (!capable(CAP_SETFCAP))
+	/* Ignore non-security xattrs */
+	if (strncmp(name, XATTR_SECURITY_PREFIX,
+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+		return 0;
+
+	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
+		/* security.capability gets namespaced */
+		struct inode *inode = d_backing_inode(dentry);
+		if (!inode)
+			return -EINVAL;
+		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
 			return -EPERM;
 		return 0;
 	}
 
-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-	    !capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -1082,6 +1280,7 @@ struct security_hook_list capability_hooks[] = {
 	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
 	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
 	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
+	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
 	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
 	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
 	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
-- 
2.7.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-19 16:48 [PATCH] Introduce v3 namespaced file capabilities Serge E. Hallyn
@ 2017-04-21 21:37 ` Eric W. Biederman
  2017-04-21 21:50   ` Serge E. Hallyn
  2017-04-21 23:58 ` Eric W. Biederman
  1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-21 21:37 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan


Serge,

Is there any change of a Signed-off-by on this patch?  Otherwise I don't
think we can merge it.

Eric

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Root in a non-initial user ns cannot be trusted to write a traditional
> security.capability xattr.  If it were allowed to do so, then any
> unprivileged user on the host could map his own uid to root in a private
> namespace, write the xattr, and execute the file with privilege on the
> host.
>
> However supporting file capabilities in a user namespace is very
> desirable.  Not doing so means that any programs designed to run with
> limited privilege must continue to support other methods of gaining and
> dropping privilege.  For instance a program installer must detect
> whether file capabilities can be assigned, and assign them if so but set
> setuid-root otherwise.  The program in turn must know how to drop
> partial capabilities, and do so only if setuid-root.
>
> This patch introduces v3 of the security.capability xattr.  It builds a
> vfs_ns_cap_data struct by appending a uid_t rootid to struct
> vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
> namespace which mounted the filesystem, usually init_user_ns) of the
> root id in whose namespaces the file capabilities may take effect.
>
> When a task asks to write a v2 security.capability xattr, if it is
> privileged with respect to the userns which mounted the filesystem, then
> nothing should change.  Otherwise, the kernel will transparently rewrite
> the xattr as a v3 with the appropriate rootid.  Subsequently, any task
> executing the file which has the noted kuid as its root uid, or which is
> in a descendent user_ns of such a user_ns, will run the file with
> capabilities.
>
> Similarly when asking to read file capabilities, a v3 capability will
> be presented as v2 if it applies to the caller's namespace.
>
> If a task writes a v3 security.capability, then it can provide a uid for
> the xattr so long as the uid is valid in its own user namespace, and it
> is privileged with CAP_SETFCAP over its namespace.  The kernel will
> translate that rootid to an absolute uid, and write that to disk.  After
> this, a task in the writer's namespace will not be able to use those
> capabilities (unless rootid was 0), but a task in a namespace where the
> given uid is root will.
>
> Only a single security.capability xattr may exist at a time for a given
> file.  A task may overwrite an existing xattr so long as it is
> privileged over the inode.  Note this is a departure from previous
> semantics, which required privilege to remove a security.capability
> xattr.  This check can be re-added if deemed useful.
>
> This allows a simple setcap/setxattr to work, should allow tar to work,
> and should allow us to support tar in one namespace and untar in another
> while preserving the capability, without risking leaking privilege into
> a parent namespace.
>
> A patch to linux-test-project adding a new set of tests for this
> functionality is in the nsfscaps branch at github.com/hallyn/ltp
>
> Changelog:
>    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
>    Nov 07 2016: convert rootid from and to fs user_ns
>    (From ebiederm: mar 28 2017)
>      commoncap.c: fix typos - s/v4/v3
>      get_vfs_caps_from_disk: clarify the fs_ns root access check
>      nsfscaps: change the code split for cap_inode_setxattr()
>    Apr 09 2017:
>        don't return v3 cap for caps owned by current root.
>       return a v2 cap for a true v2 cap in non-init ns
>    Apr 18 2017:
>       . Change the flow of fscap writing to support s_user_ns writing.
>       . Remove refuse_fcap_overwrite().  The value of the previous
>         xattr doesn't matter.
> ---
>  fs/xattr.c                      |  30 ++++-
>  include/linux/capability.h      |   5 +-
>  include/linux/security.h        |   2 +
>  include/uapi/linux/capability.h |  22 +++-
>  security/commoncap.c            | 237 ++++++++++++++++++++++++++++++++++++----
>  5 files changed, 268 insertions(+), 28 deletions(-)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 7e3317c..75cc65a 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  		const void *value, size_t size, int flags)
>  {
>  	struct inode *inode = dentry->d_inode;
> -	int error = -EAGAIN;
> +	int error;
> +	void *wvalue = NULL;
> +	size_t wsize = 0;
>  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>  				   XATTR_SECURITY_PREFIX_LEN);
>  
> -	if (issec)
> +	if (issec) {
>  		inode->i_flags &= ~S_NOSEC;
> +
> +		if (!strcmp(name, "security.capability")) {
> +			error = cap_setxattr_convert_nscap(dentry, value, size,
> +					&wvalue, &wsize);
> +			if (error < 0)
> +				return error;
> +			if (wvalue) {
> +				value = wvalue;
> +				size = wsize;
> +			}
> +		}
> +	}
> +
> +	error = -EAGAIN;
> +
>  	if (inode->i_opflags & IOP_XATTR) {
>  		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
>  		if (!error) {
> @@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  						     size, flags);
>  		}
>  	} else {
> -		if (unlikely(is_bad_inode(inode)))
> -			return -EIO;
> +		if (unlikely(is_bad_inode(inode))) {
> +			error = -EIO;
> +			goto out;
> +		}
>  	}
>  	if (error == -EAGAIN) {
>  		error = -EOPNOTSUPP;
> @@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  		}
>  	}
>  
> +out:
> +	kfree(wvalue);
>  	return error;
>  }
>  
> -
>  int
>  vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
>  		size_t size, int flags)
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 6ffb67e..b973433 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,7 +13,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include <uapi/linux/capability.h>
> -
> +#include <linux/uidgid.h>
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> @@ -248,4 +248,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> +extern int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value,
> +		size_t size, void **wvalue, size_t *wsize);
> +
>  #endif /* !_LINUX_CAPABILITY_H */
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 96899fa..bd49cc1 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
>  extern int cap_inode_need_killpriv(struct dentry *dentry);
>  extern int cap_inode_killpriv(struct dentry *dentry);
> +extern int cap_inode_getsecurity(struct inode *inode, const char *name,
> +				 void **buffer, bool alloc);
>  extern int cap_mmap_addr(unsigned long addr);
>  extern int cap_mmap_file(struct file *file, unsigned long reqprot,
>  			 unsigned long prot, unsigned long flags);
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 49bc062..fd4f87d 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
>  #define VFS_CAP_U32_2           2
>  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>  
> -#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> -#define VFS_CAP_U32             VFS_CAP_U32_2
> -#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
> +#define VFS_CAP_REVISION_3	0x03000000
> +#define VFS_CAP_U32_3           2
> +#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
> +
> +#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
> +#define VFS_CAP_U32             VFS_CAP_U32_3
> +#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
>  
>  struct vfs_cap_data {
>  	__le32 magic_etc;            /* Little endian */
> @@ -72,6 +76,18 @@ struct vfs_cap_data {
>  	} data[VFS_CAP_U32];
>  };
>  
> +/*
> + * same as vfs_cap_data but with a rootid at the end
> + */
> +struct vfs_ns_cap_data {
> +	__le32 magic_etc;
> +	struct {
> +		__le32 permitted;    /* Little endian */
> +		__le32 inheritable;  /* Little endian */
> +	} data[VFS_CAP_U32];
> +	__le32 rootid;
> +};
> +
>  #ifndef __KERNEL__
>  
>  /*
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 78b3783..8abb9bf 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -332,6 +332,179 @@ int cap_inode_killpriv(struct dentry *dentry)
>  	return error;
>  }
>  
> +static bool rootid_owns_currentns(kuid_t kroot)
> +{
> +	struct user_namespace *ns;
> +
> +	if (!uid_valid(kroot))
> +		return false;
> +
> +	for (ns = current_user_ns(); ; ns = ns->parent) {
> +		if (from_kuid(ns, kroot) == 0)
> +			return true;
> +		if (ns == &init_user_ns)
> +			break;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * getsecurity: We are called for security.* before any attempt to read the
> + * xattr from the inode itself.
> + *
> + * This gives us a chance to read the on-disk value and convert it.  If we
> + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> + *
> + * Note we are not called by vfs_getxattr_alloc(), but that is only called
> + * by the integrity subsystem, which really wants the unconverted values -
> + * so that's good.
> + */
> +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> +			  bool alloc)
> +{
> +	int size, ret;
> +	kuid_t kroot;
> +	uid_t root, mappedroot;
> +	char *tmpbuf = NULL;
> +	struct vfs_ns_cap_data *nscap;
> +	struct dentry *dentry;
> +	struct user_namespace *fs_ns;
> +
> +	if (strcmp(name, "capability") != 0)
> +		return -EOPNOTSUPP;
> +
> +	dentry = d_find_alias(inode);
> +	if (!dentry)
> +		return -EINVAL;
> +
> +	size = sizeof(struct vfs_ns_cap_data);
> +	ret = vfs_getxattr_alloc(dentry, "security.capability",
> +				 &tmpbuf, size, GFP_NOFS);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	fs_ns = inode->i_sb->s_user_ns;
> +	if (ret == sizeof(struct vfs_cap_data)) {
> +		/* If this is sizeof(vfs_cap_data) then we're ok with the
> +		 * on-disk value, so return that.  */
> +		if (alloc)
> +			*buffer = tmpbuf;
> +		else
> +			kfree(tmpbuf);
> +		return ret;
> +	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
> +		kfree(tmpbuf);
> +		return -EINVAL;
> +	}
> +
> +	nscap = (struct vfs_ns_cap_data *) tmpbuf;
> +	root = le32_to_cpu(nscap->rootid);
> +	kroot = make_kuid(fs_ns, root);
> +
> +	/* If the root kuid maps to a valid uid in current ns, then return
> +	 * this as a nscap. */
> +	mappedroot = from_kuid(current_user_ns(), kroot);
> +	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> +		if (alloc) {
> +			*buffer = tmpbuf;
> +			nscap->rootid = cpu_to_le32(mappedroot);
> +		} else
> +			kfree(tmpbuf);
> +		return size;
> +	}
> +
> +	if (!rootid_owns_currentns(kroot)) {
> +		kfree(tmpbuf);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* This comes from a parent namespace.  Return as a v2 capability */
> +	size = sizeof(struct vfs_cap_data);
> +	if (alloc) {
> +		*buffer = kmalloc(size, GFP_ATOMIC);
> +		if (*buffer) {
> +			struct vfs_cap_data *cap = *buffer;
> +			__le32 nsmagic, magic;
> +			magic = VFS_CAP_REVISION_2;
> +			nsmagic = le32_to_cpu(nscap->magic_etc);
> +			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> +				magic |= VFS_CAP_FLAGS_EFFECTIVE;
> +			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> +			cap->magic_etc = cpu_to_le32(magic);
> +		}
> +	}
> +	kfree(tmpbuf);
> +	return size;
> +}
> +
> +static kuid_t rootid_from_xattr(const void *value, size_t size,
> +				struct user_namespace *task_ns)
> +{
> +	const struct vfs_ns_cap_data *nscap = value;
> +	uid_t rootid = 0;
> +
> +	if (size == XATTR_CAPS_SZ_3)
> +		rootid = le32_to_cpu(nscap->rootid);
> +
> +	return make_kuid(task_ns, rootid);
> +}
> +
> +/*
> + * User requested a write of security.capability.
> + *
> + * If all is ok, we return 0.  If the capability needs to be converted,
> + * wvalue will be allocated (and needs to be freed) with the new value.
> + * On error, return < 0.
> + */
> +int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t size,
> +				    void **wvalue, size_t *wsize)
> +{
> +	struct vfs_ns_cap_data *nscap;
> +	uid_t nsrootid;
> +	const struct vfs_cap_data *cap = value;
> +	__u32 magic, nsmagic;
> +	struct inode *inode = d_backing_inode(dentry);
> +	struct user_namespace *task_ns = current_user_ns(),
> +		*fs_ns = inode->i_sb->s_user_ns;
> +	kuid_t rootid;
> +
> +	if (!value)
> +		return -EINVAL;
> +	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
> +		return -EINVAL;
> +	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> +		return -EPERM;
> +	if (size == XATTR_CAPS_SZ_2)
> +		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
> +			// user is privileged, just write the v2
> +			return 0;
> +
> +	rootid = rootid_from_xattr(value, size, task_ns);
> +	if (!uid_valid(rootid))
> +		return -EINVAL;
> +
> +	nsrootid = from_kuid(fs_ns, rootid);
> +	if (nsrootid == -1)
> +		return -EINVAL;
> +
> +	*wsize = sizeof(struct vfs_ns_cap_data);
> +	nscap = kmalloc(*wsize, GFP_ATOMIC);
> +	if (!nscap)
> +		return -ENOMEM;
> +	nscap->rootid = cpu_to_le32(nsrootid);
> +	nsmagic = VFS_CAP_REVISION_3;
> +	magic = le32_to_cpu(cap->magic_etc);
> +	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> +		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> +	nscap->magic_etc = cpu_to_le32(nsmagic);
> +	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> +
> +	*wvalue = nscap;
> +	return 0;
> +}
> +
>  /*
>   * Calculate the new process capability sets from the capability sets attached
>   * to a file.
> @@ -385,7 +558,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>  	__u32 magic_etc;
>  	unsigned tocopy, i;
>  	int size;
> -	struct vfs_cap_data caps;
> +	struct vfs_ns_cap_data data, *nscaps = &data;
> +	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> +	kuid_t rootkuid;
> +	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
>  
>  	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>  
> @@ -393,18 +569,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>  		return -ENODATA;
>  
>  	size = __vfs_getxattr((struct dentry *)dentry, inode,
> -			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
> +			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
>  	if (size == -ENODATA || size == -EOPNOTSUPP)
>  		/* no data, that's ok */
>  		return -ENODATA;
> +
>  	if (size < 0)
>  		return size;
>  
>  	if (size < sizeof(magic_etc))
>  		return -EINVAL;
>  
> -	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
> +	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
>  
> +	rootkuid = make_kuid(fs_ns, 0);
>  	switch (magic_etc & VFS_CAP_REVISION_MASK) {
>  	case VFS_CAP_REVISION_1:
>  		if (size != XATTR_CAPS_SZ_1)
> @@ -416,15 +594,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>  			return -EINVAL;
>  		tocopy = VFS_CAP_U32_2;
>  		break;
> +	case VFS_CAP_REVISION_3:
> +		if (size != XATTR_CAPS_SZ_3)
> +			return -EINVAL;
> +		tocopy = VFS_CAP_U32_3;
> +		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> +		break;
> +
>  	default:
>  		return -EINVAL;
>  	}
> +	/* Limit the caps to the mounter of the filesystem
> +	 * or the more limited uid specified in the xattr.
> +	 */
> +	if (!rootid_owns_currentns(rootkuid))
> +		return -ENODATA;
>  
>  	CAP_FOR_EACH_U32(i) {
>  		if (i >= tocopy)
>  			break;
> -		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
> -		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
> +		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
> +		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
>  	}
>  
>  	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> @@ -462,8 +652,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>  	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>  	if (rc < 0) {
>  		if (rc == -EINVAL)
> -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> -				__func__, rc, bprm->filename);
> +			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> +					bprm->filename);
>  		else if (rc == -ENODATA)
>  			rc = 0;
>  		goto out;
> @@ -660,15 +850,16 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  		       const void *value, size_t size, int flags)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> -		if (!capable(CAP_SETFCAP))
> -			return -EPERM;
> +	/* Ignore non-security xattrs */
> +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> +		return 0;
> +
> +	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>  		return 0;
> -	}
>  
> -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> @@ -686,15 +877,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> -		if (!capable(CAP_SETFCAP))
> +	/* Ignore non-security xattrs */
> +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> +		return 0;
> +
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
> +		/* security.capability gets namespaced */
> +		struct inode *inode = d_backing_inode(dentry);
> +		if (!inode)
> +			return -EINVAL;
> +		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
>  			return -EPERM;
>  		return 0;
>  	}
>  
> -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> @@ -1082,6 +1280,7 @@ struct security_hook_list capability_hooks[] = {
>  	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
>  	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
>  	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
> +	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
>  	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
>  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
>  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-21 21:37 ` Eric W. Biederman
@ 2017-04-21 21:50   ` Serge E. Hallyn
  2017-04-21 21:53     ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2017-04-21 21:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Seth Forshee, lkml, linux-api,
	linux-security-module, Kees Cook, Andreas Gruenbacher,
	Andy Lutomirski, Andrew G. Morgan

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> Serge,
> 
> Is there any change of a Signed-off-by on this patch?  Otherwise I don't
> think we can merge it.

For pete's sake!  I'm sorry, i seem to remember with just about every
other project other than this. particular. patch.

Does this

Signed-off-by: Serge Hallyn <serge@hallyn.com>

suffice, or should I resend?


> Eric
> 
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Root in a non-initial user ns cannot be trusted to write a traditional
> > security.capability xattr.  If it were allowed to do so, then any
> > unprivileged user on the host could map his own uid to root in a private
> > namespace, write the xattr, and execute the file with privilege on the
> > host.
> >
> > However supporting file capabilities in a user namespace is very
> > desirable.  Not doing so means that any programs designed to run with
> > limited privilege must continue to support other methods of gaining and
> > dropping privilege.  For instance a program installer must detect
> > whether file capabilities can be assigned, and assign them if so but set
> > setuid-root otherwise.  The program in turn must know how to drop
> > partial capabilities, and do so only if setuid-root.
> >
> > This patch introduces v3 of the security.capability xattr.  It builds a
> > vfs_ns_cap_data struct by appending a uid_t rootid to struct
> > vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
> > namespace which mounted the filesystem, usually init_user_ns) of the
> > root id in whose namespaces the file capabilities may take effect.
> >
> > When a task asks to write a v2 security.capability xattr, if it is
> > privileged with respect to the userns which mounted the filesystem, then
> > nothing should change.  Otherwise, the kernel will transparently rewrite
> > the xattr as a v3 with the appropriate rootid.  Subsequently, any task
> > executing the file which has the noted kuid as its root uid, or which is
> > in a descendent user_ns of such a user_ns, will run the file with
> > capabilities.
> >
> > Similarly when asking to read file capabilities, a v3 capability will
> > be presented as v2 if it applies to the caller's namespace.
> >
> > If a task writes a v3 security.capability, then it can provide a uid for
> > the xattr so long as the uid is valid in its own user namespace, and it
> > is privileged with CAP_SETFCAP over its namespace.  The kernel will
> > translate that rootid to an absolute uid, and write that to disk.  After
> > this, a task in the writer's namespace will not be able to use those
> > capabilities (unless rootid was 0), but a task in a namespace where the
> > given uid is root will.
> >
> > Only a single security.capability xattr may exist at a time for a given
> > file.  A task may overwrite an existing xattr so long as it is
> > privileged over the inode.  Note this is a departure from previous
> > semantics, which required privilege to remove a security.capability
> > xattr.  This check can be re-added if deemed useful.
> >
> > This allows a simple setcap/setxattr to work, should allow tar to work,
> > and should allow us to support tar in one namespace and untar in another
> > while preserving the capability, without risking leaking privilege into
> > a parent namespace.
> >
> > A patch to linux-test-project adding a new set of tests for this
> > functionality is in the nsfscaps branch at github.com/hallyn/ltp
> >
> > Changelog:
> >    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
> >    Nov 07 2016: convert rootid from and to fs user_ns
> >    (From ebiederm: mar 28 2017)
> >      commoncap.c: fix typos - s/v4/v3
> >      get_vfs_caps_from_disk: clarify the fs_ns root access check
> >      nsfscaps: change the code split for cap_inode_setxattr()
> >    Apr 09 2017:
> >        don't return v3 cap for caps owned by current root.
> >       return a v2 cap for a true v2 cap in non-init ns
> >    Apr 18 2017:
> >       . Change the flow of fscap writing to support s_user_ns writing.
> >       . Remove refuse_fcap_overwrite().  The value of the previous
> >         xattr doesn't matter.
> > ---
> >  fs/xattr.c                      |  30 ++++-
> >  include/linux/capability.h      |   5 +-
> >  include/linux/security.h        |   2 +
> >  include/uapi/linux/capability.h |  22 +++-
> >  security/commoncap.c            | 237 ++++++++++++++++++++++++++++++++++++----
> >  5 files changed, 268 insertions(+), 28 deletions(-)
> >
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 7e3317c..75cc65a 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  		const void *value, size_t size, int flags)
> >  {
> >  	struct inode *inode = dentry->d_inode;
> > -	int error = -EAGAIN;
> > +	int error;
> > +	void *wvalue = NULL;
> > +	size_t wsize = 0;
> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >  				   XATTR_SECURITY_PREFIX_LEN);
> >  
> > -	if (issec)
> > +	if (issec) {
> >  		inode->i_flags &= ~S_NOSEC;
> > +
> > +		if (!strcmp(name, "security.capability")) {
> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
> > +					&wvalue, &wsize);
> > +			if (error < 0)
> > +				return error;
> > +			if (wvalue) {
> > +				value = wvalue;
> > +				size = wsize;
> > +			}
> > +		}
> > +	}
> > +
> > +	error = -EAGAIN;
> > +
> >  	if (inode->i_opflags & IOP_XATTR) {
> >  		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
> >  		if (!error) {
> > @@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  						     size, flags);
> >  		}
> >  	} else {
> > -		if (unlikely(is_bad_inode(inode)))
> > -			return -EIO;
> > +		if (unlikely(is_bad_inode(inode))) {
> > +			error = -EIO;
> > +			goto out;
> > +		}
> >  	}
> >  	if (error == -EAGAIN) {
> >  		error = -EOPNOTSUPP;
> > @@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  		}
> >  	}
> >  
> > +out:
> > +	kfree(wvalue);
> >  	return error;
> >  }
> >  
> > -
> >  int
> >  vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
> >  		size_t size, int flags)
> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> > index 6ffb67e..b973433 100644
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -13,7 +13,7 @@
> >  #define _LINUX_CAPABILITY_H
> >  
> >  #include <uapi/linux/capability.h>
> > -
> > +#include <linux/uidgid.h>
> >  
> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> > @@ -248,4 +248,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
> >  /* audit system wants to get cap info from files as well */
> >  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
> >  
> > +extern int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value,
> > +		size_t size, void **wvalue, size_t *wsize);
> > +
> >  #endif /* !_LINUX_CAPABILITY_H */
> > diff --git a/include/linux/security.h b/include/linux/security.h
> > index 96899fa..bd49cc1 100644
> > --- a/include/linux/security.h
> > +++ b/include/linux/security.h
> > @@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >  extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
> >  extern int cap_inode_need_killpriv(struct dentry *dentry);
> >  extern int cap_inode_killpriv(struct dentry *dentry);
> > +extern int cap_inode_getsecurity(struct inode *inode, const char *name,
> > +				 void **buffer, bool alloc);
> >  extern int cap_mmap_addr(unsigned long addr);
> >  extern int cap_mmap_file(struct file *file, unsigned long reqprot,
> >  			 unsigned long prot, unsigned long flags);
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index 49bc062..fd4f87d 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
> >  #define VFS_CAP_U32_2           2
> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >  
> > -#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> > -#define VFS_CAP_U32             VFS_CAP_U32_2
> > -#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
> > +#define VFS_CAP_REVISION_3	0x03000000
> > +#define VFS_CAP_U32_3           2
> > +#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
> > +
> > +#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
> > +#define VFS_CAP_U32             VFS_CAP_U32_3
> > +#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
> >  
> >  struct vfs_cap_data {
> >  	__le32 magic_etc;            /* Little endian */
> > @@ -72,6 +76,18 @@ struct vfs_cap_data {
> >  	} data[VFS_CAP_U32];
> >  };
> >  
> > +/*
> > + * same as vfs_cap_data but with a rootid at the end
> > + */
> > +struct vfs_ns_cap_data {
> > +	__le32 magic_etc;
> > +	struct {
> > +		__le32 permitted;    /* Little endian */
> > +		__le32 inheritable;  /* Little endian */
> > +	} data[VFS_CAP_U32];
> > +	__le32 rootid;
> > +};
> > +
> >  #ifndef __KERNEL__
> >  
> >  /*
> > diff --git a/security/commoncap.c b/security/commoncap.c
> > index 78b3783..8abb9bf 100644
> > --- a/security/commoncap.c
> > +++ b/security/commoncap.c
> > @@ -332,6 +332,179 @@ int cap_inode_killpriv(struct dentry *dentry)
> >  	return error;
> >  }
> >  
> > +static bool rootid_owns_currentns(kuid_t kroot)
> > +{
> > +	struct user_namespace *ns;
> > +
> > +	if (!uid_valid(kroot))
> > +		return false;
> > +
> > +	for (ns = current_user_ns(); ; ns = ns->parent) {
> > +		if (from_kuid(ns, kroot) == 0)
> > +			return true;
> > +		if (ns == &init_user_ns)
> > +			break;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * getsecurity: We are called for security.* before any attempt to read the
> > + * xattr from the inode itself.
> > + *
> > + * This gives us a chance to read the on-disk value and convert it.  If we
> > + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> > + *
> > + * Note we are not called by vfs_getxattr_alloc(), but that is only called
> > + * by the integrity subsystem, which really wants the unconverted values -
> > + * so that's good.
> > + */
> > +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> > +			  bool alloc)
> > +{
> > +	int size, ret;
> > +	kuid_t kroot;
> > +	uid_t root, mappedroot;
> > +	char *tmpbuf = NULL;
> > +	struct vfs_ns_cap_data *nscap;
> > +	struct dentry *dentry;
> > +	struct user_namespace *fs_ns;
> > +
> > +	if (strcmp(name, "capability") != 0)
> > +		return -EOPNOTSUPP;
> > +
> > +	dentry = d_find_alias(inode);
> > +	if (!dentry)
> > +		return -EINVAL;
> > +
> > +	size = sizeof(struct vfs_ns_cap_data);
> > +	ret = vfs_getxattr_alloc(dentry, "security.capability",
> > +				 &tmpbuf, size, GFP_NOFS);
> > +
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	fs_ns = inode->i_sb->s_user_ns;
> > +	if (ret == sizeof(struct vfs_cap_data)) {
> > +		/* If this is sizeof(vfs_cap_data) then we're ok with the
> > +		 * on-disk value, so return that.  */
> > +		if (alloc)
> > +			*buffer = tmpbuf;
> > +		else
> > +			kfree(tmpbuf);
> > +		return ret;
> > +	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
> > +		kfree(tmpbuf);
> > +		return -EINVAL;
> > +	}
> > +
> > +	nscap = (struct vfs_ns_cap_data *) tmpbuf;
> > +	root = le32_to_cpu(nscap->rootid);
> > +	kroot = make_kuid(fs_ns, root);
> > +
> > +	/* If the root kuid maps to a valid uid in current ns, then return
> > +	 * this as a nscap. */
> > +	mappedroot = from_kuid(current_user_ns(), kroot);
> > +	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> > +		if (alloc) {
> > +			*buffer = tmpbuf;
> > +			nscap->rootid = cpu_to_le32(mappedroot);
> > +		} else
> > +			kfree(tmpbuf);
> > +		return size;
> > +	}
> > +
> > +	if (!rootid_owns_currentns(kroot)) {
> > +		kfree(tmpbuf);
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	/* This comes from a parent namespace.  Return as a v2 capability */
> > +	size = sizeof(struct vfs_cap_data);
> > +	if (alloc) {
> > +		*buffer = kmalloc(size, GFP_ATOMIC);
> > +		if (*buffer) {
> > +			struct vfs_cap_data *cap = *buffer;
> > +			__le32 nsmagic, magic;
> > +			magic = VFS_CAP_REVISION_2;
> > +			nsmagic = le32_to_cpu(nscap->magic_etc);
> > +			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> > +				magic |= VFS_CAP_FLAGS_EFFECTIVE;
> > +			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> > +			cap->magic_etc = cpu_to_le32(magic);
> > +		}
> > +	}
> > +	kfree(tmpbuf);
> > +	return size;
> > +}
> > +
> > +static kuid_t rootid_from_xattr(const void *value, size_t size,
> > +				struct user_namespace *task_ns)
> > +{
> > +	const struct vfs_ns_cap_data *nscap = value;
> > +	uid_t rootid = 0;
> > +
> > +	if (size == XATTR_CAPS_SZ_3)
> > +		rootid = le32_to_cpu(nscap->rootid);
> > +
> > +	return make_kuid(task_ns, rootid);
> > +}
> > +
> > +/*
> > + * User requested a write of security.capability.
> > + *
> > + * If all is ok, we return 0.  If the capability needs to be converted,
> > + * wvalue will be allocated (and needs to be freed) with the new value.
> > + * On error, return < 0.
> > + */
> > +int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t size,
> > +				    void **wvalue, size_t *wsize)
> > +{
> > +	struct vfs_ns_cap_data *nscap;
> > +	uid_t nsrootid;
> > +	const struct vfs_cap_data *cap = value;
> > +	__u32 magic, nsmagic;
> > +	struct inode *inode = d_backing_inode(dentry);
> > +	struct user_namespace *task_ns = current_user_ns(),
> > +		*fs_ns = inode->i_sb->s_user_ns;
> > +	kuid_t rootid;
> > +
> > +	if (!value)
> > +		return -EINVAL;
> > +	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
> > +		return -EINVAL;
> > +	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> > +		return -EPERM;
> > +	if (size == XATTR_CAPS_SZ_2)
> > +		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
> > +			// user is privileged, just write the v2
> > +			return 0;
> > +
> > +	rootid = rootid_from_xattr(value, size, task_ns);
> > +	if (!uid_valid(rootid))
> > +		return -EINVAL;
> > +
> > +	nsrootid = from_kuid(fs_ns, rootid);
> > +	if (nsrootid == -1)
> > +		return -EINVAL;
> > +
> > +	*wsize = sizeof(struct vfs_ns_cap_data);
> > +	nscap = kmalloc(*wsize, GFP_ATOMIC);
> > +	if (!nscap)
> > +		return -ENOMEM;
> > +	nscap->rootid = cpu_to_le32(nsrootid);
> > +	nsmagic = VFS_CAP_REVISION_3;
> > +	magic = le32_to_cpu(cap->magic_etc);
> > +	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> > +		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> > +	nscap->magic_etc = cpu_to_le32(nsmagic);
> > +	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> > +
> > +	*wvalue = nscap;
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Calculate the new process capability sets from the capability sets attached
> >   * to a file.
> > @@ -385,7 +558,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  	__u32 magic_etc;
> >  	unsigned tocopy, i;
> >  	int size;
> > -	struct vfs_cap_data caps;
> > +	struct vfs_ns_cap_data data, *nscaps = &data;
> > +	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> > +	kuid_t rootkuid;
> > +	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
> >  
> >  	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >  
> > @@ -393,18 +569,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  		return -ENODATA;
> >  
> >  	size = __vfs_getxattr((struct dentry *)dentry, inode,
> > -			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
> > +			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
> >  	if (size == -ENODATA || size == -EOPNOTSUPP)
> >  		/* no data, that's ok */
> >  		return -ENODATA;
> > +
> >  	if (size < 0)
> >  		return size;
> >  
> >  	if (size < sizeof(magic_etc))
> >  		return -EINVAL;
> >  
> > -	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
> > +	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
> >  
> > +	rootkuid = make_kuid(fs_ns, 0);
> >  	switch (magic_etc & VFS_CAP_REVISION_MASK) {
> >  	case VFS_CAP_REVISION_1:
> >  		if (size != XATTR_CAPS_SZ_1)
> > @@ -416,15 +594,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  			return -EINVAL;
> >  		tocopy = VFS_CAP_U32_2;
> >  		break;
> > +	case VFS_CAP_REVISION_3:
> > +		if (size != XATTR_CAPS_SZ_3)
> > +			return -EINVAL;
> > +		tocopy = VFS_CAP_U32_3;
> > +		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> > +		break;
> > +
> >  	default:
> >  		return -EINVAL;
> >  	}
> > +	/* Limit the caps to the mounter of the filesystem
> > +	 * or the more limited uid specified in the xattr.
> > +	 */
> > +	if (!rootid_owns_currentns(rootkuid))
> > +		return -ENODATA;
> >  
> >  	CAP_FOR_EACH_U32(i) {
> >  		if (i >= tocopy)
> >  			break;
> > -		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
> > -		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
> > +		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
> > +		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
> >  	}
> >  
> >  	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> > @@ -462,8 +652,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >  	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >  	if (rc < 0) {
> >  		if (rc == -EINVAL)
> > -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> > -				__func__, rc, bprm->filename);
> > +			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> > +					bprm->filename);
> >  		else if (rc == -ENODATA)
> >  			rc = 0;
> >  		goto out;
> > @@ -660,15 +850,16 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
> >  int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >  		       const void *value, size_t size, int flags)
> >  {
> > -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> > -		if (!capable(CAP_SETFCAP))
> > -			return -EPERM;
> > +	/* Ignore non-security xattrs */
> > +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> > +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> > +		return 0;
> > +
> > +	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> >  		return 0;
> > -	}
> >  
> > -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> > -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> > -	    !capable(CAP_SYS_ADMIN))
> > +	if (!capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  	return 0;
> >  }
> > @@ -686,15 +877,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >   */
> >  int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >  {
> > -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> > -		if (!capable(CAP_SETFCAP))
> > +	/* Ignore non-security xattrs */
> > +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> > +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> > +		return 0;
> > +
> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
> > +		/* security.capability gets namespaced */
> > +		struct inode *inode = d_backing_inode(dentry);
> > +		if (!inode)
> > +			return -EINVAL;
> > +		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> >  			return -EPERM;
> >  		return 0;
> >  	}
> >  
> > -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> > -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> > -	    !capable(CAP_SYS_ADMIN))
> > +	if (!capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  	return 0;
> >  }
> > @@ -1082,6 +1280,7 @@ struct security_hook_list capability_hooks[] = {
> >  	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
> >  	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
> >  	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
> > +	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
> >  	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
> >  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
> >  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-21 21:50   ` Serge E. Hallyn
@ 2017-04-21 21:53     ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-21 21:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> 
>> Serge,
>> 
>> Is there any change of a Signed-off-by on this patch?  Otherwise I don't
>> think we can merge it.
>
> For pete's sake!  I'm sorry, i seem to remember with just about every
> other project other than this. particular. patch.
>
> Does this
>
> Signed-off-by: Serge Hallyn <serge@hallyn.com>
>
> suffice, or should I resend?

Good enough for me.  I figured it was an oversight but I had to check.

Eric

>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > Root in a non-initial user ns cannot be trusted to write a traditional
>> > security.capability xattr.  If it were allowed to do so, then any
>> > unprivileged user on the host could map his own uid to root in a private
>> > namespace, write the xattr, and execute the file with privilege on the
>> > host.
>> >
>> > However supporting file capabilities in a user namespace is very
>> > desirable.  Not doing so means that any programs designed to run with
>> > limited privilege must continue to support other methods of gaining and
>> > dropping privilege.  For instance a program installer must detect
>> > whether file capabilities can be assigned, and assign them if so but set
>> > setuid-root otherwise.  The program in turn must know how to drop
>> > partial capabilities, and do so only if setuid-root.
>> >
>> > This patch introduces v3 of the security.capability xattr.  It builds a
>> > vfs_ns_cap_data struct by appending a uid_t rootid to struct
>> > vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
>> > namespace which mounted the filesystem, usually init_user_ns) of the
>> > root id in whose namespaces the file capabilities may take effect.
>> >
>> > When a task asks to write a v2 security.capability xattr, if it is
>> > privileged with respect to the userns which mounted the filesystem, then
>> > nothing should change.  Otherwise, the kernel will transparently rewrite
>> > the xattr as a v3 with the appropriate rootid.  Subsequently, any task
>> > executing the file which has the noted kuid as its root uid, or which is
>> > in a descendent user_ns of such a user_ns, will run the file with
>> > capabilities.
>> >
>> > Similarly when asking to read file capabilities, a v3 capability will
>> > be presented as v2 if it applies to the caller's namespace.
>> >
>> > If a task writes a v3 security.capability, then it can provide a uid for
>> > the xattr so long as the uid is valid in its own user namespace, and it
>> > is privileged with CAP_SETFCAP over its namespace.  The kernel will
>> > translate that rootid to an absolute uid, and write that to disk.  After
>> > this, a task in the writer's namespace will not be able to use those
>> > capabilities (unless rootid was 0), but a task in a namespace where the
>> > given uid is root will.
>> >
>> > Only a single security.capability xattr may exist at a time for a given
>> > file.  A task may overwrite an existing xattr so long as it is
>> > privileged over the inode.  Note this is a departure from previous
>> > semantics, which required privilege to remove a security.capability
>> > xattr.  This check can be re-added if deemed useful.
>> >
>> > This allows a simple setcap/setxattr to work, should allow tar to work,
>> > and should allow us to support tar in one namespace and untar in another
>> > while preserving the capability, without risking leaking privilege into
>> > a parent namespace.
>> >
>> > A patch to linux-test-project adding a new set of tests for this
>> > functionality is in the nsfscaps branch at github.com/hallyn/ltp
>> >
>> > Changelog:
>> >    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
>> >    Nov 07 2016: convert rootid from and to fs user_ns
>> >    (From ebiederm: mar 28 2017)
>> >      commoncap.c: fix typos - s/v4/v3
>> >      get_vfs_caps_from_disk: clarify the fs_ns root access check
>> >      nsfscaps: change the code split for cap_inode_setxattr()
>> >    Apr 09 2017:
>> >        don't return v3 cap for caps owned by current root.
>> >       return a v2 cap for a true v2 cap in non-init ns
>> >    Apr 18 2017:
>> >       . Change the flow of fscap writing to support s_user_ns writing.
>> >       . Remove refuse_fcap_overwrite().  The value of the previous
>> >         xattr doesn't matter.
>> > ---
>> >  fs/xattr.c                      |  30 ++++-
>> >  include/linux/capability.h      |   5 +-
>> >  include/linux/security.h        |   2 +
>> >  include/uapi/linux/capability.h |  22 +++-
>> >  security/commoncap.c            | 237 ++++++++++++++++++++++++++++++++++++----
>> >  5 files changed, 268 insertions(+), 28 deletions(-)
>> >
>> > diff --git a/fs/xattr.c b/fs/xattr.c
>> > index 7e3317c..75cc65a 100644
>> > --- a/fs/xattr.c
>> > +++ b/fs/xattr.c
>> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >  		const void *value, size_t size, int flags)
>> >  {
>> >  	struct inode *inode = dentry->d_inode;
>> > -	int error = -EAGAIN;
>> > +	int error;
>> > +	void *wvalue = NULL;
>> > +	size_t wsize = 0;
>> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> >  				   XATTR_SECURITY_PREFIX_LEN);
>> >  
>> > -	if (issec)
>> > +	if (issec) {
>> >  		inode->i_flags &= ~S_NOSEC;
>> > +
>> > +		if (!strcmp(name, "security.capability")) {
>> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
>> > +					&wvalue, &wsize);
>> > +			if (error < 0)
>> > +				return error;
>> > +			if (wvalue) {
>> > +				value = wvalue;
>> > +				size = wsize;
>> > +			}
>> > +		}
>> > +	}
>> > +
>> > +	error = -EAGAIN;
>> > +
>> >  	if (inode->i_opflags & IOP_XATTR) {
>> >  		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
>> >  		if (!error) {
>> > @@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >  						     size, flags);
>> >  		}
>> >  	} else {
>> > -		if (unlikely(is_bad_inode(inode)))
>> > -			return -EIO;
>> > +		if (unlikely(is_bad_inode(inode))) {
>> > +			error = -EIO;
>> > +			goto out;
>> > +		}
>> >  	}
>> >  	if (error == -EAGAIN) {
>> >  		error = -EOPNOTSUPP;
>> > @@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >  		}
>> >  	}
>> >  
>> > +out:
>> > +	kfree(wvalue);
>> >  	return error;
>> >  }
>> >  
>> > -
>> >  int
>> >  vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
>> >  		size_t size, int flags)
>> > diff --git a/include/linux/capability.h b/include/linux/capability.h
>> > index 6ffb67e..b973433 100644
>> > --- a/include/linux/capability.h
>> > +++ b/include/linux/capability.h
>> > @@ -13,7 +13,7 @@
>> >  #define _LINUX_CAPABILITY_H
>> >  
>> >  #include <uapi/linux/capability.h>
>> > -
>> > +#include <linux/uidgid.h>
>> >  
>> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
>> > @@ -248,4 +248,7 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>> >  /* audit system wants to get cap info from files as well */
>> >  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>> >  
>> > +extern int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value,
>> > +		size_t size, void **wvalue, size_t *wsize);
>> > +
>> >  #endif /* !_LINUX_CAPABILITY_H */
>> > diff --git a/include/linux/security.h b/include/linux/security.h
>> > index 96899fa..bd49cc1 100644
>> > --- a/include/linux/security.h
>> > +++ b/include/linux/security.h
>> > @@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
>> >  extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
>> >  extern int cap_inode_need_killpriv(struct dentry *dentry);
>> >  extern int cap_inode_killpriv(struct dentry *dentry);
>> > +extern int cap_inode_getsecurity(struct inode *inode, const char *name,
>> > +				 void **buffer, bool alloc);
>> >  extern int cap_mmap_addr(unsigned long addr);
>> >  extern int cap_mmap_file(struct file *file, unsigned long reqprot,
>> >  			 unsigned long prot, unsigned long flags);
>> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
>> > index 49bc062..fd4f87d 100644
>> > --- a/include/uapi/linux/capability.h
>> > +++ b/include/uapi/linux/capability.h
>> > @@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
>> >  #define VFS_CAP_U32_2           2
>> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>> >  
>> > -#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
>> > -#define VFS_CAP_U32             VFS_CAP_U32_2
>> > -#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
>> > +#define VFS_CAP_REVISION_3	0x03000000
>> > +#define VFS_CAP_U32_3           2
>> > +#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
>> > +
>> > +#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
>> > +#define VFS_CAP_U32             VFS_CAP_U32_3
>> > +#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
>> >  
>> >  struct vfs_cap_data {
>> >  	__le32 magic_etc;            /* Little endian */
>> > @@ -72,6 +76,18 @@ struct vfs_cap_data {
>> >  	} data[VFS_CAP_U32];
>> >  };
>> >  
>> > +/*
>> > + * same as vfs_cap_data but with a rootid at the end
>> > + */
>> > +struct vfs_ns_cap_data {
>> > +	__le32 magic_etc;
>> > +	struct {
>> > +		__le32 permitted;    /* Little endian */
>> > +		__le32 inheritable;  /* Little endian */
>> > +	} data[VFS_CAP_U32];
>> > +	__le32 rootid;
>> > +};
>> > +
>> >  #ifndef __KERNEL__
>> >  
>> >  /*
>> > diff --git a/security/commoncap.c b/security/commoncap.c
>> > index 78b3783..8abb9bf 100644
>> > --- a/security/commoncap.c
>> > +++ b/security/commoncap.c
>> > @@ -332,6 +332,179 @@ int cap_inode_killpriv(struct dentry *dentry)
>> >  	return error;
>> >  }
>> >  
>> > +static bool rootid_owns_currentns(kuid_t kroot)
>> > +{
>> > +	struct user_namespace *ns;
>> > +
>> > +	if (!uid_valid(kroot))
>> > +		return false;
>> > +
>> > +	for (ns = current_user_ns(); ; ns = ns->parent) {
>> > +		if (from_kuid(ns, kroot) == 0)
>> > +			return true;
>> > +		if (ns == &init_user_ns)
>> > +			break;
>> > +	}
>> > +
>> > +	return false;
>> > +}
>> > +
>> > +/*
>> > + * getsecurity: We are called for security.* before any attempt to read the
>> > + * xattr from the inode itself.
>> > + *
>> > + * This gives us a chance to read the on-disk value and convert it.  If we
>> > + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
>> > + *
>> > + * Note we are not called by vfs_getxattr_alloc(), but that is only called
>> > + * by the integrity subsystem, which really wants the unconverted values -
>> > + * so that's good.
>> > + */
>> > +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>> > +			  bool alloc)
>> > +{
>> > +	int size, ret;
>> > +	kuid_t kroot;
>> > +	uid_t root, mappedroot;
>> > +	char *tmpbuf = NULL;
>> > +	struct vfs_ns_cap_data *nscap;
>> > +	struct dentry *dentry;
>> > +	struct user_namespace *fs_ns;
>> > +
>> > +	if (strcmp(name, "capability") != 0)
>> > +		return -EOPNOTSUPP;
>> > +
>> > +	dentry = d_find_alias(inode);
>> > +	if (!dentry)
>> > +		return -EINVAL;
>> > +
>> > +	size = sizeof(struct vfs_ns_cap_data);
>> > +	ret = vfs_getxattr_alloc(dentry, "security.capability",
>> > +				 &tmpbuf, size, GFP_NOFS);
>> > +
>> > +	if (ret < 0)
>> > +		return ret;
>> > +
>> > +	fs_ns = inode->i_sb->s_user_ns;
>> > +	if (ret == sizeof(struct vfs_cap_data)) {
>> > +		/* If this is sizeof(vfs_cap_data) then we're ok with the
>> > +		 * on-disk value, so return that.  */
>> > +		if (alloc)
>> > +			*buffer = tmpbuf;
>> > +		else
>> > +			kfree(tmpbuf);
>> > +		return ret;
>> > +	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
>> > +		kfree(tmpbuf);
>> > +		return -EINVAL;
>> > +	}
>> > +
>> > +	nscap = (struct vfs_ns_cap_data *) tmpbuf;
>> > +	root = le32_to_cpu(nscap->rootid);
>> > +	kroot = make_kuid(fs_ns, root);
>> > +
>> > +	/* If the root kuid maps to a valid uid in current ns, then return
>> > +	 * this as a nscap. */
>> > +	mappedroot = from_kuid(current_user_ns(), kroot);
>> > +	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
>> > +		if (alloc) {
>> > +			*buffer = tmpbuf;
>> > +			nscap->rootid = cpu_to_le32(mappedroot);
>> > +		} else
>> > +			kfree(tmpbuf);
>> > +		return size;
>> > +	}
>> > +
>> > +	if (!rootid_owns_currentns(kroot)) {
>> > +		kfree(tmpbuf);
>> > +		return -EOPNOTSUPP;
>> > +	}
>> > +
>> > +	/* This comes from a parent namespace.  Return as a v2 capability */
>> > +	size = sizeof(struct vfs_cap_data);
>> > +	if (alloc) {
>> > +		*buffer = kmalloc(size, GFP_ATOMIC);
>> > +		if (*buffer) {
>> > +			struct vfs_cap_data *cap = *buffer;
>> > +			__le32 nsmagic, magic;
>> > +			magic = VFS_CAP_REVISION_2;
>> > +			nsmagic = le32_to_cpu(nscap->magic_etc);
>> > +			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
>> > +				magic |= VFS_CAP_FLAGS_EFFECTIVE;
>> > +			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
>> > +			cap->magic_etc = cpu_to_le32(magic);
>> > +		}
>> > +	}
>> > +	kfree(tmpbuf);
>> > +	return size;
>> > +}
>> > +
>> > +static kuid_t rootid_from_xattr(const void *value, size_t size,
>> > +				struct user_namespace *task_ns)
>> > +{
>> > +	const struct vfs_ns_cap_data *nscap = value;
>> > +	uid_t rootid = 0;
>> > +
>> > +	if (size == XATTR_CAPS_SZ_3)
>> > +		rootid = le32_to_cpu(nscap->rootid);
>> > +
>> > +	return make_kuid(task_ns, rootid);
>> > +}
>> > +
>> > +/*
>> > + * User requested a write of security.capability.
>> > + *
>> > + * If all is ok, we return 0.  If the capability needs to be converted,
>> > + * wvalue will be allocated (and needs to be freed) with the new value.
>> > + * On error, return < 0.
>> > + */
>> > +int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t size,
>> > +				    void **wvalue, size_t *wsize)
>> > +{
>> > +	struct vfs_ns_cap_data *nscap;
>> > +	uid_t nsrootid;
>> > +	const struct vfs_cap_data *cap = value;
>> > +	__u32 magic, nsmagic;
>> > +	struct inode *inode = d_backing_inode(dentry);
>> > +	struct user_namespace *task_ns = current_user_ns(),
>> > +		*fs_ns = inode->i_sb->s_user_ns;
>> > +	kuid_t rootid;
>> > +
>> > +	if (!value)
>> > +		return -EINVAL;
>> > +	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
>> > +		return -EINVAL;
>> > +	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
>> > +		return -EPERM;
>> > +	if (size == XATTR_CAPS_SZ_2)
>> > +		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
>> > +			// user is privileged, just write the v2
>> > +			return 0;
>> > +
>> > +	rootid = rootid_from_xattr(value, size, task_ns);
>> > +	if (!uid_valid(rootid))
>> > +		return -EINVAL;
>> > +
>> > +	nsrootid = from_kuid(fs_ns, rootid);
>> > +	if (nsrootid == -1)
>> > +		return -EINVAL;
>> > +
>> > +	*wsize = sizeof(struct vfs_ns_cap_data);
>> > +	nscap = kmalloc(*wsize, GFP_ATOMIC);
>> > +	if (!nscap)
>> > +		return -ENOMEM;
>> > +	nscap->rootid = cpu_to_le32(nsrootid);
>> > +	nsmagic = VFS_CAP_REVISION_3;
>> > +	magic = le32_to_cpu(cap->magic_etc);
>> > +	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
>> > +		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
>> > +	nscap->magic_etc = cpu_to_le32(nsmagic);
>> > +	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
>> > +
>> > +	*wvalue = nscap;
>> > +	return 0;
>> > +}
>> > +
>> >  /*
>> >   * Calculate the new process capability sets from the capability sets attached
>> >   * to a file.
>> > @@ -385,7 +558,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>> >  	__u32 magic_etc;
>> >  	unsigned tocopy, i;
>> >  	int size;
>> > -	struct vfs_cap_data caps;
>> > +	struct vfs_ns_cap_data data, *nscaps = &data;
>> > +	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
>> > +	kuid_t rootkuid;
>> > +	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
>> >  
>> >  	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>> >  
>> > @@ -393,18 +569,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>> >  		return -ENODATA;
>> >  
>> >  	size = __vfs_getxattr((struct dentry *)dentry, inode,
>> > -			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
>> > +			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
>> >  	if (size == -ENODATA || size == -EOPNOTSUPP)
>> >  		/* no data, that's ok */
>> >  		return -ENODATA;
>> > +
>> >  	if (size < 0)
>> >  		return size;
>> >  
>> >  	if (size < sizeof(magic_etc))
>> >  		return -EINVAL;
>> >  
>> > -	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
>> > +	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
>> >  
>> > +	rootkuid = make_kuid(fs_ns, 0);
>> >  	switch (magic_etc & VFS_CAP_REVISION_MASK) {
>> >  	case VFS_CAP_REVISION_1:
>> >  		if (size != XATTR_CAPS_SZ_1)
>> > @@ -416,15 +594,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>> >  			return -EINVAL;
>> >  		tocopy = VFS_CAP_U32_2;
>> >  		break;
>> > +	case VFS_CAP_REVISION_3:
>> > +		if (size != XATTR_CAPS_SZ_3)
>> > +			return -EINVAL;
>> > +		tocopy = VFS_CAP_U32_3;
>> > +		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
>> > +		break;
>> > +
>> >  	default:
>> >  		return -EINVAL;
>> >  	}
>> > +	/* Limit the caps to the mounter of the filesystem
>> > +	 * or the more limited uid specified in the xattr.
>> > +	 */
>> > +	if (!rootid_owns_currentns(rootkuid))
>> > +		return -ENODATA;
>> >  
>> >  	CAP_FOR_EACH_U32(i) {
>> >  		if (i >= tocopy)
>> >  			break;
>> > -		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
>> > -		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
>> > +		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
>> > +		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
>> >  	}
>> >  
>> >  	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>> > @@ -462,8 +652,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>> >  	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>> >  	if (rc < 0) {
>> >  		if (rc == -EINVAL)
>> > -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
>> > -				__func__, rc, bprm->filename);
>> > +			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
>> > +					bprm->filename);
>> >  		else if (rc == -ENODATA)
>> >  			rc = 0;
>> >  		goto out;
>> > @@ -660,15 +850,16 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>> >  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>> >  		       const void *value, size_t size, int flags)
>> >  {
>> > -	if (!strcmp(name, XATTR_NAME_CAPS)) {
>> > -		if (!capable(CAP_SETFCAP))
>> > -			return -EPERM;
>> > +	/* Ignore non-security xattrs */
>> > +	if (strncmp(name, XATTR_SECURITY_PREFIX,
>> > +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
>> > +		return 0;
>> > +
>> > +	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
>> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>> >  		return 0;
>> > -	}
>> >  
>> > -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> > -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> > -	    !capable(CAP_SYS_ADMIN))
>> > +	if (!capable(CAP_SYS_ADMIN))
>> >  		return -EPERM;
>> >  	return 0;
>> >  }
>> > @@ -686,15 +877,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>> >   */
>> >  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>> >  {
>> > -	if (!strcmp(name, XATTR_NAME_CAPS)) {
>> > -		if (!capable(CAP_SETFCAP))
>> > +	/* Ignore non-security xattrs */
>> > +	if (strncmp(name, XATTR_SECURITY_PREFIX,
>> > +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
>> > +		return 0;
>> > +
>> > +	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
>> > +		/* security.capability gets namespaced */
>> > +		struct inode *inode = d_backing_inode(dentry);
>> > +		if (!inode)
>> > +			return -EINVAL;
>> > +		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
>> >  			return -EPERM;
>> >  		return 0;
>> >  	}
>> >  
>> > -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> > -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> > -	    !capable(CAP_SYS_ADMIN))
>> > +	if (!capable(CAP_SYS_ADMIN))
>> >  		return -EPERM;
>> >  	return 0;
>> >  }
>> > @@ -1082,6 +1280,7 @@ struct security_hook_list capability_hooks[] = {
>> >  	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
>> >  	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
>> >  	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
>> > +	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
>> >  	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
>> >  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
>> >  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-19 16:48 [PATCH] Introduce v3 namespaced file capabilities Serge E. Hallyn
  2017-04-21 21:37 ` Eric W. Biederman
@ 2017-04-21 23:58 ` Eric W. Biederman
  2017-04-22 15:14   ` Serge E. Hallyn
  1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-21 23:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan


"Serge E. Hallyn" <serge@hallyn.com> writes:

> Root in a non-initial user ns cannot be trusted to write a traditional
> security.capability xattr.  If it were allowed to do so, then any
> unprivileged user on the host could map his own uid to root in a private
> namespace, write the xattr, and execute the file with privilege on the
> host.
>
> However supporting file capabilities in a user namespace is very
> desirable.  Not doing so means that any programs designed to run with
> limited privilege must continue to support other methods of gaining and
> dropping privilege.  For instance a program installer must detect
> whether file capabilities can be assigned, and assign them if so but set
> setuid-root otherwise.  The program in turn must know how to drop
> partial capabilities, and do so only if setuid-root.
>
> This patch introduces v3 of the security.capability xattr.  It builds a
> vfs_ns_cap_data struct by appending a uid_t rootid to struct
> vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
> namespace which mounted the filesystem, usually init_user_ns) of the
> root id in whose namespaces the file capabilities may take effect.
>
> When a task asks to write a v2 security.capability xattr, if it is
> privileged with respect to the userns which mounted the filesystem, then
> nothing should change.  Otherwise, the kernel will transparently rewrite
> the xattr as a v3 with the appropriate rootid.  Subsequently, any task
> executing the file which has the noted kuid as its root uid, or which is
> in a descendent user_ns of such a user_ns, will run the file with
> capabilities.
>
> Similarly when asking to read file capabilities, a v3 capability will
> be presented as v2 if it applies to the caller's namespace.
>
> If a task writes a v3 security.capability, then it can provide a uid for
> the xattr so long as the uid is valid in its own user namespace, and it
> is privileged with CAP_SETFCAP over its namespace.  The kernel will
> translate that rootid to an absolute uid, and write that to disk.  After
> this, a task in the writer's namespace will not be able to use those
> capabilities (unless rootid was 0), but a task in a namespace where the
> given uid is root will.
>
> Only a single security.capability xattr may exist at a time for a given
> file.  A task may overwrite an existing xattr so long as it is
> privileged over the inode.  Note this is a departure from previous
> semantics, which required privilege to remove a security.capability
> xattr.  This check can be re-added if deemed useful.
>
> This allows a simple setcap/setxattr to work, should allow tar to work,
> and should allow us to support tar in one namespace and untar in another
> while preserving the capability, without risking leaking privilege into
> a parent namespace.
>
> A patch to linux-test-project adding a new set of tests for this
> functionality is in the nsfscaps branch at github.com/hallyn/ltp
>
> Changelog:
>    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
>    Nov 07 2016: convert rootid from and to fs user_ns
>    (From ebiederm: mar 28 2017)
>      commoncap.c: fix typos - s/v4/v3
>      get_vfs_caps_from_disk: clarify the fs_ns root access check
>      nsfscaps: change the code split for cap_inode_setxattr()
>    Apr 09 2017:
>        don't return v3 cap for caps owned by current root.
>       return a v2 cap for a true v2 cap in non-init ns
>    Apr 18 2017:
>       . Change the flow of fscap writing to support s_user_ns writing.
>       . Remove refuse_fcap_overwrite().  The value of the previous
>         xattr doesn't matter.

Overall this looks quite reasonable.

My only big concern was the lack of verifying of magic_etc.  As without
that the code might not be future compatible with new versions of the
capability xattrs.  It it tends to be nice to be able to boot old
kernels when regression testing etc.  Even if they can't make use of
all of the features of a new filesystem.

> diff --git a/fs/xattr.c b/fs/xattr.c
> index 7e3317c..75cc65a 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  		const void *value, size_t size, int flags)
>  {
>  	struct inode *inode = dentry->d_inode;
> -	int error = -EAGAIN;
> +	int error;
> +	void *wvalue = NULL;
> +	size_t wsize = 0;
>  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>  				   XATTR_SECURITY_PREFIX_LEN);
>  
> -	if (issec)
> +	if (issec) {
>  		inode->i_flags &= ~S_NOSEC;
> +
> +		if (!strcmp(name, "security.capability")) {
> +			error = cap_setxattr_convert_nscap(dentry, value, size,
> +					&wvalue, &wsize);
> +			if (error < 0)
> +				return error;
> +			if (wvalue) {
> +				value = wvalue;
> +				size = wsize;
> +			}
> +		}
> +	}
> +
> +	error = -EAGAIN;
> +

Why is the conversion in __vfs_setxattr_noperm and not in setattr as
was done for posix_acl_fix_xattr_from_user?


>  	if (inode->i_opflags & IOP_XATTR) {
>  		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
>  		if (!error) {
> @@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  						     size, flags);
>  		}
>  	} else {
> -		if (unlikely(is_bad_inode(inode)))
> -			return -EIO;
> +		if (unlikely(is_bad_inode(inode))) {
> +			error = -EIO;
> +			goto out;
> +		}
>  	}
>  	if (error == -EAGAIN) {
>  		error = -EOPNOTSUPP;
> @@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  		}
>  	}
>  
> +out:
> +	kfree(wvalue);
>  	return error;
>  }
>  
> -
>  int
>  vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
>  		size_t size, int flags)

The rest of my comments I am going to express as an incremental diff.
Using "security.capability" directly looks like a typo waiting to
happen.

You have an unnecessary include of linux/uidgid.h

Missing version checks on the magic_etc field.
And the wrong error code when the code deliberately refuses to return a
capability.

Eric

diff --git a/fs/xattr.c b/fs/xattr.c
index 75cc65ac7ea9..f6d5ce3e1030 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -179,7 +179,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 	if (issec) {
 		inode->i_flags &= ~S_NOSEC;
 
-		if (!strcmp(name, "security.capability")) {
+		if (!strcmp(name, XATTR_NAME_CAPS)) {
 			error = cap_setxattr_convert_nscap(dentry, value, size,
 					&wvalue, &wsize);
 			if (error < 0)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b97343353a11..c47febf8448b 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,6 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
diff --git a/security/commoncap.c b/security/commoncap.c
index 8abb9bf4ec17..32e32f437ef5 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -367,6 +367,7 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 	kuid_t kroot;
 	uid_t root, mappedroot;
 	char *tmpbuf = NULL;
+	struct vfs_cap_data *cap;
 	struct vfs_ns_cap_data *nscap;
 	struct dentry *dentry;
 	struct user_namespace *fs_ns;
@@ -379,14 +380,16 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 		return -EINVAL;
 
 	size = sizeof(struct vfs_ns_cap_data);
-	ret = vfs_getxattr_alloc(dentry, "security.capability",
+	ret = vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
 				 &tmpbuf, size, GFP_NOFS);
 
 	if (ret < 0)
 		return ret;
 
 	fs_ns = inode->i_sb->s_user_ns;
-	if (ret == sizeof(struct vfs_cap_data)) {
+	cap = (struct vfs_cap_data *) tmpbuf;
+	if ((ret == sizeof(struct vfs_cap_data)) &&
+	    (cap->magic_etc == cpu_to_le32(VFS_CAP_REVISION_2))) {
 		/* If this is sizeof(vfs_cap_data) then we're ok with the
 		 * on-disk value, so return that.  */
 		if (alloc)
@@ -394,7 +397,8 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 		else
 			kfree(tmpbuf);
 		return ret;
-	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
+	} else if ((ret != sizeof(struct vfs_ns_cap_data)) ||
+		    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_3))) {
 		kfree(tmpbuf);
 		return -EINVAL;
 	}
@@ -417,7 +421,7 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 
 	if (!rootid_owns_currentns(kroot)) {
 		kfree(tmpbuf);
-		return -EOPNOTSUPP;
+		return -ENODATA;
 	}
 
 	/* This comes from a parent namespace.  Return as a v2 capability */
@@ -474,11 +478,17 @@ int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t
 		return -EINVAL;
 	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
 		return -EINVAL;
+	if ((size == XATTR_CAPS_SZ_2) &&
+	    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_2)))
+		return -EINVAL;
+	if ((size == XATTR_CAPS_SZ_3) &&
+	    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_3)))
+		return -EINVAL;
 	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
 		return -EPERM;
 	if (size == XATTR_CAPS_SZ_2)
 		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
-			// user is privileged, just write the v2
+			/* user is privileged, just write the v2 */
 			return 0;
 
 	rootid = rootid_from_xattr(value, size, task_ns);
@@ -855,7 +865,10 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
 		return 0;
 
-	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
+	/*
+	 * For XATTR_NAME_CAPS the check will be done in
+	 * __vfs_setxattr_noperm()
+	 */
 	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-21 23:58 ` Eric W. Biederman
@ 2017-04-22 15:14   ` Serge E. Hallyn
  2017-04-23  1:14     ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2017-04-22 15:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Seth Forshee, lkml, linux-api,
	linux-security-module, Kees Cook, Andreas Gruenbacher,
	Andy Lutomirski, Andrew G. Morgan

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> Overall this looks quite reasonable.
> 
> My only big concern was the lack of verifying of magic_etc.  As without

Yes, I was relying too much on the size check.

> that the code might not be future compatible with new versions of the
> capability xattrs.  It it tends to be nice to be able to boot old
> kernels when regression testing etc.  Even if they can't make use of
> all of the features of a new filesystem.

That certainly was my intent.

> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 7e3317c..75cc65a 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  		const void *value, size_t size, int flags)
> >  {
> >  	struct inode *inode = dentry->d_inode;
> > -	int error = -EAGAIN;
> > +	int error;
> > +	void *wvalue = NULL;
> > +	size_t wsize = 0;
> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >  				   XATTR_SECURITY_PREFIX_LEN);
> >  
> > -	if (issec)
> > +	if (issec) {
> >  		inode->i_flags &= ~S_NOSEC;
> > +
> > +		if (!strcmp(name, "security.capability")) {
> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
> > +					&wvalue, &wsize);
> > +			if (error < 0)
> > +				return error;
> > +			if (wvalue) {
> > +				value = wvalue;
> > +				size = wsize;
> > +			}
> > +		}
> > +	}
> > +
> > +	error = -EAGAIN;
> > +
> 
> Why is the conversion in __vfs_setxattr_noperm and not in setattr as
> was done for posix_acl_fix_xattr_from_user?

I think I was thinking I wanted to catch all the vfs_setxattr operations,
but I don't think that's right.  Moving to setxattr seems right.  I'll
look around a bit more.

> >  	if (inode->i_opflags & IOP_XATTR) {
> >  		error = __vfs_setxattr(dentry, inode, name, value, size, flags);
> >  		if (!error) {
> > @@ -184,8 +201,10 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  						     size, flags);
> >  		}
> >  	} else {
> > -		if (unlikely(is_bad_inode(inode)))
> > -			return -EIO;
> > +		if (unlikely(is_bad_inode(inode))) {
> > +			error = -EIO;
> > +			goto out;
> > +		}
> >  	}
> >  	if (error == -EAGAIN) {
> >  		error = -EOPNOTSUPP;
> > @@ -200,10 +219,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  		}
> >  	}
> >  
> > +out:
> > +	kfree(wvalue);
> >  	return error;
> >  }
> >  
> > -
> >  int
> >  vfs_setxattr(struct dentry *dentry, const char *name, const void *value,
> >  		size_t size, int flags)
> 
> The rest of my comments I am going to express as an incremental diff.
> Using "security.capability" directly looks like a typo waiting to
> happen.
> 
> You have an unnecessary include of linux/uidgid.h
> 
> Missing version checks on the magic_etc field.
> And the wrong error code when the code deliberately refuses to return a
> capability.

Thanks, all looks good.  Did you want to just squash these yourself and
move on, keep them as separate patches, or have me incorporate into mine
and resend?

> Eric
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 75cc65ac7ea9..f6d5ce3e1030 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -179,7 +179,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  	if (issec) {
>  		inode->i_flags &= ~S_NOSEC;
>  
> -		if (!strcmp(name, "security.capability")) {
> +		if (!strcmp(name, XATTR_NAME_CAPS)) {
>  			error = cap_setxattr_convert_nscap(dentry, value, size,
>  					&wvalue, &wsize);
>  			if (error < 0)
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b97343353a11..c47febf8448b 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,7 +13,6 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include <uapi/linux/capability.h>
> -#include <linux/uidgid.h>
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 8abb9bf4ec17..32e32f437ef5 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -367,6 +367,7 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>  	kuid_t kroot;
>  	uid_t root, mappedroot;
>  	char *tmpbuf = NULL;
> +	struct vfs_cap_data *cap;
>  	struct vfs_ns_cap_data *nscap;
>  	struct dentry *dentry;
>  	struct user_namespace *fs_ns;
> @@ -379,14 +380,16 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>  		return -EINVAL;
>  
>  	size = sizeof(struct vfs_ns_cap_data);
> -	ret = vfs_getxattr_alloc(dentry, "security.capability",
> +	ret = vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
>  				 &tmpbuf, size, GFP_NOFS);
>  
>  	if (ret < 0)
>  		return ret;
>  
>  	fs_ns = inode->i_sb->s_user_ns;
> -	if (ret == sizeof(struct vfs_cap_data)) {
> +	cap = (struct vfs_cap_data *) tmpbuf;
> +	if ((ret == sizeof(struct vfs_cap_data)) &&
> +	    (cap->magic_etc == cpu_to_le32(VFS_CAP_REVISION_2))) {
>  		/* If this is sizeof(vfs_cap_data) then we're ok with the
>  		 * on-disk value, so return that.  */
>  		if (alloc)
> @@ -394,7 +397,8 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>  		else
>  			kfree(tmpbuf);
>  		return ret;
> -	} else if (ret != sizeof(struct vfs_ns_cap_data)) {
> +	} else if ((ret != sizeof(struct vfs_ns_cap_data)) ||
> +		    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_3))) {
>  		kfree(tmpbuf);
>  		return -EINVAL;
>  	}
> @@ -417,7 +421,7 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>  
>  	if (!rootid_owns_currentns(kroot)) {
>  		kfree(tmpbuf);
> -		return -EOPNOTSUPP;
> +		return -ENODATA;
>  	}
>  
>  	/* This comes from a parent namespace.  Return as a v2 capability */
> @@ -474,11 +478,17 @@ int cap_setxattr_convert_nscap(struct dentry *dentry, const void *value, size_t
>  		return -EINVAL;
>  	if (size != XATTR_CAPS_SZ_2 && size != XATTR_CAPS_SZ_3)
>  		return -EINVAL;
> +	if ((size == XATTR_CAPS_SZ_2) &&
> +	    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_2)))
> +		return -EINVAL;
> +	if ((size == XATTR_CAPS_SZ_3) &&
> +	    (cap->magic_etc != cpu_to_le32(VFS_CAP_REVISION_3)))
> +		return -EINVAL;
>  	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
>  		return -EPERM;
>  	if (size == XATTR_CAPS_SZ_2)
>  		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
> -			// user is privileged, just write the v2
> +			/* user is privileged, just write the v2 */
>  			return 0;
>  
>  	rootid = rootid_from_xattr(value, size, task_ns);
> @@ -855,7 +865,10 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
>  		return 0;
>  
> -	// For XATTR_NAME_CAPS the check will be done in __vfs_setxattr_noperm()
> +	/*
> +	 * For XATTR_NAME_CAPS the check will be done in
> +	 * __vfs_setxattr_noperm()
> +	 */
>  	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>  		return 0;
>  
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-22 15:14   ` Serge E. Hallyn
@ 2017-04-23  1:14     ` Eric W. Biederman
  2017-04-27 16:20       ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-23  1:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> 
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> Overall this looks quite reasonable.
>> 
>> My only big concern was the lack of verifying of magic_etc.  As without
>
> Yes, I was relying too much on the size check.
>
>> that the code might not be future compatible with new versions of the
>> capability xattrs.  It it tends to be nice to be able to boot old
>> kernels when regression testing etc.  Even if they can't make use of
>> all of the features of a new filesystem.
>
> That certainly was my intent.
>
>> > diff --git a/fs/xattr.c b/fs/xattr.c
>> > index 7e3317c..75cc65a 100644
>> > --- a/fs/xattr.c
>> > +++ b/fs/xattr.c
>> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >  		const void *value, size_t size, int flags)
>> >  {
>> >  	struct inode *inode = dentry->d_inode;
>> > -	int error = -EAGAIN;
>> > +	int error;
>> > +	void *wvalue = NULL;
>> > +	size_t wsize = 0;
>> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> >  				   XATTR_SECURITY_PREFIX_LEN);
>> >  
>> > -	if (issec)
>> > +	if (issec) {
>> >  		inode->i_flags &= ~S_NOSEC;
>> > +
>> > +		if (!strcmp(name, "security.capability")) {
>> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
>> > +					&wvalue, &wsize);
>> > +			if (error < 0)
>> > +				return error;
>> > +			if (wvalue) {
>> > +				value = wvalue;
>> > +				size = wsize;
>> > +			}
>> > +		}
>> > +	}
>> > +
>> > +	error = -EAGAIN;
>> > +
>> 
>> Why is the conversion in __vfs_setxattr_noperm and not in setattr as
>> was done for posix_acl_fix_xattr_from_user?
>
> I think I was thinking I wanted to catch all the vfs_setxattr operations,
> but I don't think that's right.  Moving to setxattr seems right.  I'll
> look around a bit more.

Thanks.  This is one of these little details that we want a good answer
to why there.  If you can document that in your patch description when
you resend I would appreciate it.

>> Missing version checks on the magic_etc field.
>> And the wrong error code when the code deliberately refuses to return a
>> capability.
>
> Thanks, all looks good.  Did you want to just squash these yourself and
> move on, keep them as separate patches, or have me incorporate into mine
> and resend?

Given that there is an outstanding question I would appreciate a resend
with an updated patch description, the changes squashed and possibly a
change in where cap_setxattr_convert_nscap is called.

I have the untested version on my for-testing branch and except for
a small increase in the binary size of the kernel all seems well.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-23  1:14     ` Eric W. Biederman
@ 2017-04-27 16:20       ` Eric W. Biederman
  2017-04-27 16:52         ` Serge E. Hallyn
  0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-27 16:20 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan

ebiederm@xmission.com (Eric W. Biederman) writes:

> "Serge E. Hallyn" <serge@hallyn.com> writes:
>
>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>> 
>>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>> 
>>> > diff --git a/fs/xattr.c b/fs/xattr.c
>>> > index 7e3317c..75cc65a 100644
>>> > --- a/fs/xattr.c
>>> > +++ b/fs/xattr.c
>>> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>>> >  		const void *value, size_t size, int flags)
>>> >  {
>>> >  	struct inode *inode = dentry->d_inode;
>>> > -	int error = -EAGAIN;
>>> > +	int error;
>>> > +	void *wvalue = NULL;
>>> > +	size_t wsize = 0;
>>> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>>> >  				   XATTR_SECURITY_PREFIX_LEN);
>>> >  
>>> > -	if (issec)
>>> > +	if (issec) {
>>> >  		inode->i_flags &= ~S_NOSEC;
>>> > +
>>> > +		if (!strcmp(name, "security.capability")) {
>>> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
>>> > +					&wvalue, &wsize);
>>> > +			if (error < 0)
>>> > +				return error;
>>> > +			if (wvalue) {
>>> > +				value = wvalue;
>>> > +				size = wsize;
>>> > +			}
>>> > +		}
>>> > +	}
>>> > +
>>> > +	error = -EAGAIN;
>>> > +
>>> 
>>> Why is the conversion in __vfs_setxattr_noperm and not in setattr as
>>> was done for posix_acl_fix_xattr_from_user?
>>
>> I think I was thinking I wanted to catch all the vfs_setxattr operations,
>> but I don't think that's right.  Moving to setxattr seems right.  I'll
>> look around a bit more.
>
> Thanks.  This is one of these little details that we want a good answer
> to why there.  If you can document that in your patch description when
> you resend I would appreciate it.

Ok. Grrr.

Looking at this a little more getting it correct where we call the
conversion operation is critical. 

I believe the current placement of cap_setxattr_convert_nscap is
actively wrong.  In particular unless I am misleading something this
will trigger multiple conversions when setting one of these attributes
on overlayfs.

The stragey I adopted for for posix acls is:

   On a write from userspace convert from current_user_ns() to &init_user_ns.
   On a write to the filesystem convert from &init_user_ns to fs_user_ns.
   
   On a read from the filesystem convert from fs_user_ns to &init_user_ns
   On a read from the kernel to userspace convert from &init_user_ns
      to current_user_ns().

Overall a good strategy but no one we can trivially adopt for the
capability xattr as the second write to filesystem method does not
appear to actually exist for anything except for posix acls.

I need to think a little more about how we want to accomplish this for
the capability xattr.  My apoligies for leading you down a path that has
all of these bumps and then being sufficiently distracted not to help
you through this maze.

The only easy solution I can see is to just always keep things in
&init_user_ns inside the kernel.   That works until we bring fuse or
other unprivileged mounts onboard that have storage outside of the
kernel.

Seth and I will have to rework that for fuse support but that sounds
better than not letting such an issue prevent us from merging the code.


Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-27 16:20       ` Eric W. Biederman
@ 2017-04-27 16:52         ` Serge E. Hallyn
  2017-04-27 17:00           ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2017-04-27 16:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Seth Forshee, lkml, linux-api,
	linux-security-module, Kees Cook, Andreas Gruenbacher,
	Andy Lutomirski, Andrew G. Morgan

Quoting Eric W. Biederman (ebiederm@xmission.com):
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> >
> >> Quoting Eric W. Biederman (ebiederm@xmission.com):
> >>> 
> >>> "Serge E. Hallyn" <serge@hallyn.com> writes:
> >>> 
> >>> > diff --git a/fs/xattr.c b/fs/xattr.c
> >>> > index 7e3317c..75cc65a 100644
> >>> > --- a/fs/xattr.c
> >>> > +++ b/fs/xattr.c
> >>> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >>> >  		const void *value, size_t size, int flags)
> >>> >  {
> >>> >  	struct inode *inode = dentry->d_inode;
> >>> > -	int error = -EAGAIN;
> >>> > +	int error;
> >>> > +	void *wvalue = NULL;
> >>> > +	size_t wsize = 0;
> >>> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >>> >  				   XATTR_SECURITY_PREFIX_LEN);
> >>> >  
> >>> > -	if (issec)
> >>> > +	if (issec) {
> >>> >  		inode->i_flags &= ~S_NOSEC;
> >>> > +
> >>> > +		if (!strcmp(name, "security.capability")) {
> >>> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
> >>> > +					&wvalue, &wsize);
> >>> > +			if (error < 0)
> >>> > +				return error;
> >>> > +			if (wvalue) {
> >>> > +				value = wvalue;
> >>> > +				size = wsize;
> >>> > +			}
> >>> > +		}
> >>> > +	}
> >>> > +
> >>> > +	error = -EAGAIN;
> >>> > +
> >>> 
> >>> Why is the conversion in __vfs_setxattr_noperm and not in setattr as
> >>> was done for posix_acl_fix_xattr_from_user?
> >>
> >> I think I was thinking I wanted to catch all the vfs_setxattr operations,
> >> but I don't think that's right.  Moving to setxattr seems right.  I'll
> >> look around a bit more.
> >
> > Thanks.  This is one of these little details that we want a good answer
> > to why there.  If you can document that in your patch description when
> > you resend I would appreciate it.
> 
> Ok. Grrr.
> 
> Looking at this a little more getting it correct where we call the
> conversion operation is critical. 
> 
> I believe the current placement of cap_setxattr_convert_nscap is
> actively wrong.  In particular unless I am misleading something this
> will trigger multiple conversions when setting one of these attributes
> on overlayfs.
> 
> The stragey I adopted for for posix acls is:
> 
>    On a write from userspace convert from current_user_ns() to &init_user_ns.
>    On a write to the filesystem convert from &init_user_ns to fs_user_ns.
>    
>    On a read from the filesystem convert from fs_user_ns to &init_user_ns
>    On a read from the kernel to userspace convert from &init_user_ns
>       to current_user_ns().
> 
> Overall a good strategy but no one we can trivially adopt for the
> capability xattr as the second write to filesystem method does not
> appear to actually exist for anything except for posix acls.
> 
> I need to think a little more about how we want to accomplish this for
> the capability xattr.  My apoligies for leading you down a path that has
> all of these bumps and then being sufficiently distracted not to help
> you through this maze.
> 
> The only easy solution I can see is to just always keep things in
> &init_user_ns inside the kernel.   That works until we bring fuse or
> other unprivileged mounts onboard that have storage outside of the
> kernel.
> 
> Seth and I will have to rework that for fuse support but that sounds
> better than not letting such an issue prevent us from merging the code.

Ok, in the meantime I've made a few updates in my tree which I think
make the code a lot nicer (and do move the conversion to setxattr()),
but there's a bug in that which I'm still trying to nail down.  I'll
send a new version when I get that figured, and we can see how close
to ok that is.

Note that upstream cap_inode_removexattr and cap_inode_setxattr()
upstream still don't respect the fs_user_ns properly either (the
proper code is in the Ubuntu kernel, maybe it's in your -next
tree, I don't know how you and Seth are coordinating that)

-serge

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Introduce v3 namespaced file capabilities
  2017-04-27 16:52         ` Serge E. Hallyn
@ 2017-04-27 17:00           ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2017-04-27 17:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Seth Forshee, lkml, linux-api, linux-security-module, Kees Cook,
	Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>> > "Serge E. Hallyn" <serge@hallyn.com> writes:
>> >
>> >> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >>> 
>> >>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> >>> 
>> >>> > diff --git a/fs/xattr.c b/fs/xattr.c
>> >>> > index 7e3317c..75cc65a 100644
>> >>> > --- a/fs/xattr.c
>> >>> > +++ b/fs/xattr.c
>> >>> > @@ -170,12 +170,29 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >>> >  		const void *value, size_t size, int flags)
>> >>> >  {
>> >>> >  	struct inode *inode = dentry->d_inode;
>> >>> > -	int error = -EAGAIN;
>> >>> > +	int error;
>> >>> > +	void *wvalue = NULL;
>> >>> > +	size_t wsize = 0;
>> >>> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> >>> >  				   XATTR_SECURITY_PREFIX_LEN);
>> >>> >  
>> >>> > -	if (issec)
>> >>> > +	if (issec) {
>> >>> >  		inode->i_flags &= ~S_NOSEC;
>> >>> > +
>> >>> > +		if (!strcmp(name, "security.capability")) {
>> >>> > +			error = cap_setxattr_convert_nscap(dentry, value, size,
>> >>> > +					&wvalue, &wsize);
>> >>> > +			if (error < 0)
>> >>> > +				return error;
>> >>> > +			if (wvalue) {
>> >>> > +				value = wvalue;
>> >>> > +				size = wsize;
>> >>> > +			}
>> >>> > +		}
>> >>> > +	}
>> >>> > +
>> >>> > +	error = -EAGAIN;
>> >>> > +
>> >>> 
>> >>> Why is the conversion in __vfs_setxattr_noperm and not in setattr as
>> >>> was done for posix_acl_fix_xattr_from_user?
>> >>
>> >> I think I was thinking I wanted to catch all the vfs_setxattr operations,
>> >> but I don't think that's right.  Moving to setxattr seems right.  I'll
>> >> look around a bit more.
>> >
>> > Thanks.  This is one of these little details that we want a good answer
>> > to why there.  If you can document that in your patch description when
>> > you resend I would appreciate it.
>> 
>> Ok. Grrr.
>> 
>> Looking at this a little more getting it correct where we call the
>> conversion operation is critical. 
>> 
>> I believe the current placement of cap_setxattr_convert_nscap is
>> actively wrong.  In particular unless I am misleading something this
>> will trigger multiple conversions when setting one of these attributes
>> on overlayfs.
>> 
>> The stragey I adopted for for posix acls is:
>> 
>>    On a write from userspace convert from current_user_ns() to &init_user_ns.
>>    On a write to the filesystem convert from &init_user_ns to fs_user_ns.
>>    
>>    On a read from the filesystem convert from fs_user_ns to &init_user_ns
>>    On a read from the kernel to userspace convert from &init_user_ns
>>       to current_user_ns().
>> 
>> Overall a good strategy but no one we can trivially adopt for the
>> capability xattr as the second write to filesystem method does not
>> appear to actually exist for anything except for posix acls.
>> 
>> I need to think a little more about how we want to accomplish this for
>> the capability xattr.  My apoligies for leading you down a path that has
>> all of these bumps and then being sufficiently distracted not to help
>> you through this maze.
>> 
>> The only easy solution I can see is to just always keep things in
>> &init_user_ns inside the kernel.   That works until we bring fuse or
>> other unprivileged mounts onboard that have storage outside of the
>> kernel.
>> 
>> Seth and I will have to rework that for fuse support but that sounds
>> better than not letting such an issue prevent us from merging the code.
>
> Ok, in the meantime I've made a few updates in my tree which I think
> make the code a lot nicer (and do move the conversion to setxattr()),
> but there's a bug in that which I'm still trying to nail down.  I'll
> send a new version when I get that figured, and we can see how close
> to ok that is.
>
> Note that upstream cap_inode_removexattr and cap_inode_setxattr()
> upstream still don't respect the fs_user_ns properly either (the
> proper code is in the Ubuntu kernel, maybe it's in your -next
> tree, I don't know how you and Seth are coordinating that)

Oh yes.  The relaxation of permissions.  I remember holding off
on that until we knew the core vfs work was done.  At this point I don't
think it is necessary to keep holding off.  It seemed prudent before we
got all of the s_user_ns bits used in all of the proper places.

At this point I think it was just worry about the last little vfs bits
has been challenging enough that we just haven't gotten too it.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-04-27 17:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-19 16:48 [PATCH] Introduce v3 namespaced file capabilities Serge E. Hallyn
2017-04-21 21:37 ` Eric W. Biederman
2017-04-21 21:50   ` Serge E. Hallyn
2017-04-21 21:53     ` Eric W. Biederman
2017-04-21 23:58 ` Eric W. Biederman
2017-04-22 15:14   ` Serge E. Hallyn
2017-04-23  1:14     ` Eric W. Biederman
2017-04-27 16:20       ` Eric W. Biederman
2017-04-27 16:52         ` Serge E. Hallyn
2017-04-27 17:00           ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).