All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] Introduce new security.nscapability xattr
@ 2015-11-30 22:43 Serge E. Hallyn
  2015-11-30 23:08   ` Eric W. Biederman
  2016-01-20 12:14 ` Jann Horn
  0 siblings, 2 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2015-11-30 22:43 UTC (permalink / raw)
  To: lkml, Andrew Morgan, Andy Lutomirski, Eric W. Biederman,
	lxc-devel, Richard Weinberger, LSM, linux-api, keescook

A common way for daemons to run with minimal privilege is to start as root,
perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
then change uid to non-root.  A simpler way to achieve this is to set file
capabilities on a not-setuid-root binary.  However, when installing a package
inside a (user-namespaced) container, packages cannot be installed with file
capabilities.  For this reason, containers must install ping setuid-root.

To achieve this, we would need for containers to be able to request file
capabilities be added to a file without causing these to be honored in the
initial user namespace.

To this end, the patch below introduces a new capability xattr format.  The
main enhancement over the existing security.capability xattr is that we
tag capability sets with a uid - the uid of the root user in the namespace
where the capabilities are set.  The capabilities will be ignored in any
other namespace.  The special case of uid == -1 (which must only ever be
able to be set by kuid 0) means use the capabilities in all namespaces.

An alternative format would use a pair of uids to indicate a range of rootids.
This would allow root in a user namespace with uids 100000-165536 mapped to
set the xattr once on a file, then launch nested containers wherein the file
could be used with privilege.  That's not what this patch does, but would be
a trivial change if people think it would be worthwhile.

This patch does not actually address the real problem, which is setting the
xattrs from inside containers.  For that, I think the best solution is to
add a pair of new system calls, setfcap and getfcap. Userspace would for
instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
the kernel would, if not in init_user_ns, react by writing an appropriate
security.nscapability xattr.

The libcap2 library's cap_set_file/cap_get_file could be switched over
transparently to use this to hide its use from all callers.

Comments appreciated.

Note - In this patch, file capabilities only work for containers which have
a root uid defined.  We may want to allow -1 uids to work in all
namespaces.  There certainly would be uses for this, but I'm a bit unsettled
about the implications of allowing a program privilege in a container where
there is no uid with privilege.  This needs more thought.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 include/linux/capability.h      |  15 +++++
 include/uapi/linux/capability.h |  47 ++++++++++++++
 include/uapi/linux/xattr.h      |   3 +
 security/commoncap.c            | 135 ++++++++++++++++++++++++++++++++++++++--
 4 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index af9f0b9..24ac18e 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,6 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
+#include <linux/uidgid.h>
 
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
@@ -31,6 +32,20 @@ struct cpu_vfs_cap_data {
 	kernel_cap_t inheritable;
 };
 
+struct cpu_vfs_ns_cap_data {
+	__u32 flags;
+	kuid_t rootid;
+	kernel_cap_t permitted;
+	kernel_cap_t inheritable;
+};
+
+struct cpu_vfs_ns_cap_header {
+	__u32 hdr_info;
+	struct cpu_vfs_ns_cap_data caps[0];
+};
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_NCAPS(x) ( (x >> 8) & 0xFF )
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..2211a33 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,10 +62,14 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION     1
+
 #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
 #define VFS_CAP_U32             VFS_CAP_U32_2
 #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
 
+
 struct vfs_cap_data {
 	__le32 magic_etc;            /* Little endian */
 	struct {
@@ -74,6 +78,49 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+/*
+ * Q: do we want version in the header, or in the data?
+ * If it is in the header, then a container will need to
+ * make sure it is writing the same data.
+ *
+ * Actually, perhaps we simply do not support writing the
+ * xattr, we just use a new system call to get/set the fscap.
+ * The kernel can be in charge of watching the version numbers.
+ * After all, we can't allow the container to override the
+ * fscaps of the init ns.
+ *
+ * @flags currently only containers the effective bit.  The
+ * other bits are reserved, and must be 0 at the moment.
+ * @rootid contains the kuid value of the root in the namespace
+ * for which this capability should be used.  If -1, then this
+ * works for all namespaces.  Only root in the initial ns can
+ * use this.
+ *
+ * Q: do we want to use a range instead?  Then root in a container
+ * could allow one binary with one capability to be used by any
+ * nested containers.
+ */
+#define VFS_NS_CAP_EFFECTIVE    0x1
+struct vfs_ns_cap_data {
+	__le32 flags;
+	__le32 rootid;
+	struct {
+		__le32 permitted;    /* Little endian */
+		__le32 inheritable;  /* Little endian */
+	} data[VFS_CAP_U32];
+};
+
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: ncaps
+ * last 8: version
+ */
+struct vfs_ns_cap_header {
+	__le32 hdr_info;
+	/* ncaps * vfs_ns_cap_data */
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 1832cf7..c44edf3 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -308,6 +308,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 	if (!inode->i_op->getxattr)
 	       return 0;
 
+	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+	if (error > 0)
+		return 1;
+
 	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
 	if (error <= 0)
 		return 0;
@@ -325,11 +329,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	int ret1, ret2;;
 
 	if (!inode->i_op->removexattr)
 	       return 0;
 
-	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+	if (ret1 != 0)
+		return ret1;
+	return ret2;
 }
 
 /*
@@ -433,6 +443,117 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	unsigned tocopy, i;
+	int ret = 0, size, expected;
+	unsigned len = 0;
+	struct vfs_ns_cap_header *hdr;
+	struct vfs_ns_cap_data *cap, *nscap = NULL;
+	__u16 ncaps, version;
+	__u32 hdr_info;
+	kuid_t current_root, caprootuid;
+
+	memset(cpu_caps, 0, sizeof(*cpu_caps));
+
+	if (!inode || !inode->i_op->getxattr)
+		return -ENODATA;
+
+	/* get the size */
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+			NULL, 0);
+	if (size == -ENODATA || size == -EOPNOTSUPP)
+		/* no data, that's ok */
+		return -ENODATA;
+	if (size < 0)
+		return size;
+	if (size < sizeof(struct cpu_vfs_ns_cap_header))
+		return -EINVAL;
+	if (size > sizeof(struct cpu_vfs_ns_cap_header) + 255 * sizeof(struct vfs_ns_cap_data))
+		return -EINVAL;
+	len = size;
+
+	hdr = kmalloc(len + 1, GFP_NOFS);
+	if (!hdr)
+		return -ENOMEM;
+
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS, hdr,
+				   len);
+	if (size < 0) {
+		ret = size;
+		goto out;
+	}
+
+	if (size != len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	hdr_info = le32_to_cpu(hdr->hdr_info);
+	version = NS_CAPS_VERSION(hdr_info);
+	ncaps = NS_CAPS_NCAPS(hdr_info);
+
+	if (version != VFS_NS_CAP_REVISION) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	expected = sizeof(*hdr) + ncaps * sizeof(*cap);
+	if (size != expected) {
+		ret = -EINVAL;
+		goto out;
+	}
+	tocopy = VFS_CAP_U32;
+
+	/* find an applicable entry */
+	/* a global entry (uid == -1) takes precedence */
+	current_root = make_kuid(current_user_ns(), 0);
+	if (!uid_valid(current_root)) {
+		/* no root user in this namespace;  no capabilities */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (i = 0, cap = (void *) hdr + sizeof(*hdr); i < ncaps; cap += sizeof(*cap), i++) {
+		uid_t uid = le32_to_cpu(cap->rootid);
+		if (uid == -1) {
+			nscap = cap;
+			break;
+		}
+
+		caprootuid = make_kuid(&init_user_ns, uid);
+		if (uid_eq(caprootuid, current_root))
+			nscap = cap;
+	}
+
+	if (!nscap) {
+		/* nothing found for this namespace */
+		ret = -ENODATA;
+		goto out;
+	}
+
+	/* copy the entry */
+	CAP_FOR_EACH_U32(i) {
+		if (i >= tocopy)
+			break;
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap->data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap->data[i].inheritable);
+	}
+
+	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+	if (nscap->flags & VFS_NS_CAP_EFFECTIVE)
+		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+
+out:
+	kfree(hdr);
+
+	return ret;
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -451,11 +572,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
 		return 0;
 
-	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	if (rc == -ENODATA)
+		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Got EINVAL reading file caps for %s\n",
+				bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -651,7 +774,7 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
+	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
 		if (!capable(CAP_SETFCAP))
 			return -EPERM;
 		return 0;
@@ -677,7 +800,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
+	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
 		if (!capable(CAP_SETFCAP))
 			return -EPERM;
 		return 0;
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2015-11-30 23:08   ` Eric W. Biederman
  0 siblings, 0 replies; 19+ messages in thread
From: Eric W. Biederman @ 2015-11-30 23:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: lkml, Andrew Morgan, Andy Lutomirski, lxc-devel,
	Richard Weinberger, LSM, linux-api, keescook

"Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:

> A common way for daemons to run with minimal privilege is to start as root,
> perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> then change uid to non-root.  A simpler way to achieve this is to set file
> capabilities on a not-setuid-root binary.  However, when installing a package
> inside a (user-namespaced) container, packages cannot be installed with file
> capabilities.  For this reason, containers must install ping setuid-root.

Don't ping sockets avoid that specific problem?

I expect the general case still holds.

> To achieve this, we would need for containers to be able to request file
> capabilities be added to a file without causing these to be honored in the
> initial user namespace.
>
> To this end, the patch below introduces a new capability xattr format.  The
> main enhancement over the existing security.capability xattr is that we
> tag capability sets with a uid - the uid of the root user in the namespace
> where the capabilities are set.  The capabilities will be ignored in any
> other namespace.  The special case of uid == -1 (which must only ever be
> able to be set by kuid 0) means use the capabilities in all
> namespaces.

A quick comment on this.

We currently allow capabilities that have been gained to be valid in all
descendent user namespaces.

Applying this principle to the on-disk capabilities would make it so
that uid 0 would mean capabilities in all namespaces.

It might be worth it to introduce a fixed sized array with a length
parameter of perhaps 32 entries which is a path of root uids as seen by
the initial user namespace.  That way the entire construction of the
user namespace could be verified.  AKA verify the current user namespace
and the parent and the parents parent.  Up to the user namespace the
current filesystem is mounted in. We would look at how much space
allows an xattr to be stored without causing filesystems a challenge
to properly size such an array.

Given that uids are fundamentally flat that might not be particularly
useful.  If we add an alternative way of identifying user namespaces
say a privileged operation that set a uuid, then the complete path would
be more interesting.

> An alternative format would use a pair of uids to indicate a range of rootids.
> This would allow root in a user namespace with uids 100000-165536 mapped to
> set the xattr once on a file, then launch nested containers wherein the file
> could be used with privilege.  That's not what this patch does, but would be
> a trivial change if people think it would be worthwhile.
>
> This patch does not actually address the real problem, which is setting the
> xattrs from inside containers.  For that, I think the best solution is to
> add a pair of new system calls, setfcap and getfcap. Userspace would for
> instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> the kernel would, if not in init_user_ns, react by writing an appropriate
> security.nscapability xattr.

That feels hard to maintain, but you may be correct that we have a small
enough userspace that it would not be a problem.

Eric


> The libcap2 library's cap_set_file/cap_get_file could be switched over
> transparently to use this to hide its use from all callers.
>
> Comments appreciated.
>
> Note - In this patch, file capabilities only work for containers which have
> a root uid defined.  We may want to allow -1 uids to work in all
> namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> about the implications of allowing a program privilege in a container where
> there is no uid with privilege.  This needs more thought.
>
> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> ---
>  include/linux/capability.h      |  15 +++++
>  include/uapi/linux/capability.h |  47 ++++++++++++++
>  include/uapi/linux/xattr.h      |   3 +
>  security/commoncap.c            | 135 ++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 194 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index af9f0b9..24ac18e 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,6 +13,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include <uapi/linux/capability.h>
> +#include <linux/uidgid.h>
>  
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> @@ -31,6 +32,20 @@ struct cpu_vfs_cap_data {
>  	kernel_cap_t inheritable;
>  };
>  
> +struct cpu_vfs_ns_cap_data {
> +	__u32 flags;
> +	kuid_t rootid;
> +	kernel_cap_t permitted;
> +	kernel_cap_t inheritable;
> +};
> +
> +struct cpu_vfs_ns_cap_header {
> +	__u32 hdr_info;
> +	struct cpu_vfs_ns_cap_data caps[0];
> +};
> +#define NS_CAPS_VERSION(x) (x & 0xFF)
> +#define NS_CAPS_NCAPS(x) ( (x >> 8) & 0xFF )
> +
>  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
>  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 12c37a1..2211a33 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -62,10 +62,14 @@ typedef struct __user_cap_data_struct {
>  #define VFS_CAP_U32_2           2
>  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>  
> +/* version number for security.nscapability xattrs hdr->hdr_info */
> +#define VFS_NS_CAP_REVISION     1
> +
>  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
>  #define VFS_CAP_U32             VFS_CAP_U32_2
>  #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
>  
> +
>  struct vfs_cap_data {
>  	__le32 magic_etc;            /* Little endian */
>  	struct {
> @@ -74,6 +78,49 @@ struct vfs_cap_data {
>  	} data[VFS_CAP_U32];
>  };
>  
> +/*
> + * Q: do we want version in the header, or in the data?
> + * If it is in the header, then a container will need to
> + * make sure it is writing the same data.
> + *
> + * Actually, perhaps we simply do not support writing the
> + * xattr, we just use a new system call to get/set the fscap.
> + * The kernel can be in charge of watching the version numbers.
> + * After all, we can't allow the container to override the
> + * fscaps of the init ns.
> + *
> + * @flags currently only containers the effective bit.  The
> + * other bits are reserved, and must be 0 at the moment.
> + * @rootid contains the kuid value of the root in the namespace
> + * for which this capability should be used.  If -1, then this
> + * works for all namespaces.  Only root in the initial ns can
> + * use this.
> + *
> + * Q: do we want to use a range instead?  Then root in a container
> + * could allow one binary with one capability to be used by any
> + * nested containers.
> + */
> +#define VFS_NS_CAP_EFFECTIVE    0x1
> +struct vfs_ns_cap_data {
> +	__le32 flags;
> +	__le32 rootid;
> +	struct {
> +		__le32 permitted;    /* Little endian */
> +		__le32 inheritable;  /* Little endian */
> +	} data[VFS_CAP_U32];
> +};
> +
> +/*
> + * 32-bit hdr_info contains
> + * 16 leftmost: reserved
> + * next 8: ncaps
> + * last 8: version
> + */
> +struct vfs_ns_cap_header {
> +	__le32 hdr_info;
> +	/* ncaps * vfs_ns_cap_data */
> +};
> +
>  #ifndef __KERNEL__
>  
>  /*
> diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> index 1590c49..67c80ab 100644
> --- a/include/uapi/linux/xattr.h
> +++ b/include/uapi/linux/xattr.h
> @@ -68,6 +68,9 @@
>  #define XATTR_CAPS_SUFFIX "capability"
>  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>  
> +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> +
>  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
>  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
>  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 1832cf7..c44edf3 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -308,6 +308,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>  	if (!inode->i_op->getxattr)
>  	       return 0;
>  
> +	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> +	if (error > 0)
> +		return 1;
> +
>  	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
>  	if (error <= 0)
>  		return 0;
> @@ -325,11 +329,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>  int cap_inode_killpriv(struct dentry *dentry)
>  {
>  	struct inode *inode = d_backing_inode(dentry);
> +	int ret1, ret2;;
>  
>  	if (!inode->i_op->removexattr)
>  	       return 0;
>  
> -	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> +
> +	if (ret1 != 0)
> +		return ret1;
> +	return ret2;
>  }
>  
>  /*
> @@ -433,6 +443,117 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>  	return 0;
>  }
>  
> +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> +{
> +	struct inode *inode = d_backing_inode(dentry);
> +	unsigned tocopy, i;
> +	int ret = 0, size, expected;
> +	unsigned len = 0;
> +	struct vfs_ns_cap_header *hdr;
> +	struct vfs_ns_cap_data *cap, *nscap = NULL;
> +	__u16 ncaps, version;
> +	__u32 hdr_info;
> +	kuid_t current_root, caprootuid;
> +
> +	memset(cpu_caps, 0, sizeof(*cpu_caps));
> +
> +	if (!inode || !inode->i_op->getxattr)
> +		return -ENODATA;
> +
> +	/* get the size */
> +	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> +			NULL, 0);
> +	if (size == -ENODATA || size == -EOPNOTSUPP)
> +		/* no data, that's ok */
> +		return -ENODATA;
> +	if (size < 0)
> +		return size;
> +	if (size < sizeof(struct cpu_vfs_ns_cap_header))
> +		return -EINVAL;
> +	if (size > sizeof(struct cpu_vfs_ns_cap_header) + 255 * sizeof(struct vfs_ns_cap_data))
> +		return -EINVAL;
> +	len = size;
> +
> +	hdr = kmalloc(len + 1, GFP_NOFS);
> +	if (!hdr)
> +		return -ENOMEM;
> +
> +	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS, hdr,
> +				   len);
> +	if (size < 0) {
> +		ret = size;
> +		goto out;
> +	}
> +
> +	if (size != len) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	hdr_info = le32_to_cpu(hdr->hdr_info);
> +	version = NS_CAPS_VERSION(hdr_info);
> +	ncaps = NS_CAPS_NCAPS(hdr_info);
> +
> +	if (version != VFS_NS_CAP_REVISION) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	expected = sizeof(*hdr) + ncaps * sizeof(*cap);
> +	if (size != expected) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	tocopy = VFS_CAP_U32;
> +
> +	/* find an applicable entry */
> +	/* a global entry (uid == -1) takes precedence */
> +	current_root = make_kuid(current_user_ns(), 0);
> +	if (!uid_valid(current_root)) {
> +		/* no root user in this namespace;  no capabilities */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	for (i = 0, cap = (void *) hdr + sizeof(*hdr); i < ncaps; cap += sizeof(*cap), i++) {
> +		uid_t uid = le32_to_cpu(cap->rootid);
> +		if (uid == -1) {
> +			nscap = cap;
> +			break;
> +		}
> +
> +		caprootuid = make_kuid(&init_user_ns, uid);
> +		if (uid_eq(caprootuid, current_root))
> +			nscap = cap;
> +	}
> +
> +	if (!nscap) {
> +		/* nothing found for this namespace */
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +
> +	/* copy the entry */
> +	CAP_FOR_EACH_U32(i) {
> +		if (i >= tocopy)
> +			break;
> +		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap->data[i].permitted);
> +		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap->data[i].inheritable);
> +	}
> +
> +	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +
> +	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> +	if (nscap->flags & VFS_NS_CAP_EFFECTIVE)
> +		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> +
> +out:
> +	kfree(hdr);
> +
> +	return ret;
> +}
> +
>  /*
>   * Attempt to get the on-exec apply capability sets for an executable file from
>   * its xattrs and, if present, apply them to the proposed credentials being
> @@ -451,11 +572,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>  	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
>  		return 0;
>  
> -	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +	if (rc == -ENODATA)
> +		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>  	if (rc < 0) {
>  		if (rc == -EINVAL)
> -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> -				__func__, rc, bprm->filename);
> +			printk(KERN_NOTICE "Got EINVAL reading file caps for %s\n",
> +				bprm->filename);
>  		else if (rc == -ENODATA)
>  			rc = 0;
>  		goto out;
> @@ -651,7 +774,7 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  		       const void *value, size_t size, int flags)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> +	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
>  		if (!capable(CAP_SETFCAP))
>  			return -EPERM;
>  		return 0;
> @@ -677,7 +800,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> +	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
>  		if (!capable(CAP_SETFCAP))
>  			return -EPERM;
>  		return 0;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2015-11-30 23:08   ` Eric W. Biederman
  0 siblings, 0 replies; 19+ messages in thread
From: Eric W. Biederman @ 2015-11-30 23:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: lkml, Andrew Morgan, Andy Lutomirski,
	lxc-devel-cunTk1MwBs9qMoObBWhMNEqPaTDuhLve2LY78lusg7I,
	Richard Weinberger, LSM, linux-api-u79uwXL29TY76Z2rM5mHXA,
	keescook-F7+t8E8rja9g9hUCZPvPmw

"Serge E. Hallyn" <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:

> A common way for daemons to run with minimal privilege is to start as root,
> perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> then change uid to non-root.  A simpler way to achieve this is to set file
> capabilities on a not-setuid-root binary.  However, when installing a package
> inside a (user-namespaced) container, packages cannot be installed with file
> capabilities.  For this reason, containers must install ping setuid-root.

Don't ping sockets avoid that specific problem?

I expect the general case still holds.

> To achieve this, we would need for containers to be able to request file
> capabilities be added to a file without causing these to be honored in the
> initial user namespace.
>
> To this end, the patch below introduces a new capability xattr format.  The
> main enhancement over the existing security.capability xattr is that we
> tag capability sets with a uid - the uid of the root user in the namespace
> where the capabilities are set.  The capabilities will be ignored in any
> other namespace.  The special case of uid == -1 (which must only ever be
> able to be set by kuid 0) means use the capabilities in all
> namespaces.

A quick comment on this.

We currently allow capabilities that have been gained to be valid in all
descendent user namespaces.

Applying this principle to the on-disk capabilities would make it so
that uid 0 would mean capabilities in all namespaces.

It might be worth it to introduce a fixed sized array with a length
parameter of perhaps 32 entries which is a path of root uids as seen by
the initial user namespace.  That way the entire construction of the
user namespace could be verified.  AKA verify the current user namespace
and the parent and the parents parent.  Up to the user namespace the
current filesystem is mounted in. We would look at how much space
allows an xattr to be stored without causing filesystems a challenge
to properly size such an array.

Given that uids are fundamentally flat that might not be particularly
useful.  If we add an alternative way of identifying user namespaces
say a privileged operation that set a uuid, then the complete path would
be more interesting.

> An alternative format would use a pair of uids to indicate a range of rootids.
> This would allow root in a user namespace with uids 100000-165536 mapped to
> set the xattr once on a file, then launch nested containers wherein the file
> could be used with privilege.  That's not what this patch does, but would be
> a trivial change if people think it would be worthwhile.
>
> This patch does not actually address the real problem, which is setting the
> xattrs from inside containers.  For that, I think the best solution is to
> add a pair of new system calls, setfcap and getfcap. Userspace would for
> instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> the kernel would, if not in init_user_ns, react by writing an appropriate
> security.nscapability xattr.

That feels hard to maintain, but you may be correct that we have a small
enough userspace that it would not be a problem.

Eric


> The libcap2 library's cap_set_file/cap_get_file could be switched over
> transparently to use this to hide its use from all callers.
>
> Comments appreciated.
>
> Note - In this patch, file capabilities only work for containers which have
> a root uid defined.  We may want to allow -1 uids to work in all
> namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> about the implications of allowing a program privilege in a container where
> there is no uid with privilege.  This needs more thought.
>
> Signed-off-by: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>  include/linux/capability.h      |  15 +++++
>  include/uapi/linux/capability.h |  47 ++++++++++++++
>  include/uapi/linux/xattr.h      |   3 +
>  security/commoncap.c            | 135 ++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 194 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index af9f0b9..24ac18e 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,6 +13,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include <uapi/linux/capability.h>
> +#include <linux/uidgid.h>
>  
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> @@ -31,6 +32,20 @@ struct cpu_vfs_cap_data {
>  	kernel_cap_t inheritable;
>  };
>  
> +struct cpu_vfs_ns_cap_data {
> +	__u32 flags;
> +	kuid_t rootid;
> +	kernel_cap_t permitted;
> +	kernel_cap_t inheritable;
> +};
> +
> +struct cpu_vfs_ns_cap_header {
> +	__u32 hdr_info;
> +	struct cpu_vfs_ns_cap_data caps[0];
> +};
> +#define NS_CAPS_VERSION(x) (x & 0xFF)
> +#define NS_CAPS_NCAPS(x) ( (x >> 8) & 0xFF )
> +
>  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
>  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 12c37a1..2211a33 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -62,10 +62,14 @@ typedef struct __user_cap_data_struct {
>  #define VFS_CAP_U32_2           2
>  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>  
> +/* version number for security.nscapability xattrs hdr->hdr_info */
> +#define VFS_NS_CAP_REVISION     1
> +
>  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
>  #define VFS_CAP_U32             VFS_CAP_U32_2
>  #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
>  
> +
>  struct vfs_cap_data {
>  	__le32 magic_etc;            /* Little endian */
>  	struct {
> @@ -74,6 +78,49 @@ struct vfs_cap_data {
>  	} data[VFS_CAP_U32];
>  };
>  
> +/*
> + * Q: do we want version in the header, or in the data?
> + * If it is in the header, then a container will need to
> + * make sure it is writing the same data.
> + *
> + * Actually, perhaps we simply do not support writing the
> + * xattr, we just use a new system call to get/set the fscap.
> + * The kernel can be in charge of watching the version numbers.
> + * After all, we can't allow the container to override the
> + * fscaps of the init ns.
> + *
> + * @flags currently only containers the effective bit.  The
> + * other bits are reserved, and must be 0 at the moment.
> + * @rootid contains the kuid value of the root in the namespace
> + * for which this capability should be used.  If -1, then this
> + * works for all namespaces.  Only root in the initial ns can
> + * use this.
> + *
> + * Q: do we want to use a range instead?  Then root in a container
> + * could allow one binary with one capability to be used by any
> + * nested containers.
> + */
> +#define VFS_NS_CAP_EFFECTIVE    0x1
> +struct vfs_ns_cap_data {
> +	__le32 flags;
> +	__le32 rootid;
> +	struct {
> +		__le32 permitted;    /* Little endian */
> +		__le32 inheritable;  /* Little endian */
> +	} data[VFS_CAP_U32];
> +};
> +
> +/*
> + * 32-bit hdr_info contains
> + * 16 leftmost: reserved
> + * next 8: ncaps
> + * last 8: version
> + */
> +struct vfs_ns_cap_header {
> +	__le32 hdr_info;
> +	/* ncaps * vfs_ns_cap_data */
> +};
> +
>  #ifndef __KERNEL__
>  
>  /*
> diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> index 1590c49..67c80ab 100644
> --- a/include/uapi/linux/xattr.h
> +++ b/include/uapi/linux/xattr.h
> @@ -68,6 +68,9 @@
>  #define XATTR_CAPS_SUFFIX "capability"
>  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>  
> +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> +
>  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
>  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
>  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 1832cf7..c44edf3 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -308,6 +308,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>  	if (!inode->i_op->getxattr)
>  	       return 0;
>  
> +	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> +	if (error > 0)
> +		return 1;
> +
>  	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
>  	if (error <= 0)
>  		return 0;
> @@ -325,11 +329,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>  int cap_inode_killpriv(struct dentry *dentry)
>  {
>  	struct inode *inode = d_backing_inode(dentry);
> +	int ret1, ret2;;
>  
>  	if (!inode->i_op->removexattr)
>  	       return 0;
>  
> -	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> +
> +	if (ret1 != 0)
> +		return ret1;
> +	return ret2;
>  }
>  
>  /*
> @@ -433,6 +443,117 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>  	return 0;
>  }
>  
> +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> +{
> +	struct inode *inode = d_backing_inode(dentry);
> +	unsigned tocopy, i;
> +	int ret = 0, size, expected;
> +	unsigned len = 0;
> +	struct vfs_ns_cap_header *hdr;
> +	struct vfs_ns_cap_data *cap, *nscap = NULL;
> +	__u16 ncaps, version;
> +	__u32 hdr_info;
> +	kuid_t current_root, caprootuid;
> +
> +	memset(cpu_caps, 0, sizeof(*cpu_caps));
> +
> +	if (!inode || !inode->i_op->getxattr)
> +		return -ENODATA;
> +
> +	/* get the size */
> +	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> +			NULL, 0);
> +	if (size == -ENODATA || size == -EOPNOTSUPP)
> +		/* no data, that's ok */
> +		return -ENODATA;
> +	if (size < 0)
> +		return size;
> +	if (size < sizeof(struct cpu_vfs_ns_cap_header))
> +		return -EINVAL;
> +	if (size > sizeof(struct cpu_vfs_ns_cap_header) + 255 * sizeof(struct vfs_ns_cap_data))
> +		return -EINVAL;
> +	len = size;
> +
> +	hdr = kmalloc(len + 1, GFP_NOFS);
> +	if (!hdr)
> +		return -ENOMEM;
> +
> +	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS, hdr,
> +				   len);
> +	if (size < 0) {
> +		ret = size;
> +		goto out;
> +	}
> +
> +	if (size != len) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	hdr_info = le32_to_cpu(hdr->hdr_info);
> +	version = NS_CAPS_VERSION(hdr_info);
> +	ncaps = NS_CAPS_NCAPS(hdr_info);
> +
> +	if (version != VFS_NS_CAP_REVISION) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	expected = sizeof(*hdr) + ncaps * sizeof(*cap);
> +	if (size != expected) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	tocopy = VFS_CAP_U32;
> +
> +	/* find an applicable entry */
> +	/* a global entry (uid == -1) takes precedence */
> +	current_root = make_kuid(current_user_ns(), 0);
> +	if (!uid_valid(current_root)) {
> +		/* no root user in this namespace;  no capabilities */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	for (i = 0, cap = (void *) hdr + sizeof(*hdr); i < ncaps; cap += sizeof(*cap), i++) {
> +		uid_t uid = le32_to_cpu(cap->rootid);
> +		if (uid == -1) {
> +			nscap = cap;
> +			break;
> +		}
> +
> +		caprootuid = make_kuid(&init_user_ns, uid);
> +		if (uid_eq(caprootuid, current_root))
> +			nscap = cap;
> +	}
> +
> +	if (!nscap) {
> +		/* nothing found for this namespace */
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +
> +	/* copy the entry */
> +	CAP_FOR_EACH_U32(i) {
> +		if (i >= tocopy)
> +			break;
> +		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap->data[i].permitted);
> +		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap->data[i].inheritable);
> +	}
> +
> +	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +
> +	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> +	if (nscap->flags & VFS_NS_CAP_EFFECTIVE)
> +		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> +
> +out:
> +	kfree(hdr);
> +
> +	return ret;
> +}
> +
>  /*
>   * Attempt to get the on-exec apply capability sets for an executable file from
>   * its xattrs and, if present, apply them to the proposed credentials being
> @@ -451,11 +572,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>  	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
>  		return 0;
>  
> -	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +	if (rc == -ENODATA)
> +		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>  	if (rc < 0) {
>  		if (rc == -EINVAL)
> -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> -				__func__, rc, bprm->filename);
> +			printk(KERN_NOTICE "Got EINVAL reading file caps for %s\n",
> +				bprm->filename);
>  		else if (rc == -ENODATA)
>  			rc = 0;
>  		goto out;
> @@ -651,7 +774,7 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  		       const void *value, size_t size, int flags)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> +	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
>  		if (!capable(CAP_SETFCAP))
>  			return -EPERM;
>  		return 0;
> @@ -677,7 +800,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> +	if (!strcmp(name, XATTR_NAME_CAPS) || !strcmp(name, XATTR_NAME_NS_CAPS)) {
>  		if (!capable(CAP_SETFCAP))
>  			return -EPERM;
>  		return 0;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2015-12-01  3:51     ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2015-12-01  3:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, lkml, Andrew Morgan, Andy Lutomirski, lxc-devel,
	Richard Weinberger, LSM, linux-api, keescook

On Mon, Nov 30, 2015 at 05:08:34PM -0600, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> 
> > A common way for daemons to run with minimal privilege is to start as root,
> > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > then change uid to non-root.  A simpler way to achieve this is to set file
> > capabilities on a not-setuid-root binary.  However, when installing a package
> > inside a (user-namespaced) container, packages cannot be installed with file
> > capabilities.  For this reason, containers must install ping setuid-root.
> 
> Don't ping sockets avoid that specific problem?
> 
> I expect the general case still holds.

Hah - yes, I guess do I have to update my 10 year old default example :)

> > To achieve this, we would need for containers to be able to request file
> > capabilities be added to a file without causing these to be honored in the
> > initial user namespace.
> >
> > To this end, the patch below introduces a new capability xattr format.  The
> > main enhancement over the existing security.capability xattr is that we
> > tag capability sets with a uid - the uid of the root user in the namespace
> > where the capabilities are set.  The capabilities will be ignored in any
> > other namespace.  The special case of uid == -1 (which must only ever be
> > able to be set by kuid 0) means use the capabilities in all
> > namespaces.
> 
> A quick comment on this.
> 
> We currently allow capabilities that have been gained to be valid in all
> descendent user namespaces.
> 
> Applying this principle to the on-disk capabilities would make it so
> that uid 0 would mean capabilities in all namespaces.
> 
> It might be worth it to introduce a fixed sized array with a length
> parameter of perhaps 32 entries which is a path of root uids as seen by
> the initial user namespace.  That way the entire construction of the
> user namespace could be verified.  AKA verify the current user namespace
> and the parent and the parents parent.  Up to the user namespace the

Hm, so if container b runs in container a, a has rootid 100000 and a
range of 200000, and b has root kuid 200000, range 65536, iiuc you're
suggesting that for a binary in container b we store [100000,200000] ?
I'm not sure that's helpful, though - uid 200000 in a user namespace
with 200000 mapped to root, is all powerful anyway.  I was actually
thinking (with the uid ranges) of making the connection looser, not
tighter.

> current filesystem is mounted in. We would look at how much space
> allows an xattr to be stored without causing filesystems a challenge
> to properly size such an array.
> 
> Given that uids are fundamentally flat that might not be particularly
> useful.

Right, I think that's the conclusion I've drawn above (if I'm not
misunderstanding you)

>   If we add an alternative way of identifying user namespaces
> say a privileged operation that set a uuid, then the complete path would
> be more interesting.
> 
> > An alternative format would use a pair of uids to indicate a range of rootids.
> > This would allow root in a user namespace with uids 100000-165536 mapped to
> > set the xattr once on a file, then launch nested containers wherein the file
> > could be used with privilege.  That's not what this patch does, but would be
> > a trivial change if people think it would be worthwhile.
> >
> > This patch does not actually address the real problem, which is setting the
> > xattrs from inside containers.  For that, I think the best solution is to
> > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > the kernel would, if not in init_user_ns, react by writing an appropriate
> > security.nscapability xattr.
> 
> That feels hard to maintain, but you may be correct that we have a small

Hard to maintain in which sense?  Complicated for userspace software, or
becoming too complicated in the kernel's bprm-capabilities code?

> enough userspace that it would not be a problem.

Yeah I'm thinking we can hide it all behind libcap2.   Unless we go with
uid ranges, in which case we'd need a way to expose that, but that would
be an optional extension, the sane default would be transparent, so no big
deal.

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2015-12-01  3:51     ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2015-12-01  3:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, lkml, Andrew Morgan, Andy Lutomirski,
	lxc-devel-cunTk1MwBs9qMoObBWhMNEqPaTDuhLve2LY78lusg7I,
	Richard Weinberger, LSM, linux-api-u79uwXL29TY76Z2rM5mHXA,
	keescook-F7+t8E8rja9g9hUCZPvPmw

On Mon, Nov 30, 2015 at 05:08:34PM -0600, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> 
> > A common way for daemons to run with minimal privilege is to start as root,
> > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > then change uid to non-root.  A simpler way to achieve this is to set file
> > capabilities on a not-setuid-root binary.  However, when installing a package
> > inside a (user-namespaced) container, packages cannot be installed with file
> > capabilities.  For this reason, containers must install ping setuid-root.
> 
> Don't ping sockets avoid that specific problem?
> 
> I expect the general case still holds.

Hah - yes, I guess do I have to update my 10 year old default example :)

> > To achieve this, we would need for containers to be able to request file
> > capabilities be added to a file without causing these to be honored in the
> > initial user namespace.
> >
> > To this end, the patch below introduces a new capability xattr format.  The
> > main enhancement over the existing security.capability xattr is that we
> > tag capability sets with a uid - the uid of the root user in the namespace
> > where the capabilities are set.  The capabilities will be ignored in any
> > other namespace.  The special case of uid == -1 (which must only ever be
> > able to be set by kuid 0) means use the capabilities in all
> > namespaces.
> 
> A quick comment on this.
> 
> We currently allow capabilities that have been gained to be valid in all
> descendent user namespaces.
> 
> Applying this principle to the on-disk capabilities would make it so
> that uid 0 would mean capabilities in all namespaces.
> 
> It might be worth it to introduce a fixed sized array with a length
> parameter of perhaps 32 entries which is a path of root uids as seen by
> the initial user namespace.  That way the entire construction of the
> user namespace could be verified.  AKA verify the current user namespace
> and the parent and the parents parent.  Up to the user namespace the

Hm, so if container b runs in container a, a has rootid 100000 and a
range of 200000, and b has root kuid 200000, range 65536, iiuc you're
suggesting that for a binary in container b we store [100000,200000] ?
I'm not sure that's helpful, though - uid 200000 in a user namespace
with 200000 mapped to root, is all powerful anyway.  I was actually
thinking (with the uid ranges) of making the connection looser, not
tighter.

> current filesystem is mounted in. We would look at how much space
> allows an xattr to be stored without causing filesystems a challenge
> to properly size such an array.
> 
> Given that uids are fundamentally flat that might not be particularly
> useful.

Right, I think that's the conclusion I've drawn above (if I'm not
misunderstanding you)

>   If we add an alternative way of identifying user namespaces
> say a privileged operation that set a uuid, then the complete path would
> be more interesting.
> 
> > An alternative format would use a pair of uids to indicate a range of rootids.
> > This would allow root in a user namespace with uids 100000-165536 mapped to
> > set the xattr once on a file, then launch nested containers wherein the file
> > could be used with privilege.  That's not what this patch does, but would be
> > a trivial change if people think it would be worthwhile.
> >
> > This patch does not actually address the real problem, which is setting the
> > xattrs from inside containers.  For that, I think the best solution is to
> > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > the kernel would, if not in init_user_ns, react by writing an appropriate
> > security.nscapability xattr.
> 
> That feels hard to maintain, but you may be correct that we have a small

Hard to maintain in which sense?  Complicated for userspace software, or
becoming too complicated in the kernel's bprm-capabilities code?

> enough userspace that it would not be a problem.

Yeah I'm thinking we can hide it all behind libcap2.   Unless we go with
uid ranges, in which case we'd need a way to expose that, but that would
be an optional extension, the sane default would be transparent, so no big
deal.

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2015-11-30 23:08   ` Eric W. Biederman
  (?)
  (?)
@ 2015-12-04 20:21   ` Serge E. Hallyn
  2016-01-19  7:09     ` Serge E. Hallyn
  2016-01-20 12:48     ` Jann Horn
  -1 siblings, 2 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2015-12-04 20:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, lkml, Andrew Morgan, Andy Lutomirski, lxc-devel,
	Richard Weinberger, LSM, linux-api, keescook

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> 
> > A common way for daemons to run with minimal privilege is to start as root,
> > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > then change uid to non-root.  A simpler way to achieve this is to set file
> > capabilities on a not-setuid-root binary.  However, when installing a package
> > inside a (user-namespaced) container, packages cannot be installed with file
> > capabilities.  For this reason, containers must install ping setuid-root.
> 
> Don't ping sockets avoid that specific problem?
> 
> I expect the general case still holds.
> 
> > To achieve this, we would need for containers to be able to request file
> > capabilities be added to a file without causing these to be honored in the
> > initial user namespace.
> >
> > To this end, the patch below introduces a new capability xattr format.  The
> > main enhancement over the existing security.capability xattr is that we
> > tag capability sets with a uid - the uid of the root user in the namespace
> > where the capabilities are set.  The capabilities will be ignored in any
> > other namespace.  The special case of uid == -1 (which must only ever be
> > able to be set by kuid 0) means use the capabilities in all
> > namespaces.

really since security.capability xattrs are currently honored in all
namespaces this isn't really necessary.  Until and unless Seth's set
changes that.

> 
> A quick comment on this.
> 
> We currently allow capabilities that have been gained to be valid in all
> descendent user namespaces.
> 
> Applying this principle to the on-disk capabilities would make it so
> that uid 0 would mean capabilities in all namespaces.
> 
> It might be worth it to introduce a fixed sized array with a length
> parameter of perhaps 32 entries which is a path of root uids as seen by
> the initial user namespace.  That way the entire construction of the
> user namespace could be verified.  AKA verify the current user namespace
> and the parent and the parents parent.  Up to the user namespace the
> current filesystem is mounted in. We would look at how much space
> allows an xattr to be stored without causing filesystems a challenge
> to properly size such an array.
> 
> Given that uids are fundamentally flat that might not be particularly
> useful.  If we add an alternative way of identifying user namespaces
> say a privileged operation that set a uuid, then the complete path would
> be more interesting.
> 
> > An alternative format would use a pair of uids to indicate a range of rootids.
> > This would allow root in a user namespace with uids 100000-165536 mapped to
> > set the xattr once on a file, then launch nested containers wherein the file
> > could be used with privilege.  That's not what this patch does, but would be
> > a trivial change if people think it would be worthwhile.
> >
> > This patch does not actually address the real problem, which is setting the
> > xattrs from inside containers.  For that, I think the best solution is to
> > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > the kernel would, if not in init_user_ns, react by writing an appropriate
> > security.nscapability xattr.
> 
> That feels hard to maintain, but you may be correct that we have a small
> enough userspace that it would not be a problem.
> 
> Eric
> 
> 
> > The libcap2 library's cap_set_file/cap_get_file could be switched over
> > transparently to use this to hide its use from all callers.
> >
> > Comments appreciated.
> >
> > Note - In this patch, file capabilities only work for containers which have
> > a root uid defined.  We may want to allow -1 uids to work in all
> > namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> > about the implications of allowing a program privilege in a container where
> > there is no uid with privilege.  This needs more thought.

So for actually enabling (user-namespaced) containers to use these, there
are a few possibilities that come to mine.

1. A new setfcap (/getfcap) syscall.  Uses mapped uid 0 from
current_user_ns() to write a value in the security.nscapability xattr.
Userspace doesn't need to worry at all about namespace issues.

2. Just expect userspace to write a xattr;  kernel checks that no values
are changed for any other namespaces.  This could be a lot of parsing and
verifying in the kernel.

3. Switch the xattr scheme - instead of one security.nscapability xattr
with multiple entries, use security.nscapability.$(rootid).  Now the
kernel only needs to verify that the $rootid is valid for the writing
task, and we don't need a new syscall.  OTOH userspace needs to know
what it's doing.  Of course we can still hide that behind libcap2's helpers.

Any opinions on which way seems best?  1 does seem cleanest (and supports
use of seccomp if we want to forbit its use by some containers), but
involves a new pair of syscalls.  2 seems to me to be right out, but
others might disagree...

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2015-12-04 20:21   ` Serge E. Hallyn
@ 2016-01-19  7:09     ` Serge E. Hallyn
  2016-01-20 12:48     ` Jann Horn
  1 sibling, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-01-19  7:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Serge E. Hallyn, lkml, Andrew Morgan,
	Andy Lutomirski, lxc-devel, Richard Weinberger, LSM, linux-api,
	keescook

On Fri, Dec 04, 2015 at 02:21:16PM -0600, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> > 
> > > A common way for daemons to run with minimal privilege is to start as root,
> > > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > > then change uid to non-root.  A simpler way to achieve this is to set file
> > > capabilities on a not-setuid-root binary.  However, when installing a package
> > > inside a (user-namespaced) container, packages cannot be installed with file
> > > capabilities.  For this reason, containers must install ping setuid-root.
> > 
> > Don't ping sockets avoid that specific problem?
> > 
> > I expect the general case still holds.
> > 
> > > To achieve this, we would need for containers to be able to request file
> > > capabilities be added to a file without causing these to be honored in the
> > > initial user namespace.
> > >
> > > To this end, the patch below introduces a new capability xattr format.  The
> > > main enhancement over the existing security.capability xattr is that we
> > > tag capability sets with a uid - the uid of the root user in the namespace
> > > where the capabilities are set.  The capabilities will be ignored in any
> > > other namespace.  The special case of uid == -1 (which must only ever be
> > > able to be set by kuid 0) means use the capabilities in all
> > > namespaces.
> 
> really since security.capability xattrs are currently honored in all
> namespaces this isn't really necessary.  Until and unless Seth's set
> changes that.
> 
> > 
> > A quick comment on this.
> > 
> > We currently allow capabilities that have been gained to be valid in all
> > descendent user namespaces.
> > 
> > Applying this principle to the on-disk capabilities would make it so
> > that uid 0 would mean capabilities in all namespaces.
> > 
> > It might be worth it to introduce a fixed sized array with a length
> > parameter of perhaps 32 entries which is a path of root uids as seen by
> > the initial user namespace.  That way the entire construction of the
> > user namespace could be verified.  AKA verify the current user namespace
> > and the parent and the parents parent.  Up to the user namespace the
> > current filesystem is mounted in. We would look at how much space
> > allows an xattr to be stored without causing filesystems a challenge
> > to properly size such an array.
> > 
> > Given that uids are fundamentally flat that might not be particularly
> > useful.  If we add an alternative way of identifying user namespaces
> > say a privileged operation that set a uuid, then the complete path would
> > be more interesting.
> > 
> > > An alternative format would use a pair of uids to indicate a range of rootids.
> > > This would allow root in a user namespace with uids 100000-165536 mapped to
> > > set the xattr once on a file, then launch nested containers wherein the file
> > > could be used with privilege.  That's not what this patch does, but would be
> > > a trivial change if people think it would be worthwhile.
> > >
> > > This patch does not actually address the real problem, which is setting the
> > > xattrs from inside containers.  For that, I think the best solution is to
> > > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > > the kernel would, if not in init_user_ns, react by writing an appropriate
> > > security.nscapability xattr.
> > 
> > That feels hard to maintain, but you may be correct that we have a small
> > enough userspace that it would not be a problem.
> > 
> > Eric
> > 
> > 
> > > The libcap2 library's cap_set_file/cap_get_file could be switched over
> > > transparently to use this to hide its use from all callers.
> > >
> > > Comments appreciated.
> > >
> > > Note - In this patch, file capabilities only work for containers which have
> > > a root uid defined.  We may want to allow -1 uids to work in all
> > > namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> > > about the implications of allowing a program privilege in a container where
> > > there is no uid with privilege.  This needs more thought.
> 
> So for actually enabling (user-namespaced) containers to use these, there
> are a few possibilities that come to mine.
> 
> 1. A new setfcap (/getfcap) syscall.  Uses mapped uid 0 from
> current_user_ns() to write a value in the security.nscapability xattr.
> Userspace doesn't need to worry at all about namespace issues.
> 
> 2. Just expect userspace to write a xattr;  kernel checks that no values
> are changed for any other namespaces.  This could be a lot of parsing and
> verifying in the kernel.
> 
> 3. Switch the xattr scheme - instead of one security.nscapability xattr
> with multiple entries, use security.nscapability.$(rootid).  Now the
> kernel only needs to verify that the $rootid is valid for the writing
> task, and we don't need a new syscall.  OTOH userspace needs to know
> what it's doing.  Of course we can still hide that behind libcap2's helpers.
> 
> Any opinions on which way seems best?  1 does seem cleanest (and supports
> use of seccomp if we want to forbit its use by some containers), but
> involves a new pair of syscalls.  2 seems to me to be right out, but
> others might disagree...

ok - I'm going to lean toward option 1 failing any convincing arguments
otherwise.

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2015-11-30 22:43 [PATCH RFC] Introduce new security.nscapability xattr Serge E. Hallyn
  2015-11-30 23:08   ` Eric W. Biederman
@ 2016-01-20 12:14 ` Jann Horn
  1 sibling, 0 replies; 19+ messages in thread
From: Jann Horn @ 2016-01-20 12:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: lkml, Andrew Morgan, Andy Lutomirski, Eric W. Biederman,
	lxc-devel, Richard Weinberger, LSM, linux-api, keescook

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

On Mon, Nov 30, 2015 at 04:43:56PM -0600, Serge E. Hallyn wrote:
> +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> +{
[...]
> +	/* find an applicable entry */
> +	/* a global entry (uid == -1) takes precedence */
> +	current_root = make_kuid(current_user_ns(), 0);
> +	if (!uid_valid(current_root)) {
> +		/* no root user in this namespace;  no capabilities */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	for (i = 0, cap = (void *) hdr + sizeof(*hdr); i < ncaps; cap += sizeof(*cap), i++) {
> +		uid_t uid = le32_to_cpu(cap->rootid);
> +		if (uid == -1) {
> +			nscap = cap;
> +			break;
> +		}
> +
> +		caprootuid = make_kuid(&init_user_ns, uid);
> +		if (uid_eq(caprootuid, current_root))
> +			nscap = cap;
> +	}

Wouldn't it be more consistent to check against the root uids of all parent
namespaces until one matches?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2015-12-04 20:21   ` Serge E. Hallyn
  2016-01-19  7:09     ` Serge E. Hallyn
@ 2016-01-20 12:48     ` Jann Horn
  2016-01-27 16:08       ` Serge E. Hallyn
  1 sibling, 1 reply; 19+ messages in thread
From: Jann Horn @ 2016-01-20 12:48 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Serge E. Hallyn, lkml, Andrew Morgan,
	Andy Lutomirski, lxc-devel, Richard Weinberger, LSM, linux-api,
	keescook

[-- Attachment #1: Type: text/plain, Size: 7646 bytes --]

On Fri, Dec 04, 2015 at 02:21:16PM -0600, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> > 
> > > A common way for daemons to run with minimal privilege is to start as root,
> > > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > > then change uid to non-root.  A simpler way to achieve this is to set file
> > > capabilities on a not-setuid-root binary.  However, when installing a package
> > > inside a (user-namespaced) container, packages cannot be installed with file
> > > capabilities.  For this reason, containers must install ping setuid-root.
> > 
> > Don't ping sockets avoid that specific problem?
> > 
> > I expect the general case still holds.
> > 
> > > To achieve this, we would need for containers to be able to request file
> > > capabilities be added to a file without causing these to be honored in the
> > > initial user namespace.
> > >
> > > To this end, the patch below introduces a new capability xattr format.  The
> > > main enhancement over the existing security.capability xattr is that we
> > > tag capability sets with a uid - the uid of the root user in the namespace
> > > where the capabilities are set.  The capabilities will be ignored in any
> > > other namespace.  The special case of uid == -1 (which must only ever be
> > > able to be set by kuid 0) means use the capabilities in all
> > > namespaces.
> 
> really since security.capability xattrs are currently honored in all
> namespaces this isn't really necessary.  Until and unless Seth's set
> changes that.
> 
> > 
> > A quick comment on this.
> > 
> > We currently allow capabilities that have been gained to be valid in all
> > descendent user namespaces.
> > 
> > Applying this principle to the on-disk capabilities would make it so
> > that uid 0 would mean capabilities in all namespaces.
> > 
> > It might be worth it to introduce a fixed sized array with a length
> > parameter of perhaps 32 entries which is a path of root uids as seen by
> > the initial user namespace.  That way the entire construction of the
> > user namespace could be verified.  AKA verify the current user namespace
> > and the parent and the parents parent.  Up to the user namespace the
> > current filesystem is mounted in. We would look at how much space
> > allows an xattr to be stored without causing filesystems a challenge
> > to properly size such an array.
> > 
> > Given that uids are fundamentally flat that might not be particularly
> > useful.  If we add an alternative way of identifying user namespaces
> > say a privileged operation that set a uuid, then the complete path would
> > be more interesting.
> > 
> > > An alternative format would use a pair of uids to indicate a range of rootids.
> > > This would allow root in a user namespace with uids 100000-165536 mapped to
> > > set the xattr once on a file, then launch nested containers wherein the file
> > > could be used with privilege.  That's not what this patch does, but would be
> > > a trivial change if people think it would be worthwhile.
> > >
> > > This patch does not actually address the real problem, which is setting the
> > > xattrs from inside containers.  For that, I think the best solution is to
> > > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > > the kernel would, if not in init_user_ns, react by writing an appropriate
> > > security.nscapability xattr.
> > 
> > That feels hard to maintain, but you may be correct that we have a small
> > enough userspace that it would not be a problem.
> > 
> > Eric
> > 
> > 
> > > The libcap2 library's cap_set_file/cap_get_file could be switched over
> > > transparently to use this to hide its use from all callers.
> > >
> > > Comments appreciated.
> > >
> > > Note - In this patch, file capabilities only work for containers which have
> > > a root uid defined.  We may want to allow -1 uids to work in all
> > > namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> > > about the implications of allowing a program privilege in a container where
> > > there is no uid with privilege.  This needs more thought.
> 
> So for actually enabling (user-namespaced) containers to use these, there
> are a few possibilities that come to mine.
> 
> 1. A new setfcap (/getfcap) syscall.  Uses mapped uid 0 from
> current_user_ns() to write a value in the security.nscapability xattr.
> Userspace doesn't need to worry at all about namespace issues.

Your patch has an "ncaps" field and supports giving (possibly different)
privileges to different namespace root users.
I think that with a setfcap() syscall as you describe, it would be hard
to add more than one security.nscapability entry to a file without
poking holes into stuff:

If setfcap() requires inode write access, every namespace root for whom
a security.nscapability entry can be added can overwrite the file, so
namespace roots might be able to execute code as each other.

If setfcap() doesn't require inode write access, that isn't a big
security issue, but it allows unprivileged users to waste disk space
in a writable-mounted filesystem on which they have no write access
anywhere. I'm not sure whether that is a sufficiently big problem for
anyone to care.


Another issue with this would be that it makes restoring sub-namespace
capabilities from a backup somewhat ugly, at least if you're not in the
init ns: You'd have to parse the capabilities attribute, then, for every
attribute you want to restore whose rootid doesn't refer to your
namespace, clone(CLONE_NEWUSER), map the target uid to 0 from the parent
process, seteuid(0), do the setfcap() and exit.

It would require less synchronization with my ptrace hardening patch
that checks whether uids are mapped (if that lands, I don't think it has
yet?) because that would allow you to first setuid, then clone(), map
the uid to 0 inside the namespace and setfcap(), but it'd still require
special-case code in backup software.


> 2. Just expect userspace to write a xattr;  kernel checks that no values
> are changed for any other namespaces.  This could be a lot of parsing and
> verifying in the kernel.

This option would allow the first problem I described with option 1 (if
it is a problem?) to be worked around by letting a helper in the outer
namespace add capabilities for lower namespaces - but it wouldn't be
pretty.

I think it would also allow tar, which already supports xattrs, to
deal with namespace capabilities without needing any additional code.
I think that would be nice to have.


> 3. Switch the xattr scheme - instead of one security.nscapability xattr
> with multiple entries, use security.nscapability.$(rootid).  Now the
> kernel only needs to verify that the $rootid is valid for the writing
> task, and we don't need a new syscall.  OTOH userspace needs to know
> what it's doing.  Of course we can still hide that behind libcap2's helpers.

This doesn't sound so good - how would the namespace know its rootid?
If it is just one level below init_ns, it can grab it from
/proc/self/uid_map, but for deeper levels that won't work.


> Any opinions on which way seems best?  1 does seem cleanest (and supports
> use of seccomp if we want to forbit its use by some containers), but
> involves a new pair of syscalls.  2 seems to me to be right out, but
> others might disagree...

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2016-01-20 12:48     ` Jann Horn
@ 2016-01-27 16:08       ` Serge E. Hallyn
  2016-01-27 17:22         ` Jann Horn
  0 siblings, 1 reply; 19+ messages in thread
From: Serge E. Hallyn @ 2016-01-27 16:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Serge E. Hallyn, Eric W. Biederman, Serge E. Hallyn, lkml,
	Andrew Morgan, Andy Lutomirski, lxc-devel, Richard Weinberger,
	LSM, linux-api, keescook

On Wed, Jan 20, 2016 at 01:48:16PM +0100, Jann Horn wrote:
> On Fri, Dec 04, 2015 at 02:21:16PM -0600, Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> > > 
> > > > A common way for daemons to run with minimal privilege is to start as root,
> > > > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > > > then change uid to non-root.  A simpler way to achieve this is to set file
> > > > capabilities on a not-setuid-root binary.  However, when installing a package
> > > > inside a (user-namespaced) container, packages cannot be installed with file
> > > > capabilities.  For this reason, containers must install ping setuid-root.
> > > 
> > > Don't ping sockets avoid that specific problem?
> > > 
> > > I expect the general case still holds.
> > > 
> > > > To achieve this, we would need for containers to be able to request file
> > > > capabilities be added to a file without causing these to be honored in the
> > > > initial user namespace.
> > > >
> > > > To this end, the patch below introduces a new capability xattr format.  The
> > > > main enhancement over the existing security.capability xattr is that we
> > > > tag capability sets with a uid - the uid of the root user in the namespace
> > > > where the capabilities are set.  The capabilities will be ignored in any
> > > > other namespace.  The special case of uid == -1 (which must only ever be
> > > > able to be set by kuid 0) means use the capabilities in all
> > > > namespaces.
> > 
> > really since security.capability xattrs are currently honored in all
> > namespaces this isn't really necessary.  Until and unless Seth's set
> > changes that.
> > 
> > > 
> > > A quick comment on this.
> > > 
> > > We currently allow capabilities that have been gained to be valid in all
> > > descendent user namespaces.
> > > 
> > > Applying this principle to the on-disk capabilities would make it so
> > > that uid 0 would mean capabilities in all namespaces.
> > > 
> > > It might be worth it to introduce a fixed sized array with a length
> > > parameter of perhaps 32 entries which is a path of root uids as seen by
> > > the initial user namespace.  That way the entire construction of the
> > > user namespace could be verified.  AKA verify the current user namespace
> > > and the parent and the parents parent.  Up to the user namespace the
> > > current filesystem is mounted in. We would look at how much space
> > > allows an xattr to be stored without causing filesystems a challenge
> > > to properly size such an array.
> > > 
> > > Given that uids are fundamentally flat that might not be particularly
> > > useful.  If we add an alternative way of identifying user namespaces
> > > say a privileged operation that set a uuid, then the complete path would
> > > be more interesting.
> > > 
> > > > An alternative format would use a pair of uids to indicate a range of rootids.
> > > > This would allow root in a user namespace with uids 100000-165536 mapped to
> > > > set the xattr once on a file, then launch nested containers wherein the file
> > > > could be used with privilege.  That's not what this patch does, but would be
> > > > a trivial change if people think it would be worthwhile.
> > > >
> > > > This patch does not actually address the real problem, which is setting the
> > > > xattrs from inside containers.  For that, I think the best solution is to
> > > > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > > > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > > > the kernel would, if not in init_user_ns, react by writing an appropriate
> > > > security.nscapability xattr.
> > > 
> > > That feels hard to maintain, but you may be correct that we have a small
> > > enough userspace that it would not be a problem.
> > > 
> > > Eric
> > > 
> > > 
> > > > The libcap2 library's cap_set_file/cap_get_file could be switched over
> > > > transparently to use this to hide its use from all callers.
> > > >
> > > > Comments appreciated.
> > > >
> > > > Note - In this patch, file capabilities only work for containers which have
> > > > a root uid defined.  We may want to allow -1 uids to work in all
> > > > namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> > > > about the implications of allowing a program privilege in a container where
> > > > there is no uid with privilege.  This needs more thought.
> > 
> > So for actually enabling (user-namespaced) containers to use these, there
> > are a few possibilities that come to mine.
> > 
> > 1. A new setfcap (/getfcap) syscall.  Uses mapped uid 0 from
> > current_user_ns() to write a value in the security.nscapability xattr.
> > Userspace doesn't need to worry at all about namespace issues.
> 
> Your patch has an "ncaps" field and supports giving (possibly different)
> privileges to different namespace root users.
> I think that with a setfcap() syscall as you describe, it would be hard
> to add more than one security.nscapability entry to a file without
> poking holes into stuff:
> 
> If setfcap() requires inode write access, every namespace root for whom
> a security.nscapability entry can be added can overwrite the file, so
> namespace roots might be able to execute code as each other.
> 
> If setfcap() doesn't require inode write access, that isn't a big
> security issue, but it allows unprivileged users to waste disk space
> in a writable-mounted filesystem on which they have no write access
> anywhere. I'm not sure whether that is a sufficiently big problem for
> anyone to care.
> 
> 
> Another issue with this would be that it makes restoring sub-namespace
> capabilities from a backup somewhat ugly, at least if you're not in the
> init ns: You'd have to parse the capabilities attribute, then, for every
> attribute you want to restore whose rootid doesn't refer to your
> namespace, clone(CLONE_NEWUSER), map the target uid to 0 from the parent
> process, seteuid(0), do the setfcap() and exit.
> 
> It would require less synchronization with my ptrace hardening patch
> that checks whether uids are mapped (if that lands, I don't think it has
> yet?) because that would allow you to first setuid, then clone(), map
> the uid to 0 inside the namespace and setfcap(), but it'd still require
> special-case code in backup software.
> 
> 
> > 2. Just expect userspace to write a xattr;  kernel checks that no values
> > are changed for any other namespaces.  This could be a lot of parsing and
> > verifying in the kernel.
> 
> This option would allow the first problem I described with option 1 (if
> it is a problem?) to be worked around by letting a helper in the outer
> namespace add capabilities for lower namespaces - but it wouldn't be

Note that regardless, the fscaps would still be stored as an xattr.
The setfcap/getfacp syscalls would be for convenience, but a suitably
privileged task in the init_user_ns could just write the xattr.

You've made a good point here and in your prev email: when a namespaced
file cap exists for say uid 1000, perhaps it should apply for every
descendent ns of uid 1000.  In this case maybe it makes sense to drop
the nscaps field and actually only allow one namespace fscap.  It must
be set by the owner of the file (i.e., uid 1000, or uid 0 if a
host-root-owned file) or someone privileged over it (i.e. uid 100000
who is root in his ns and i_sb->s_user_ns->owner for the file).

So if the host has /bin/ping with cap_net_raw=ep, any user in any
ns which can reach and xecute it can run the file with those privileges
(in their own ns).  If uid 1000 creates a user_ns with root 100000,
that container can create a $rootfs/bin/ping with cap_net_raw=ep
with uid tag 100000.  If that container creates a sub-container owned
by uid 150000, that container can also run $rootfs/bin/ping with
cap_net_raw=ep, but uid 1001 on the host does not.  Now we don't have
the risk of wasting disk space, and restoring a cap-endowed file in
a sub-userns is trivial.

In this case we can probably also do away with the setfcap() syscall,
as userspace only needs to look for the "0 x y" /proc/self/uid_map
entry and enter x in the xattr.

Does this seem reasonable and somewhat risk-averse?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2016-01-27 16:08       ` Serge E. Hallyn
@ 2016-01-27 17:22         ` Jann Horn
  2016-01-28  0:36             ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Jann Horn @ 2016-01-27 17:22 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Serge E. Hallyn, lkml, Andrew Morgan,
	Andy Lutomirski, lxc-devel, Richard Weinberger, LSM, linux-api,
	keescook

[-- Attachment #1: Type: text/plain, Size: 9886 bytes --]

On Wed, Jan 27, 2016 at 10:08:15AM -0600, Serge E. Hallyn wrote:
> On Wed, Jan 20, 2016 at 01:48:16PM +0100, Jann Horn wrote:
> > On Fri, Dec 04, 2015 at 02:21:16PM -0600, Serge E. Hallyn wrote:
> > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
> > > > 
> > > > > A common way for daemons to run with minimal privilege is to start as root,
> > > > > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEPCAPS,
> > > > > then change uid to non-root.  A simpler way to achieve this is to set file
> > > > > capabilities on a not-setuid-root binary.  However, when installing a package
> > > > > inside a (user-namespaced) container, packages cannot be installed with file
> > > > > capabilities.  For this reason, containers must install ping setuid-root.
> > > > 
> > > > Don't ping sockets avoid that specific problem?
> > > > 
> > > > I expect the general case still holds.
> > > > 
> > > > > To achieve this, we would need for containers to be able to request file
> > > > > capabilities be added to a file without causing these to be honored in the
> > > > > initial user namespace.
> > > > >
> > > > > To this end, the patch below introduces a new capability xattr format.  The
> > > > > main enhancement over the existing security.capability xattr is that we
> > > > > tag capability sets with a uid - the uid of the root user in the namespace
> > > > > where the capabilities are set.  The capabilities will be ignored in any
> > > > > other namespace.  The special case of uid == -1 (which must only ever be
> > > > > able to be set by kuid 0) means use the capabilities in all
> > > > > namespaces.
> > > 
> > > really since security.capability xattrs are currently honored in all
> > > namespaces this isn't really necessary.  Until and unless Seth's set
> > > changes that.
> > > 
> > > > 
> > > > A quick comment on this.
> > > > 
> > > > We currently allow capabilities that have been gained to be valid in all
> > > > descendent user namespaces.
> > > > 
> > > > Applying this principle to the on-disk capabilities would make it so
> > > > that uid 0 would mean capabilities in all namespaces.
> > > > 
> > > > It might be worth it to introduce a fixed sized array with a length
> > > > parameter of perhaps 32 entries which is a path of root uids as seen by
> > > > the initial user namespace.  That way the entire construction of the
> > > > user namespace could be verified.  AKA verify the current user namespace
> > > > and the parent and the parents parent.  Up to the user namespace the
> > > > current filesystem is mounted in. We would look at how much space
> > > > allows an xattr to be stored without causing filesystems a challenge
> > > > to properly size such an array.
> > > > 
> > > > Given that uids are fundamentally flat that might not be particularly
> > > > useful.  If we add an alternative way of identifying user namespaces
> > > > say a privileged operation that set a uuid, then the complete path would
> > > > be more interesting.
> > > > 
> > > > > An alternative format would use a pair of uids to indicate a range of rootids.
> > > > > This would allow root in a user namespace with uids 100000-165536 mapped to
> > > > > set the xattr once on a file, then launch nested containers wherein the file
> > > > > could be used with privilege.  That's not what this patch does, but would be
> > > > > a trivial change if people think it would be worthwhile.
> > > > >
> > > > > This patch does not actually address the real problem, which is setting the
> > > > > xattrs from inside containers.  For that, I think the best solution is to
> > > > > add a pair of new system calls, setfcap and getfcap. Userspace would for
> > > > > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to which
> > > > > the kernel would, if not in init_user_ns, react by writing an appropriate
> > > > > security.nscapability xattr.
> > > > 
> > > > That feels hard to maintain, but you may be correct that we have a small
> > > > enough userspace that it would not be a problem.
> > > > 
> > > > Eric
> > > > 
> > > > 
> > > > > The libcap2 library's cap_set_file/cap_get_file could be switched over
> > > > > transparently to use this to hide its use from all callers.
> > > > >
> > > > > Comments appreciated.
> > > > >
> > > > > Note - In this patch, file capabilities only work for containers which have
> > > > > a root uid defined.  We may want to allow -1 uids to work in all
> > > > > namespaces.  There certainly would be uses for this, but I'm a bit unsettled
> > > > > about the implications of allowing a program privilege in a container where
> > > > > there is no uid with privilege.  This needs more thought.
> > > 
> > > So for actually enabling (user-namespaced) containers to use these, there
> > > are a few possibilities that come to mine.
> > > 
> > > 1. A new setfcap (/getfcap) syscall.  Uses mapped uid 0 from
> > > current_user_ns() to write a value in the security.nscapability xattr.
> > > Userspace doesn't need to worry at all about namespace issues.
> > 
> > Your patch has an "ncaps" field and supports giving (possibly different)
> > privileges to different namespace root users.
> > I think that with a setfcap() syscall as you describe, it would be hard
> > to add more than one security.nscapability entry to a file without
> > poking holes into stuff:
> > 
> > If setfcap() requires inode write access, every namespace root for whom
> > a security.nscapability entry can be added can overwrite the file, so
> > namespace roots might be able to execute code as each other.
> > 
> > If setfcap() doesn't require inode write access, that isn't a big
> > security issue, but it allows unprivileged users to waste disk space
> > in a writable-mounted filesystem on which they have no write access
> > anywhere. I'm not sure whether that is a sufficiently big problem for
> > anyone to care.
> > 
> > 
> > Another issue with this would be that it makes restoring sub-namespace
> > capabilities from a backup somewhat ugly, at least if you're not in the
> > init ns: You'd have to parse the capabilities attribute, then, for every
> > attribute you want to restore whose rootid doesn't refer to your
> > namespace, clone(CLONE_NEWUSER), map the target uid to 0 from the parent
> > process, seteuid(0), do the setfcap() and exit.
> > 
> > It would require less synchronization with my ptrace hardening patch
> > that checks whether uids are mapped (if that lands, I don't think it has
> > yet?) because that would allow you to first setuid, then clone(), map
> > the uid to 0 inside the namespace and setfcap(), but it'd still require
> > special-case code in backup software.
> > 
> > 
> > > 2. Just expect userspace to write a xattr;  kernel checks that no values
> > > are changed for any other namespaces.  This could be a lot of parsing and
> > > verifying in the kernel.
> > 
> > This option would allow the first problem I described with option 1 (if
> > it is a problem?) to be worked around by letting a helper in the outer
> > namespace add capabilities for lower namespaces - but it wouldn't be
> 
> Note that regardless, the fscaps would still be stored as an xattr.
> The setfcap/getfacp syscalls would be for convenience, but a suitably
> privileged task in the init_user_ns could just write the xattr.

Yes - I was thinking of a privileged task in an intermediate namespace,
which could then be allowed to backup and restore capability xattrs for
all mapped uids.
I'm not sure whether that's a realistic/common scenario. Maybe if someone
uses an LXC VPS inside which he uses LXC to further isolate various
services.


> You've made a good point here and in your prev email: when a namespaced
> file cap exists for say uid 1000, perhaps it should apply for every
> descendent ns of uid 1000.  In this case maybe it makes sense to drop
> the nscaps field and actually only allow one namespace fscap.  It must
> be set by the owner of the file (i.e., uid 1000, or uid 0 if a
> host-root-owned file) or someone privileged over it (i.e. uid 100000
> who is root in his ns and i_sb->s_user_ns->owner for the file).
> 
> So if the host has /bin/ping with cap_net_raw=ep, any user in any
> ns which can reach and xecute it can run the file with those privileges
> (in their own ns).  If uid 1000 creates a user_ns with root 100000,
> that container can create a $rootfs/bin/ping with cap_net_raw=ep
> with uid tag 100000.  If that container creates a sub-container owned
> by uid 150000, that container can also run $rootfs/bin/ping with
> cap_net_raw=ep, but uid 1001 on the host does not.  Now we don't have
> the risk of wasting disk space, and restoring a cap-endowed file in
> a sub-userns is trivial.
> 
> In this case we can probably also do away with the setfcap() syscall,
> as userspace only needs to look for the "0 x y" /proc/self/uid_map
> entry and enter x in the xattr.

That won't work in a namespace >1 levels below the init ns though,
right? Because /proc/self/uid_map only shows the uids in the current
user ns and its parent, not the kuids.

$ id
uid=1000[...]
$ unshare -Ur unshare -Ur cat /proc/self/uid_map
         0          0          1

This could of course be worked around by adding a "map this uid up
to a kuid" ioctl for user namespaces to ns_file_operations or so,
but I'm not sure how I feel about that.

Can we let the kernel map the uid in the file attribute up/down
on attribute access? If it's just one binary 32-bit field at the
start of the attribute value, the code should be fairly
straightforward.

> Does this seem reasonable and somewhat risk-averse?

I think it sounds good from a security perspective.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-01-28  0:36             ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-28  0:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Serge E. Hallyn, Eric W. Biederman, Serge E. Hallyn, lkml,
	Andrew Morgan, LXC development mailing-list, Richard Weinberger,
	LSM, Linux API, Kees Cook

On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> I think it sounds good from a security perspective.

I'm a bit late to the game, but I have a question: why should this be
keyed to the *root* uid of the namespace in particular?  Certainly if
user foo trusts the cap bits on some file, then user foo might trust
those caps to be exerted over any namespace that user foo owns, since
user foo owns the namespace.

But another option would be to include a list of uids and gids such
that the cap bits on the file are trusted by any namespace that maps
only uids and gids in the list.  After all, the existence of a
namespace with root user foo that also maps bar and baz along with a
file with caps set means that, if baz can get to the file and
permissions are set appropriately, then baz now owns bar (via any
number of fs-related capabilities).  So maybe bar and baz should have
to be listed as well.

But maybe this doesn't matter.

In any event, at the end of the day, the right answer to all of this
is to stop using setuid and stop using cap bits too and start using
privileged daemons or other things that don't use the eternally
fragile grant-privilege-on-execve mechanisms.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-01-28  0:36             ` Andy Lutomirski
  0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-01-28  0:36 UTC (permalink / raw)
  To: Jann Horn
  Cc: Serge E. Hallyn, Eric W. Biederman, Serge E. Hallyn, lkml,
	Andrew Morgan, LXC development mailing-list, Richard Weinberger,
	LSM, Linux API, Kees Cook

On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> wrote:
> I think it sounds good from a security perspective.

I'm a bit late to the game, but I have a question: why should this be
keyed to the *root* uid of the namespace in particular?  Certainly if
user foo trusts the cap bits on some file, then user foo might trust
those caps to be exerted over any namespace that user foo owns, since
user foo owns the namespace.

But another option would be to include a list of uids and gids such
that the cap bits on the file are trusted by any namespace that maps
only uids and gids in the list.  After all, the existence of a
namespace with root user foo that also maps bar and baz along with a
file with caps set means that, if baz can get to the file and
permissions are set appropriately, then baz now owns bar (via any
number of fs-related capabilities).  So maybe bar and baz should have
to be listed as well.

But maybe this doesn't matter.

In any event, at the end of the day, the right answer to all of this
is to stop using setuid and stop using cap bits too and start using
privileged daemons or other things that don't use the eternally
fragile grant-privilege-on-execve mechanisms.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
  2016-01-28  0:36             ` Andy Lutomirski
@ 2016-01-29  7:31               ` Serge E. Hallyn
  -1 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-01-29  7:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Serge E. Hallyn, Eric W. Biederman, Serge E. Hallyn,
	lkml, Andrew Morgan, LXC development mailing-list,
	Richard Weinberger, LSM, Linux API, Kees Cook

On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> > I think it sounds good from a security perspective.
> 
> I'm a bit late to the game, but I have a question: why should this be
> keyed to the *root* uid of the namespace in particular?  Certainly if
> user foo trusts the cap bits on some file, then user foo might trust
> those caps to be exerted over any namespace that user foo owns, since
> user foo owns the namespace.

...  Tying it to a kuid which represents the userns->owner of any
namespace in which the capability will be honored might be fine
with me.  Is that what you mean?  So if uid 1000 creates a userns
mapping uids 100000-200000, and 100000 in that container puts X=pe
on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
X.  Uid 101000 in someone else's container does not.

Although, if I create two containers and provide them different
uidmaps, it may well be because I want them segragated and want
to minimize the changes of one container breaking out into the
other.  This risks breaking that.

> But another option would be to include a list of uids and gids such
> that the cap bits on the file are trusted by any namespace that maps
> only uids and gids in the list.  After all, the existence of a
> namespace with root user foo that also maps bar and baz along with a
> file with caps set means that, if baz can get to the file and
> permissions are set appropriately, then baz now owns bar (via any
> number of fs-related capabilities).  So maybe bar and baz should have
> to be listed as well.
> 
> But maybe this doesn't matter.
> 
> In any event, at the end of the day, the right answer to all of this
> is to stop using setuid and stop using cap bits too and start using
> privileged daemons or other things that don't use the eternally
> fragile grant-privilege-on-execve mechanisms.

Heh, that's why I wrote a p9auth driver a few years ago, but it
was too early for such a thing.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-01-29  7:31               ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-01-29  7:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Serge E. Hallyn, Eric W. Biederman, Serge E. Hallyn,
	lkml, Andrew Morgan, LXC development mailing-list,
	Richard Weinberger, LSM, Linux API, Kees Cook

On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> > I think it sounds good from a security perspective.
> 
> I'm a bit late to the game, but I have a question: why should this be
> keyed to the *root* uid of the namespace in particular?  Certainly if
> user foo trusts the cap bits on some file, then user foo might trust
> those caps to be exerted over any namespace that user foo owns, since
> user foo owns the namespace.

...  Tying it to a kuid which represents the userns->owner of any
namespace in which the capability will be honored might be fine
with me.  Is that what you mean?  So if uid 1000 creates a userns
mapping uids 100000-200000, and 100000 in that container puts X=pe
on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
X.  Uid 101000 in someone else's container does not.

Although, if I create two containers and provide them different
uidmaps, it may well be because I want them segragated and want
to minimize the changes of one container breaking out into the
other.  This risks breaking that.

> But another option would be to include a list of uids and gids such
> that the cap bits on the file are trusted by any namespace that maps
> only uids and gids in the list.  After all, the existence of a
> namespace with root user foo that also maps bar and baz along with a
> file with caps set means that, if baz can get to the file and
> permissions are set appropriately, then baz now owns bar (via any
> number of fs-related capabilities).  So maybe bar and baz should have
> to be listed as well.
> 
> But maybe this doesn't matter.
> 
> In any event, at the end of the day, the right answer to all of this
> is to stop using setuid and stop using cap bits too and start using
> privileged daemons or other things that don't use the eternally
> fragile grant-privilege-on-execve mechanisms.

Heh, that's why I wrote a p9auth driver a few years ago, but it
was too early for such a thing.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-02-29 21:38                 ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-02-29 21:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Jann Horn, Serge E. Hallyn, Eric W. Biederman,
	lkml, Andrew Morgan, LXC development mailing-list,
	Richard Weinberger, LSM, Linux API, Kees Cook, Christian Brauner

On Fri, Jan 29, 2016 at 01:31:51AM -0600, Serge E. Hallyn wrote:
> On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> > > I think it sounds good from a security perspective.
> > 
> > I'm a bit late to the game, but I have a question: why should this be
> > keyed to the *root* uid of the namespace in particular?  Certainly if
> > user foo trusts the cap bits on some file, then user foo might trust
> > those caps to be exerted over any namespace that user foo owns, since
> > user foo owns the namespace.
> 
> ...  Tying it to a kuid which represents the userns->owner of any
> namespace in which the capability will be honored might be fine
> with me.  Is that what you mean?  So if uid 1000 creates a userns
> mapping uids 100000-200000, and 100000 in that container puts X=pe
> on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
> X.  Uid 101000 in someone else's container does not.
> 
> Although, if I create two containers and provide them different
> uidmaps, it may well be because I want them segragated and want
> to minimize the changes of one container breaking out into the
> other.  This risks breaking that.

Thinking differently now...  I really want it to "just work" to tar
and untar these.  So I'm thinking of simply using the file owner
as the uid.  So to write a security.ns_capability xattr, you must
be uid 0 in the inode's namespace, the file must be owned by uid 0,
and the capabilities in the xattr will be honored for any namespace
where in that uid_t 0 is root.

Does that sound overly restrictive?  I expect file capabilities to
be used on files owned by root but not setuid-root, so I think it
is ok.

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-02-29 21:38                 ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-02-29 21:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Jann Horn, Serge E. Hallyn, Eric W. Biederman,
	lkml, Andrew Morgan, LXC development mailing-list,
	Richard Weinberger, LSM, Linux API, Kees Cook, Christian Brauner

On Fri, Jan 29, 2016 at 01:31:51AM -0600, Serge E. Hallyn wrote:
> On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> wrote:
> > > I think it sounds good from a security perspective.
> > 
> > I'm a bit late to the game, but I have a question: why should this be
> > keyed to the *root* uid of the namespace in particular?  Certainly if
> > user foo trusts the cap bits on some file, then user foo might trust
> > those caps to be exerted over any namespace that user foo owns, since
> > user foo owns the namespace.
> 
> ...  Tying it to a kuid which represents the userns->owner of any
> namespace in which the capability will be honored might be fine
> with me.  Is that what you mean?  So if uid 1000 creates a userns
> mapping uids 100000-200000, and 100000 in that container puts X=pe
> on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
> X.  Uid 101000 in someone else's container does not.
> 
> Although, if I create two containers and provide them different
> uidmaps, it may well be because I want them segragated and want
> to minimize the changes of one container breaking out into the
> other.  This risks breaking that.

Thinking differently now...  I really want it to "just work" to tar
and untar these.  So I'm thinking of simply using the file owner
as the uid.  So to write a security.ns_capability xattr, you must
be uid 0 in the inode's namespace, the file must be owned by uid 0,
and the capabilities in the xattr will be honored for any namespace
where in that uid_t 0 is root.

Does that sound overly restrictive?  I expect file capabilities to
be used on files owned by root but not setuid-root, so I think it
is ok.

-serge

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-03-02  0:00                   ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-03-02  0:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge E. Hallyn, Andy Lutomirski, Jann Horn, Eric W. Biederman,
	lkml, Andrew Morgan, LXC development mailing-list,
	Richard Weinberger, LSM, Linux API, Kees Cook, Christian Brauner

On Mon, Feb 29, 2016 at 03:38:20PM -0600, Serge E. Hallyn wrote:
> On Fri, Jan 29, 2016 at 01:31:51AM -0600, Serge E. Hallyn wrote:
> > On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> > > On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> > > > I think it sounds good from a security perspective.
> > > 
> > > I'm a bit late to the game, but I have a question: why should this be
> > > keyed to the *root* uid of the namespace in particular?  Certainly if
> > > user foo trusts the cap bits on some file, then user foo might trust
> > > those caps to be exerted over any namespace that user foo owns, since
> > > user foo owns the namespace.
> > 
> > ...  Tying it to a kuid which represents the userns->owner of any
> > namespace in which the capability will be honored might be fine
> > with me.  Is that what you mean?  So if uid 1000 creates a userns
> > mapping uids 100000-200000, and 100000 in that container puts X=pe
> > on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
> > X.  Uid 101000 in someone else's container does not.
> > 
> > Although, if I create two containers and provide them different
> > uidmaps, it may well be because I want them segragated and want
> > to minimize the changes of one container breaking out into the
> > other.  This risks breaking that.
> 
> Thinking differently now...  I really want it to "just work" to tar
> and untar these.  So I'm thinking of simply using the file owner
> as the uid.  So to write a security.ns_capability xattr, you must
> be uid 0 in the inode's namespace, the file must be owned by uid 0,
> and the capabilities in the xattr will be honored for any namespace
> where in that uid_t 0 is root.
> 
> Does that sound overly restrictive?  I expect file capabilities to
> be used on files owned by root but not setuid-root, so I think it
> is ok.
> 
> -serge

Here is a working first draft:

>From 019ff81124b7dd3161414720f5666f6793a8ccd9 Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serge.hallyn@ubuntu.com>
Date: Tue, 1 Mar 2016 00:09:35 +0000
Subject: [PATCH 1/1] simplified security.nscapability xattr

This can only be set by root in his own namespace, and will
only be respected by namespaces with that same root kuid
mapped as root.  The file must be owned by the root user in
the container.  This allows us to avoid having to store a
'root user' value in the capability.

This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 include/linux/capability.h      |  5 ++-
 include/uapi/linux/capability.h | 18 +++++++++
 include/uapi/linux/xattr.h      |  3 ++
 security/commoncap.c            | 81 +++++++++++++++++++++++++++++++++++++++--
 4 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index af9f0b9..19a37a9 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
 	kernel_cap_t inheritable;
 };
 
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..f0b4a66 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION     1
+
 #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
 #define VFS_CAP_U32             VFS_CAP_U32_2
 #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
@@ -74,6 +77,21 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+#define VFS_NS_CAP_EFFECTIVE    0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+       __le32 magic_etc;
+       struct {
+               __le32 permitted;    /* Little endian */
+               __le32 inheritable;  /* Little endian */
+       } data[VFS_CAP_U32];
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 6f093f3..735d4c7 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -308,6 +308,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 	if (!inode->i_op->getxattr)
 	       return 0;
 
+	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+	if (error > 0)
+		return 1;
+
 	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
 	if (error <= 0)
 		return 0;
@@ -325,11 +329,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	int ret1, ret2;
 
 	if (!inode->i_op->removexattr)
 	       return 0;
 
-	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+	if (ret1 != 0)
+		return ret1;
+	return ret2;
 }
 
 /*
@@ -433,6 +443,55 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	unsigned tocopy, i;
+	u32 magic_etc;
+	ssize_t size;
+	struct vfs_ns_cap_data nscap;
+
+	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+	if (!inode || !inode->i_op->getxattr)
+		return -ENODATA;
+
+	/* verify that userns root owns this file */
+	if (from_kuid(current_user_ns(), dentry->d_inode->i_uid) != 0)
+		return -ENODATA;
+
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+			&nscap, sizeof(nscap));
+	if (size == -ENODATA || size == -EOPNOTSUPP)
+		/* no data, that's ok */
+		return -ENODATA;
+	if (size < 0)
+		return size;
+	if (size != sizeof(nscap))
+		return -EINVAL;
+
+	magic_etc = le32_to_cpu(nscap.magic_etc);
+
+	if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
+		return -EINVAL;
+
+	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+	if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
+		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+	/* copy the entry */
+	CAP_FOR_EACH_U32(i) {
+		if (i >= VFS_CAP_U32_2)
+			break;
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
+	}
+
+	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+	return 0;
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -453,11 +512,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (!current_in_userns(bprm->file->f_path.mnt->mnt_sb->s_user_ns))
 		return 0;
 
-	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	if (rc == -ENODATA)
+		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -661,6 +722,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !ns_capable(user_ns, CAP_SYS_ADMIN))
@@ -689,6 +756,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !ns_capable(user_ns, CAP_SYS_ADMIN))
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] Introduce new security.nscapability xattr
@ 2016-03-02  0:00                   ` Serge E. Hallyn
  0 siblings, 0 replies; 19+ messages in thread
From: Serge E. Hallyn @ 2016-03-02  0:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kees Cook, Richard Weinberger, lkml, Andy Lutomirski, LSM,
	LXC development mailing-list, Linux API, Jann Horn,
	Andrew Morgan

On Mon, Feb 29, 2016 at 03:38:20PM -0600, Serge E. Hallyn wrote:
> On Fri, Jan 29, 2016 at 01:31:51AM -0600, Serge E. Hallyn wrote:
> > On Wed, Jan 27, 2016 at 04:36:02PM -0800, Andy Lutomirski wrote:
> > > On Wed, Jan 27, 2016 at 9:22 AM, Jann Horn <jann@thejh.net> wrote:
> > > > I think it sounds good from a security perspective.
> > > 
> > > I'm a bit late to the game, but I have a question: why should this be
> > > keyed to the *root* uid of the namespace in particular?  Certainly if
> > > user foo trusts the cap bits on some file, then user foo might trust
> > > those caps to be exerted over any namespace that user foo owns, since
> > > user foo owns the namespace.
> > 
> > ...  Tying it to a kuid which represents the userns->owner of any
> > namespace in which the capability will be honored might be fine
> > with me.  Is that what you mean?  So if uid 1000 creates a userns
> > mapping uids 100000-200000, and 100000 in that container puts X=pe
> > on /bin/foo, uid 101000 in that container runs /bin/foo with privilege
> > X.  Uid 101000 in someone else's container does not.
> > 
> > Although, if I create two containers and provide them different
> > uidmaps, it may well be because I want them segragated and want
> > to minimize the changes of one container breaking out into the
> > other.  This risks breaking that.
> 
> Thinking differently now...  I really want it to "just work" to tar
> and untar these.  So I'm thinking of simply using the file owner
> as the uid.  So to write a security.ns_capability xattr, you must
> be uid 0 in the inode's namespace, the file must be owned by uid 0,
> and the capabilities in the xattr will be honored for any namespace
> where in that uid_t 0 is root.
> 
> Does that sound overly restrictive?  I expect file capabilities to
> be used on files owned by root but not setuid-root, so I think it
> is ok.
> 
> -serge

Here is a working first draft:

From 019ff81124b7dd3161414720f5666f6793a8ccd9 Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serge.hallyn@ubuntu.com>
Date: Tue, 1 Mar 2016 00:09:35 +0000
Subject: [PATCH 1/1] simplified security.nscapability xattr

This can only be set by root in his own namespace, and will
only be respected by namespaces with that same root kuid
mapped as root.  The file must be owned by the root user in
the container.  This allows us to avoid having to store a
'root user' value in the capability.

This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 include/linux/capability.h      |  5 ++-
 include/uapi/linux/capability.h | 18 +++++++++
 include/uapi/linux/xattr.h      |  3 ++
 security/commoncap.c            | 81 +++++++++++++++++++++++++++++++++++++++--
 4 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index af9f0b9..19a37a9 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
 	kernel_cap_t inheritable;
 };
 
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..f0b4a66 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION     1
+
 #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
 #define VFS_CAP_U32             VFS_CAP_U32_2
 #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
@@ -74,6 +77,21 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+#define VFS_NS_CAP_EFFECTIVE    0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+       __le32 magic_etc;
+       struct {
+               __le32 permitted;    /* Little endian */
+               __le32 inheritable;  /* Little endian */
+       } data[VFS_CAP_U32];
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 6f093f3..735d4c7 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -308,6 +308,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 	if (!inode->i_op->getxattr)
 	       return 0;
 
+	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+	if (error > 0)
+		return 1;
+
 	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
 	if (error <= 0)
 		return 0;
@@ -325,11 +329,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	int ret1, ret2;
 
 	if (!inode->i_op->removexattr)
 	       return 0;
 
-	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+	if (ret1 != 0)
+		return ret1;
+	return ret2;
 }
 
 /*
@@ -433,6 +443,55 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	unsigned tocopy, i;
+	u32 magic_etc;
+	ssize_t size;
+	struct vfs_ns_cap_data nscap;
+
+	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+	if (!inode || !inode->i_op->getxattr)
+		return -ENODATA;
+
+	/* verify that userns root owns this file */
+	if (from_kuid(current_user_ns(), dentry->d_inode->i_uid) != 0)
+		return -ENODATA;
+
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+			&nscap, sizeof(nscap));
+	if (size == -ENODATA || size == -EOPNOTSUPP)
+		/* no data, that's ok */
+		return -ENODATA;
+	if (size < 0)
+		return size;
+	if (size != sizeof(nscap))
+		return -EINVAL;
+
+	magic_etc = le32_to_cpu(nscap.magic_etc);
+
+	if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
+		return -EINVAL;
+
+	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+	if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
+		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+	/* copy the entry */
+	CAP_FOR_EACH_U32(i) {
+		if (i >= VFS_CAP_U32_2)
+			break;
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
+	}
+
+	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+	return 0;
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -453,11 +512,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (!current_in_userns(bprm->file->f_path.mnt->mnt_sb->s_user_ns))
 		return 0;
 
-	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	if (rc == -ENODATA)
+		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -661,6 +722,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !ns_capable(user_ns, CAP_SYS_ADMIN))
@@ -689,6 +756,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !ns_capable(user_ns, CAP_SYS_ADMIN))
-- 
2.7.0

_______________________________________________
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-03-02  0:00 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-30 22:43 [PATCH RFC] Introduce new security.nscapability xattr Serge E. Hallyn
2015-11-30 23:08 ` Eric W. Biederman
2015-11-30 23:08   ` Eric W. Biederman
2015-12-01  3:51   ` Serge E. Hallyn
2015-12-01  3:51     ` Serge E. Hallyn
2015-12-04 20:21   ` Serge E. Hallyn
2016-01-19  7:09     ` Serge E. Hallyn
2016-01-20 12:48     ` Jann Horn
2016-01-27 16:08       ` Serge E. Hallyn
2016-01-27 17:22         ` Jann Horn
2016-01-28  0:36           ` Andy Lutomirski
2016-01-28  0:36             ` Andy Lutomirski
2016-01-29  7:31             ` Serge E. Hallyn
2016-01-29  7:31               ` Serge E. Hallyn
2016-02-29 21:38               ` Serge E. Hallyn
2016-02-29 21:38                 ` Serge E. Hallyn
2016-03-02  0:00                 ` Serge E. Hallyn
2016-03-02  0:00                   ` Serge E. Hallyn
2016-01-20 12:14 ` Jann Horn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.