linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* namespaced file capabilities
@ 2016-04-22 17:26 serge.hallyn
  2016-04-22 17:26 ` [PATCH 1/1] simplified security.nscapability xattr serge.hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: serge.hallyn @ 2016-04-22 17:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, containers, ebiederm, morgan, luto, keescook, jmorris

Hi,

I've sent a few patches and emails over the past months about supporting
file capabilities in user namespace confined containers.  A few of the
requirements as I see them are:

1. Root in a user namespace should be able to set file capabilities on a binary
for use by any user mapped into his namespace.

2. Any uid not mapped into the user namespace whose root user set file
capabilities should not gain privileges when running an executable which only
has file capabilities set by this root user.

3. Existing calls to cap_set_file(3) and cap_get_file(3) as well as
setcap(8) and getcap(8) should transparently work.  This would allow
package managers to simply set file capabilities in postinst.

Below is a kernel patch which implements a new security.nscapability
extended attribute.  Setting this xattr on a file requires cap_setfcap
against the current user namespace, and for the file to be owned by
a uid and gid mapped into that namespace.  When found on a file,
the capabilities will take effect only if the file is owned by the
root uid in the caller's namespace, or the root uid in any ancestor
namespace.

While this design supports nested namespaces, it does not support
use of file capabilities by users in unrelated namespaces.  So if
the same file is linked into two namespaces N1 and N2 which do not
share the same root kuid, then the only way for N1 and N2 to both
execute the file while respecting security.nscapability is to have
a common ancestor namespace write the capability.  The only reasonable
way we could handle this case would be to use a securityfs interface
to set file capabilities.  The capability.ko module could then
do the work of keeping a list of uid ranges for which file capabilities
should be honored.  I don't think that flexibility is really called
for.
 
The kernel patch follows, and can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/log/?h=2016-04-22/nsfscaps

The libcap patch can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/libcap.git/log/?h=2016-04-22/nscaps

Comments/conversation/suggestions greatly appreciated.

thanks,
-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/1] simplified security.nscapability xattr
  2016-04-22 17:26 namespaced file capabilities serge.hallyn
@ 2016-04-22 17:26 ` serge.hallyn
  2016-04-26 19:46   ` Seth Forshee
  2016-04-26 21:59   ` Kees Cook
  0 siblings, 2 replies; 33+ messages in thread
From: serge.hallyn @ 2016-04-22 17:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-api, containers, ebiederm, morgan, luto, keescook, jmorris,
	Serge Hallyn

From: Serge Hallyn <serge.hallyn@ubuntu.com>

This can only be set by root in his own namespace, and will
only be respected by namespaces with that same root kuid
mapped as root, or namespaces descended from it.

This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 include/linux/capability.h      |    5 ++-
 include/uapi/linux/capability.h |   18 ++++++++
 include/uapi/linux/xattr.h      |    3 ++
 security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 112 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 00690ff..cf533ff 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
 	kernel_cap_t inheritable;
 };
 
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..f0b4a66 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION     1
+
 #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
 #define VFS_CAP_U32             VFS_CAP_U32_2
 #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
@@ -74,6 +77,21 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+#define VFS_NS_CAP_EFFECTIVE    0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+       __le32 magic_etc;
+       struct {
+               __le32 permitted;    /* Little endian */
+               __le32 inheritable;  /* Little endian */
+       } data[VFS_CAP_U32];
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 48071ed..8f3f34a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 	if (!inode->i_op->getxattr)
 	       return 0;
 
+	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+	if (error > 0)
+		return 1;
+
 	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
 	if (error <= 0)
 		return 0;
@@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	int ret1, ret2;
 
 	if (!inode->i_op->removexattr)
 	       return 0;
 
-	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+	if (ret1 != 0)
+		return ret1;
+	return ret2;
 }
 
 /*
@@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	unsigned i;
+	u32 magic_etc;
+	ssize_t size;
+	struct vfs_ns_cap_data nscap;
+	bool foundroot = false;
+	struct user_namespace *ns;
+
+	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+	if (!inode || !inode->i_op->getxattr)
+		return -ENODATA;
+
+	/* verify that current or ancestor userns root owns this file */
+	for (ns = current_user_ns(); ; ns = ns->parent) {
+		if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
+			foundroot = true;
+			break;
+		}
+		if (ns == &init_user_ns)
+			break;
+	}
+	if (!foundroot)
+		return -ENODATA;
+
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+			&nscap, sizeof(nscap));
+	if (size == -ENODATA || size == -EOPNOTSUPP)
+		/* no data, that's ok */
+		return -ENODATA;
+	if (size < 0)
+		return size;
+	if (size != sizeof(nscap))
+		return -EINVAL;
+
+	magic_etc = le32_to_cpu(nscap.magic_etc);
+
+	if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
+		return -EINVAL;
+
+	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+	if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
+		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+	/* copy the entry */
+	CAP_FOR_EACH_U32(i) {
+		if (i >= VFS_CAP_U32_2)
+			break;
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
+	}
+
+	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+	return 0;
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
 		return 0;
 
-	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	if (rc == -ENODATA)
+		rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !capable(CAP_SYS_ADMIN))
@@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !capable(CAP_SYS_ADMIN))
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-22 17:26 ` [PATCH 1/1] simplified security.nscapability xattr serge.hallyn
@ 2016-04-26 19:46   ` Seth Forshee
  2016-04-26 21:59   ` Kees Cook
  1 sibling, 0 replies; 33+ messages in thread
From: Seth Forshee @ 2016-04-26 19:46 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, linux-api, containers, ebiederm, morgan, luto,
	keescook, jmorris

On Fri, Apr 22, 2016 at 12:26:33PM -0500, serge.hallyn@ubuntu.com wrote:
> From: Serge Hallyn <serge.hallyn@ubuntu.com>
> 
> This can only be set by root in his own namespace, and will
> only be respected by namespaces with that same root kuid
> mapped as root, or namespaces descended from it.
> 
> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.
> 
> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>

Seems like the simplest possible design which meets the requirements.

Acked-by: Seth Forshee <seth.forshee@canonical.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-22 17:26 ` [PATCH 1/1] simplified security.nscapability xattr serge.hallyn
  2016-04-26 19:46   ` Seth Forshee
@ 2016-04-26 21:59   ` Kees Cook
  2016-04-26 22:26     ` Serge E. Hallyn
  1 sibling, 1 reply; 33+ messages in thread
From: Kees Cook @ 2016-04-26 21:59 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Linux API, Linux Containers, Eric W. Biederman,
	Andrew G. Morgan, Andy Lutomirski, James Morris

On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> From: Serge Hallyn <serge.hallyn@ubuntu.com>
>
> This can only be set by root in his own namespace, and will
> only be respected by namespaces with that same root kuid
> mapped as root, or namespaces descended from it.
>
> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.

The concept seems sensible to me. Various notes below...

>
> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> ---
>  include/linux/capability.h      |    5 ++-
>  include/uapi/linux/capability.h |   18 ++++++++
>  include/uapi/linux/xattr.h      |    3 ++
>  security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
>  4 files changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 00690ff..cf533ff 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,7 +13,7 @@
>  #define _LINUX_CAPABILITY_H
>
>  #include <uapi/linux/capability.h>
> -
> +#include <linux/uidgid.h>
>
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
>         kernel_cap_t inheritable;
>  };
>
> +#define NS_CAPS_VERSION(x) (x & 0xFF)
> +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> +
>  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
>  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
>
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 12c37a1..f0b4a66 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
>  #define VFS_CAP_U32_2           2
>  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>
> +/* version number for security.nscapability xattrs hdr->hdr_info */
> +#define VFS_NS_CAP_REVISION     1
> +
>  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
>  #define VFS_CAP_U32             VFS_CAP_U32_2
>  #define VFS_CAP_REVISION       VFS_CAP_REVISION_2
> @@ -74,6 +77,21 @@ struct vfs_cap_data {
>         } data[VFS_CAP_U32];
>  };
>
> +#define VFS_NS_CAP_EFFECTIVE    0x1
> +/*
> + * 32-bit hdr_info contains
> + * 16 leftmost: reserved
> + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> + * last 8: version
> + */
> +struct vfs_ns_cap_data {
> +       __le32 magic_etc;
> +       struct {
> +               __le32 permitted;    /* Little endian */
> +               __le32 inheritable;  /* Little endian */
> +       } data[VFS_CAP_U32];
> +};

This is identical to vfs_cap_data. Is there a reason not to re-use the
existing one?

> +
>  #ifndef __KERNEL__
>
>  /*
> diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> index 1590c49..67c80ab 100644
> --- a/include/uapi/linux/xattr.h
> +++ b/include/uapi/linux/xattr.h
> @@ -68,6 +68,9 @@
>  #define XATTR_CAPS_SUFFIX "capability"
>  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>
> +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> +
>  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
>  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
>  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"

Are these documented anywhere in Documentation/ or in man pages? This
seems like it'd need a man-page update too.

> diff --git a/security/commoncap.c b/security/commoncap.c
> index 48071ed..8f3f34a 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>         if (!inode->i_op->getxattr)
>                return 0;
>
> +       error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> +       if (error > 0)
> +               return 1;
> +
>         error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
>         if (error <= 0)
>                 return 0;

I think this might be more readable if the getxattr calls were
standardized (one returns 1, the other returns 0, with inverted tests
-- hard to read). IIUC, if either returns > 0, return 1, otherwise
return 0.

> @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>  int cap_inode_killpriv(struct dentry *dentry)
>  {
>         struct inode *inode = d_backing_inode(dentry);
> +       int ret1, ret2;
>
>         if (!inode->i_op->removexattr)
>                return 0;
>
> -       return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +       ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> +       ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> +
> +       if (ret1 != 0)
> +               return ret1;
> +       return ret2;
>  }
>
>  /*
> @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>         return 0;
>  }
>
> +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> +{
> +       struct inode *inode = d_backing_inode(dentry);
> +       unsigned i;
> +       u32 magic_etc;
> +       ssize_t size;
> +       struct vfs_ns_cap_data nscap;
> +       bool foundroot = false;
> +       struct user_namespace *ns;
> +
> +       memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> +
> +       if (!inode || !inode->i_op->getxattr)
> +               return -ENODATA;
> +
> +       /* verify that current or ancestor userns root owns this file */
> +       for (ns = current_user_ns(); ; ns = ns->parent) {
> +               if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> +                       foundroot = true;
> +                       break;
> +               }
> +               if (ns == &init_user_ns)
> +                       break;
> +       }
> +       if (!foundroot)
> +               return -ENODATA;
> +
> +       size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> +                       &nscap, sizeof(nscap));
> +       if (size == -ENODATA || size == -EOPNOTSUPP)
> +               /* no data, that's ok */
> +               return -ENODATA;
> +       if (size < 0)
> +               return size;
> +       if (size != sizeof(nscap))
> +               return -EINVAL;
> +
> +       magic_etc = le32_to_cpu(nscap.magic_etc);
> +
> +       if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> +               return -EINVAL;
> +
> +       cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> +       if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> +               cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> +       /* copy the entry */
> +       CAP_FOR_EACH_U32(i) {
> +               if (i >= VFS_CAP_U32_2)
> +                       break;
> +               cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> +               cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> +       }
> +
> +       cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +       cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +
> +       return 0;
> +}
> +
>  /*
>   * Attempt to get the on-exec apply capability sets for an executable file from
>   * its xattrs and, if present, apply them to the proposed credentials being
> @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>         if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
>                 return 0;
>
> -       rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +       rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> +       if (rc == -ENODATA)
> +               rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);

So nscaps overrides a "regular" file cap? That might seem worth
mentioning in the change log or somewhere.

>         if (rc < 0) {
>                 if (rc == -EINVAL)
> -                       printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> -                               __func__, rc, bprm->filename);
> +                       printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> +                                       bprm->filename);
>                 else if (rc == -ENODATA)
>                         rc = 0;
>                 goto out;
> @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>                 return 0;
>         }
>
> +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> +                       return -EPERM;
> +               return 0;
> +       }
> +
>         if (!strncmp(name, XATTR_SECURITY_PREFIX,
>                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>             !capable(CAP_SYS_ADMIN))
> @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
>                 return 0;
>         }
>
> +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> +                       return -EPERM;
> +               return 0;
> +       }
> +

This looks like userspace must knowingly be aware that it is in a
namespace and to DTRT instead of it being translated by the kernel
when setxattr is called under !init_user_ns?

>         if (!strncmp(name, XATTR_SECURITY_PREFIX,
>                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>             !capable(CAP_SYS_ADMIN))
> --
> 1.7.9.5
>



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-26 21:59   ` Kees Cook
@ 2016-04-26 22:26     ` Serge E. Hallyn
  2016-04-26 22:39       ` Kees Cook
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-04-26 22:26 UTC (permalink / raw)
  To: Kees Cook
  Cc: Serge E. Hallyn, Linux API, Linux Containers, LKML,
	Andy Lutomirski, Eric W. Biederman, Andrew G. Morgan

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> >
> > This can only be set by root in his own namespace, and will
> > only be respected by namespaces with that same root kuid
> > mapped as root, or namespaces descended from it.
> >
> > This allows a simple setxattr to work, allows tar/untar to
> > work, and allows us to tar in one namespace and untar in
> > another while preserving the capability, without risking
> > leaking privilege into a parent namespace.
> 
> The concept seems sensible to me. Various notes below...
> 
> >
> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> > ---
> >  include/linux/capability.h      |    5 ++-
> >  include/uapi/linux/capability.h |   18 ++++++++
> >  include/uapi/linux/xattr.h      |    3 ++
> >  security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
> >  4 files changed, 112 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> > index 00690ff..cf533ff 100644
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -13,7 +13,7 @@
> >  #define _LINUX_CAPABILITY_H
> >
> >  #include <uapi/linux/capability.h>
> > -
> > +#include <linux/uidgid.h>
> >
> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> >         kernel_cap_t inheritable;
> >  };
> >
> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> > +
> >  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
> >  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
> >
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index 12c37a1..f0b4a66 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> >  #define VFS_CAP_U32_2           2
> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >
> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> > +#define VFS_NS_CAP_REVISION     1
> > +
> >  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> >  #define VFS_CAP_U32             VFS_CAP_U32_2
> >  #define VFS_CAP_REVISION       VFS_CAP_REVISION_2
> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> >         } data[VFS_CAP_U32];
> >  };
> >
> > +#define VFS_NS_CAP_EFFECTIVE    0x1
> > +/*
> > + * 32-bit hdr_info contains
> > + * 16 leftmost: reserved
> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> > + * last 8: version
> > + */
> > +struct vfs_ns_cap_data {
> > +       __le32 magic_etc;
> > +       struct {
> > +               __le32 permitted;    /* Little endian */
> > +               __le32 inheritable;  /* Little endian */
> > +       } data[VFS_CAP_U32];
> > +};
> 
> This is identical to vfs_cap_data. Is there a reason not to re-use the
> existing one?

Hm.  I used to have them completely different.  ATM the only difference
is what goes into the magic_etc, and that not very (different).  So
yeah it probably should be re-used.

> > +
> >  #ifndef __KERNEL__
> >
> >  /*
> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> > index 1590c49..67c80ab 100644
> > --- a/include/uapi/linux/xattr.h
> > +++ b/include/uapi/linux/xattr.h
> > @@ -68,6 +68,9 @@
> >  #define XATTR_CAPS_SUFFIX "capability"
> >  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >
> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> > +
> >  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
> >  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> >  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
> 
> Are these documented anywhere in Documentation/ or in man pages? This
> seems like it'd need a man-page update too.

Yeah, if we decide we're ok with this strategy I'll update those.

> > diff --git a/security/commoncap.c b/security/commoncap.c
> > index 48071ed..8f3f34a 100644
> > --- a/security/commoncap.c
> > +++ b/security/commoncap.c
> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >         if (!inode->i_op->getxattr)
> >                return 0;
> >
> > +       error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> > +       if (error > 0)
> > +               return 1;
> > +
> >         error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> >         if (error <= 0)
> >                 return 0;
> 
> I think this might be more readable if the getxattr calls were
> standardized (one returns 1, the other returns 0, with inverted tests
> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> return 0.

Hm.  Yeah I can see where that would be confusing.  I can change the flow
to make the checks the same.

> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >  int cap_inode_killpriv(struct dentry *dentry)
> >  {
> >         struct inode *inode = d_backing_inode(dentry);
> > +       int ret1, ret2;
> >
> >         if (!inode->i_op->removexattr)
> >                return 0;
> >
> > -       return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> > +       ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> > +       ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> > +
> > +       if (ret1 != 0)
> > +               return ret1;
> > +       return ret2;
> >  }
> >
> >  /*
> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >         return 0;
> >  }
> >
> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> > +{
> > +       struct inode *inode = d_backing_inode(dentry);
> > +       unsigned i;
> > +       u32 magic_etc;
> > +       ssize_t size;
> > +       struct vfs_ns_cap_data nscap;
> > +       bool foundroot = false;
> > +       struct user_namespace *ns;
> > +
> > +       memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> > +
> > +       if (!inode || !inode->i_op->getxattr)
> > +               return -ENODATA;
> > +
> > +       /* verify that current or ancestor userns root owns this file */
> > +       for (ns = current_user_ns(); ; ns = ns->parent) {
> > +               if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> > +                       foundroot = true;
> > +                       break;
> > +               }
> > +               if (ns == &init_user_ns)
> > +                       break;
> > +       }
> > +       if (!foundroot)
> > +               return -ENODATA;
> > +
> > +       size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> > +                       &nscap, sizeof(nscap));
> > +       if (size == -ENODATA || size == -EOPNOTSUPP)
> > +               /* no data, that's ok */
> > +               return -ENODATA;
> > +       if (size < 0)
> > +               return size;
> > +       if (size != sizeof(nscap))
> > +               return -EINVAL;
> > +
> > +       magic_etc = le32_to_cpu(nscap.magic_etc);
> > +
> > +       if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> > +               return -EINVAL;
> > +
> > +       cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> > +       if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> > +               cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> > +       /* copy the entry */
> > +       CAP_FOR_EACH_U32(i) {
> > +               if (i >= VFS_CAP_U32_2)
> > +                       break;
> > +               cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> > +               cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> > +       }
> > +
> > +       cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> > +       cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> > +
> > +       return 0;
> > +}
> > +
> >  /*
> >   * Attempt to get the on-exec apply capability sets for an executable file from
> >   * its xattrs and, if present, apply them to the proposed credentials being
> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >         if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> >                 return 0;
> >
> > -       rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> > +       rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> > +       if (rc == -ENODATA)
> > +               rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> 
> So nscaps overrides a "regular" file cap? That might seem worth
> mentioning in the change log or somewhere.

If it applies, yes.  Now maybe that's not what we want.  Maybe so long as
a regular one exists (which must have been set by the init_user_ns admin)
we should use it?  I think that makes sense, what do you think?

(In either case agreed it should be clearly documented)

> >         if (rc < 0) {
> >                 if (rc == -EINVAL)
> > -                       printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> > -                               __func__, rc, bprm->filename);
> > +                       printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> > +                                       bprm->filename);
> >                 else if (rc == -ENODATA)
> >                         rc = 0;
> >                 goto out;
> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >                 return 0;
> >         }
> >
> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> > +                       return -EPERM;
> > +               return 0;
> > +       }
> > +
> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >             !capable(CAP_SYS_ADMIN))
> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >                 return 0;
> >         }
> >
> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> > +                       return -EPERM;
> > +               return 0;
> > +       }
> > +
> 
> This looks like userspace must knowingly be aware that it is in a
> namespace and to DTRT instead of it being translated by the kernel
> when setxattr is called under !init_user_ns?

Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
shows you are in init_user_ns then it uses security.capability, otherwise
it uses security.nscapability.

I've occasionally considered having the xattr code do the quiet
substitution if need be.

In fact, much of this structure comes from when I was still trying to
do multiple values per xattr.  Given what we're doing here, we could
keep the xattr contents exactly the same, just changing the name.
So userspace could just get and set security.capability;  if you are
in a non-init user_ns, if security.capability is set then you cannot
set it;  if security.capability is not set, then the kernel writes
security.nscapability instead and returns success.

I don't like magic, but this might be just straightforward enough
to not be offensive.  Thoughts?

> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >             !capable(CAP_SYS_ADMIN))
> > --
> > 1.7.9.5
> >
> 
> 
> 
> -- 
> Kees Cook
> Chrome OS & Brillo Security
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-26 22:26     ` Serge E. Hallyn
@ 2016-04-26 22:39       ` Kees Cook
  2016-04-27  4:39         ` Serge E. Hallyn
                           ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Kees Cook @ 2016-04-26 22:39 UTC (permalink / raw)
  To: Serge E. Hallyn, Michael Kerrisk-manpages, Eric W. Biederman, Jann Horn
  Cc: Serge E. Hallyn, Linux API, Linux Containers, LKML,
	Andy Lutomirski, Andrew G. Morgan

On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Kees Cook (keescook@chromium.org):
>> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
>> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
>> >
>> > This can only be set by root in his own namespace, and will
>> > only be respected by namespaces with that same root kuid
>> > mapped as root, or namespaces descended from it.
>> >
>> > This allows a simple setxattr to work, allows tar/untar to
>> > work, and allows us to tar in one namespace and untar in
>> > another while preserving the capability, without risking
>> > leaking privilege into a parent namespace.
>>
>> The concept seems sensible to me. Various notes below...
>>
>> >
>> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
>> > ---
>> >  include/linux/capability.h      |    5 ++-
>> >  include/uapi/linux/capability.h |   18 ++++++++
>> >  include/uapi/linux/xattr.h      |    3 ++
>> >  security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
>> >  4 files changed, 112 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/include/linux/capability.h b/include/linux/capability.h
>> > index 00690ff..cf533ff 100644
>> > --- a/include/linux/capability.h
>> > +++ b/include/linux/capability.h
>> > @@ -13,7 +13,7 @@
>> >  #define _LINUX_CAPABILITY_H
>> >
>> >  #include <uapi/linux/capability.h>
>> > -
>> > +#include <linux/uidgid.h>
>> >
>> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
>> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
>> >         kernel_cap_t inheritable;
>> >  };
>> >
>> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
>> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
>> > +
>> >  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
>> >  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
>> >
>> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
>> > index 12c37a1..f0b4a66 100644
>> > --- a/include/uapi/linux/capability.h
>> > +++ b/include/uapi/linux/capability.h
>> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
>> >  #define VFS_CAP_U32_2           2
>> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>> >
>> > +/* version number for security.nscapability xattrs hdr->hdr_info */
>> > +#define VFS_NS_CAP_REVISION     1
>> > +
>> >  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
>> >  #define VFS_CAP_U32             VFS_CAP_U32_2
>> >  #define VFS_CAP_REVISION       VFS_CAP_REVISION_2
>> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
>> >         } data[VFS_CAP_U32];
>> >  };
>> >
>> > +#define VFS_NS_CAP_EFFECTIVE    0x1
>> > +/*
>> > + * 32-bit hdr_info contains
>> > + * 16 leftmost: reserved
>> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
>> > + * last 8: version
>> > + */
>> > +struct vfs_ns_cap_data {
>> > +       __le32 magic_etc;
>> > +       struct {
>> > +               __le32 permitted;    /* Little endian */
>> > +               __le32 inheritable;  /* Little endian */
>> > +       } data[VFS_CAP_U32];
>> > +};
>>
>> This is identical to vfs_cap_data. Is there a reason not to re-use the
>> existing one?
>
> Hm.  I used to have them completely different.  ATM the only difference
> is what goes into the magic_etc, and that not very (different).  So
> yeah it probably should be re-used.
>
>> > +
>> >  #ifndef __KERNEL__
>> >
>> >  /*
>> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
>> > index 1590c49..67c80ab 100644
>> > --- a/include/uapi/linux/xattr.h
>> > +++ b/include/uapi/linux/xattr.h
>> > @@ -68,6 +68,9 @@
>> >  #define XATTR_CAPS_SUFFIX "capability"
>> >  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>> >
>> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
>> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
>> > +
>> >  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
>> >  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
>> >  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
>>
>> Are these documented anywhere in Documentation/ or in man pages? This
>> seems like it'd need a man-page update too.
>
> Yeah, if we decide we're ok with this strategy I'll update those.
>
>> > diff --git a/security/commoncap.c b/security/commoncap.c
>> > index 48071ed..8f3f34a 100644
>> > --- a/security/commoncap.c
>> > +++ b/security/commoncap.c
>> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>> >         if (!inode->i_op->getxattr)
>> >                return 0;
>> >
>> > +       error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
>> > +       if (error > 0)
>> > +               return 1;
>> > +
>> >         error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
>> >         if (error <= 0)
>> >                 return 0;
>>
>> I think this might be more readable if the getxattr calls were
>> standardized (one returns 1, the other returns 0, with inverted tests
>> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
>> return 0.
>
> Hm.  Yeah I can see where that would be confusing.  I can change the flow
> to make the checks the same.
>
>> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>> >  int cap_inode_killpriv(struct dentry *dentry)
>> >  {
>> >         struct inode *inode = d_backing_inode(dentry);
>> > +       int ret1, ret2;
>> >
>> >         if (!inode->i_op->removexattr)
>> >                return 0;
>> >
>> > -       return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
>> > +       ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
>> > +       ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
>> > +
>> > +       if (ret1 != 0)
>> > +               return ret1;
>> > +       return ret2;
>> >  }
>> >
>> >  /*
>> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>> >         return 0;
>> >  }
>> >
>> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
>> > +{
>> > +       struct inode *inode = d_backing_inode(dentry);
>> > +       unsigned i;
>> > +       u32 magic_etc;
>> > +       ssize_t size;
>> > +       struct vfs_ns_cap_data nscap;
>> > +       bool foundroot = false;
>> > +       struct user_namespace *ns;
>> > +
>> > +       memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>> > +
>> > +       if (!inode || !inode->i_op->getxattr)
>> > +               return -ENODATA;
>> > +
>> > +       /* verify that current or ancestor userns root owns this file */
>> > +       for (ns = current_user_ns(); ; ns = ns->parent) {
>> > +               if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
>> > +                       foundroot = true;
>> > +                       break;
>> > +               }
>> > +               if (ns == &init_user_ns)
>> > +                       break;
>> > +       }
>> > +       if (!foundroot)
>> > +               return -ENODATA;
>> > +
>> > +       size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
>> > +                       &nscap, sizeof(nscap));
>> > +       if (size == -ENODATA || size == -EOPNOTSUPP)
>> > +               /* no data, that's ok */
>> > +               return -ENODATA;
>> > +       if (size < 0)
>> > +               return size;
>> > +       if (size != sizeof(nscap))
>> > +               return -EINVAL;
>> > +
>> > +       magic_etc = le32_to_cpu(nscap.magic_etc);
>> > +
>> > +       if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
>> > +               return -EINVAL;
>> > +
>> > +       cpu_caps->magic_etc = VFS_CAP_REVISION_2;
>> > +       if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
>> > +               cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
>> > +       /* copy the entry */
>> > +       CAP_FOR_EACH_U32(i) {
>> > +               if (i >= VFS_CAP_U32_2)
>> > +                       break;
>> > +               cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
>> > +               cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
>> > +       }
>> > +
>> > +       cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>> > +       cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>> > +
>> > +       return 0;
>> > +}
>> > +
>> >  /*
>> >   * Attempt to get the on-exec apply capability sets for an executable file from
>> >   * its xattrs and, if present, apply them to the proposed credentials being
>> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>> >         if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
>> >                 return 0;
>> >
>> > -       rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>> > +       rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>> > +       if (rc == -ENODATA)
>> > +               rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>>
>> So nscaps overrides a "regular" file cap? That might seem worth
>> mentioning in the change log or somewhere.
>
> If it applies, yes.  Now maybe that's not what we want.  Maybe so long as
> a regular one exists (which must have been set by the init_user_ns admin)
> we should use it?  I think that makes sense, what do you think?

I struggle to see why the non-ns fscap should be honored... this is
effectively fscap stacking, and there's no way that I see for the cap
to leak "up" to init_user_ns. It only leaks up to nested user_nses...
but they need to already be priv-equiv. Hmmm.

I'm CC'ing Jann who like these mind-benders... :)

>
> (In either case agreed it should be clearly documented)
>
>> >         if (rc < 0) {
>> >                 if (rc == -EINVAL)
>> > -                       printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
>> > -                               __func__, rc, bprm->filename);
>> > +                       printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
>> > +                                       bprm->filename);
>> >                 else if (rc == -ENODATA)
>> >                         rc = 0;
>> >                 goto out;
>> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>> >                 return 0;
>> >         }
>> >
>> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
>> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
>> > +                       return -EPERM;
>> > +               return 0;
>> > +       }
>> > +
>> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> >             !capable(CAP_SYS_ADMIN))
>> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
>> >                 return 0;
>> >         }
>> >
>> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
>> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
>> > +                       return -EPERM;
>> > +               return 0;
>> > +       }
>> > +
>>
>> This looks like userspace must knowingly be aware that it is in a
>> namespace and to DTRT instead of it being translated by the kernel
>> when setxattr is called under !init_user_ns?
>
> Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
> shows you are in init_user_ns then it uses security.capability, otherwise
> it uses security.nscapability.
>
> I've occasionally considered having the xattr code do the quiet
> substitution if need be.
>
> In fact, much of this structure comes from when I was still trying to
> do multiple values per xattr.  Given what we're doing here, we could
> keep the xattr contents exactly the same, just changing the name.
> So userspace could just get and set security.capability;  if you are
> in a non-init user_ns, if security.capability is set then you cannot
> set it;  if security.capability is not set, then the kernel writes
> security.nscapability instead and returns success.
>
> I don't like magic, but this might be just straightforward enough
> to not be offensive.  Thoughts?

Yeah, I think it might be better to have the magic in this case, since
it seems weird to just reject setxattr if a tool didn't realize it was
in a namespace. I'm not sure -- it is also nice to have an explicit
API here.

I would defer to Eric or Michael on that. I keep going back and forth,
though I suspect it's probably best to do what you already have
(explicit API).

-Kees

>
>> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> >             !capable(CAP_SYS_ADMIN))
>> > --
>> > 1.7.9.5
>> >
>>
>>
>>
>> --
>> Kees Cook
>> Chrome OS & Brillo Security
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-26 22:39       ` Kees Cook
@ 2016-04-27  4:39         ` Serge E. Hallyn
  2016-04-27  8:09         ` Jann Horn
  2016-05-02  3:54         ` Serge E. Hallyn
  2 siblings, 0 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2016-04-27  4:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Serge E. Hallyn, Michael Kerrisk-manpages, Eric W. Biederman,
	Jann Horn, Serge E. Hallyn, Linux API, Linux Containers, LKML,
	Andy Lutomirski, Andrew G. Morgan

Quoting Kees Cook (keescook@chromium.org):
> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Kees Cook (keescook@chromium.org):
> >> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> >> >
> >> > This can only be set by root in his own namespace, and will
> >> > only be respected by namespaces with that same root kuid
> >> > mapped as root, or namespaces descended from it.
> >> >
> >> > This allows a simple setxattr to work, allows tar/untar to
> >> > work, and allows us to tar in one namespace and untar in
> >> > another while preserving the capability, without risking
> >> > leaking privilege into a parent namespace.
> >>
> >> The concept seems sensible to me. Various notes below...
> >>
> >> >
> >> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> >> > ---
> >> >  include/linux/capability.h      |    5 ++-
> >> >  include/uapi/linux/capability.h |   18 ++++++++
> >> >  include/uapi/linux/xattr.h      |    3 ++
> >> >  security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
> >> >  4 files changed, 112 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> >> > index 00690ff..cf533ff 100644
> >> > --- a/include/linux/capability.h
> >> > +++ b/include/linux/capability.h
> >> > @@ -13,7 +13,7 @@
> >> >  #define _LINUX_CAPABILITY_H
> >> >
> >> >  #include <uapi/linux/capability.h>
> >> > -
> >> > +#include <linux/uidgid.h>
> >> >
> >> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> >> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> >> >         kernel_cap_t inheritable;
> >> >  };
> >> >
> >> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> >> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> >> > +
> >> >  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
> >> >  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
> >> >
> >> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> >> > index 12c37a1..f0b4a66 100644
> >> > --- a/include/uapi/linux/capability.h
> >> > +++ b/include/uapi/linux/capability.h
> >> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> >> >  #define VFS_CAP_U32_2           2
> >> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >> >
> >> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> >> > +#define VFS_NS_CAP_REVISION     1
> >> > +
> >> >  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> >> >  #define VFS_CAP_U32             VFS_CAP_U32_2
> >> >  #define VFS_CAP_REVISION       VFS_CAP_REVISION_2
> >> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> >> >         } data[VFS_CAP_U32];
> >> >  };
> >> >
> >> > +#define VFS_NS_CAP_EFFECTIVE    0x1
> >> > +/*
> >> > + * 32-bit hdr_info contains
> >> > + * 16 leftmost: reserved
> >> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> >> > + * last 8: version
> >> > + */
> >> > +struct vfs_ns_cap_data {
> >> > +       __le32 magic_etc;
> >> > +       struct {
> >> > +               __le32 permitted;    /* Little endian */
> >> > +               __le32 inheritable;  /* Little endian */
> >> > +       } data[VFS_CAP_U32];
> >> > +};
> >>
> >> This is identical to vfs_cap_data. Is there a reason not to re-use the
> >> existing one?
> >
> > Hm.  I used to have them completely different.  ATM the only difference
> > is what goes into the magic_etc, and that not very (different).  So
> > yeah it probably should be re-used.
> >
> >> > +
> >> >  #ifndef __KERNEL__
> >> >
> >> >  /*
> >> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> >> > index 1590c49..67c80ab 100644
> >> > --- a/include/uapi/linux/xattr.h
> >> > +++ b/include/uapi/linux/xattr.h
> >> > @@ -68,6 +68,9 @@
> >> >  #define XATTR_CAPS_SUFFIX "capability"
> >> >  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >> >
> >> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> >> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> >> > +
> >> >  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
> >> >  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> >> >  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
> >>
> >> Are these documented anywhere in Documentation/ or in man pages? This
> >> seems like it'd need a man-page update too.
> >
> > Yeah, if we decide we're ok with this strategy I'll update those.
> >
> >> > diff --git a/security/commoncap.c b/security/commoncap.c
> >> > index 48071ed..8f3f34a 100644
> >> > --- a/security/commoncap.c
> >> > +++ b/security/commoncap.c
> >> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> >         if (!inode->i_op->getxattr)
> >> >                return 0;
> >> >
> >> > +       error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> >> > +       if (error > 0)
> >> > +               return 1;
> >> > +
> >> >         error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> >> >         if (error <= 0)
> >> >                 return 0;
> >>
> >> I think this might be more readable if the getxattr calls were
> >> standardized (one returns 1, the other returns 0, with inverted tests
> >> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> >> return 0.
> >
> > Hm.  Yeah I can see where that would be confusing.  I can change the flow
> > to make the checks the same.
> >
> >> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> >  int cap_inode_killpriv(struct dentry *dentry)
> >> >  {
> >> >         struct inode *inode = d_backing_inode(dentry);
> >> > +       int ret1, ret2;
> >> >
> >> >         if (!inode->i_op->removexattr)
> >> >                return 0;
> >> >
> >> > -       return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > +       ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > +       ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> >> > +
> >> > +       if (ret1 != 0)
> >> > +               return ret1;
> >> > +       return ret2;
> >> >  }
> >> >
> >> >  /*
> >> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >> >         return 0;
> >> >  }
> >> >
> >> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> >> > +{
> >> > +       struct inode *inode = d_backing_inode(dentry);
> >> > +       unsigned i;
> >> > +       u32 magic_etc;
> >> > +       ssize_t size;
> >> > +       struct vfs_ns_cap_data nscap;
> >> > +       bool foundroot = false;
> >> > +       struct user_namespace *ns;
> >> > +
> >> > +       memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >> > +
> >> > +       if (!inode || !inode->i_op->getxattr)
> >> > +               return -ENODATA;
> >> > +
> >> > +       /* verify that current or ancestor userns root owns this file */
> >> > +       for (ns = current_user_ns(); ; ns = ns->parent) {
> >> > +               if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> >> > +                       foundroot = true;
> >> > +                       break;
> >> > +               }
> >> > +               if (ns == &init_user_ns)
> >> > +                       break;
> >> > +       }
> >> > +       if (!foundroot)
> >> > +               return -ENODATA;
> >> > +
> >> > +       size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> >> > +                       &nscap, sizeof(nscap));
> >> > +       if (size == -ENODATA || size == -EOPNOTSUPP)
> >> > +               /* no data, that's ok */
> >> > +               return -ENODATA;
> >> > +       if (size < 0)
> >> > +               return size;
> >> > +       if (size != sizeof(nscap))
> >> > +               return -EINVAL;
> >> > +
> >> > +       magic_etc = le32_to_cpu(nscap.magic_etc);
> >> > +
> >> > +       if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> >> > +               return -EINVAL;
> >> > +
> >> > +       cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> >> > +       if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> >> > +               cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> >> > +       /* copy the entry */
> >> > +       CAP_FOR_EACH_U32(i) {
> >> > +               if (i >= VFS_CAP_U32_2)
> >> > +                       break;
> >> > +               cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> >> > +               cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> >> > +       }
> >> > +
> >> > +       cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +       cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +
> >> > +       return 0;
> >> > +}
> >> > +
> >> >  /*
> >> >   * Attempt to get the on-exec apply capability sets for an executable file from
> >> >   * its xattrs and, if present, apply them to the proposed credentials being
> >> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >> >         if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> >> >                 return 0;
> >> >
> >> > -       rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > +       rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > +       if (rc == -ENODATA)
> >> > +               rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >>
> >> So nscaps overrides a "regular" file cap? That might seem worth
> >> mentioning in the change log or somewhere.
> >
> > If it applies, yes.  Now maybe that's not what we want.  Maybe so long as
> > a regular one exists (which must have been set by the init_user_ns admin)
> > we should use it?  I think that makes sense, what do you think?
> 
> I struggle to see why the non-ns fscap should be honored... this is

Mainly because that's how it is right now - if there is a security.capability
then it is honored in all namespaces.  Changing that *could* change the
behavior which some poor schlep is depending on.  And in general we don't
do that.

> effectively fscap stacking, and there's no way that I see for the cap
> to leak "up" to init_user_ns. It only leaks up to nested user_nses...
> but they need to already be priv-equiv. Hmmm.
> 
> I'm CC'ing Jann who like these mind-benders... :)
> 
> >
> > (In either case agreed it should be clearly documented)
> >
> >> >         if (rc < 0) {
> >> >                 if (rc == -EINVAL)
> >> > -                       printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> >> > -                               __func__, rc, bprm->filename);
> >> > +                       printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> >> > +                                       bprm->filename);
> >> >                 else if (rc == -ENODATA)
> >> >                         rc = 0;
> >> >                 goto out;
> >> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >> >                 return 0;
> >> >         }
> >> >
> >> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > +                       return -EPERM;
> >> > +               return 0;
> >> > +       }
> >> > +
> >> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> >             !capable(CAP_SYS_ADMIN))
> >> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >> >                 return 0;
> >> >         }
> >> >
> >> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > +                       return -EPERM;
> >> > +               return 0;
> >> > +       }
> >> > +
> >>
> >> This looks like userspace must knowingly be aware that it is in a
> >> namespace and to DTRT instead of it being translated by the kernel
> >> when setxattr is called under !init_user_ns?
> >
> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
> > shows you are in init_user_ns then it uses security.capability, otherwise
> > it uses security.nscapability.
> >
> > I've occasionally considered having the xattr code do the quiet
> > substitution if need be.
> >
> > In fact, much of this structure comes from when I was still trying to
> > do multiple values per xattr.  Given what we're doing here, we could
> > keep the xattr contents exactly the same, just changing the name.
> > So userspace could just get and set security.capability;  if you are
> > in a non-init user_ns, if security.capability is set then you cannot
> > set it;  if security.capability is not set, then the kernel writes
> > security.nscapability instead and returns success.
> >
> > I don't like magic, but this might be just straightforward enough
> > to not be offensive.  Thoughts?
> 
> Yeah, I think it might be better to have the magic in this case, since
> it seems weird to just reject setxattr if a tool didn't realize it was
> in a namespace. I'm not sure -- it is also nice to have an explicit
> API here.
> 
> I would defer to Eric or Michael on that. I keep going back and forth,
> though I suspect it's probably best to do what you already have
> (explicit API).
> 
> -Kees
> 
> >
> >> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> >             !capable(CAP_SYS_ADMIN))
> >> > --
> >> > 1.7.9.5
> >> >
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Chrome OS & Brillo Security
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@lists.linux-foundation.org
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
> 
> 
> 
> -- 
> Kees Cook
> Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-26 22:39       ` Kees Cook
  2016-04-27  4:39         ` Serge E. Hallyn
@ 2016-04-27  8:09         ` Jann Horn
  2016-05-02  3:54         ` Serge E. Hallyn
  2 siblings, 0 replies; 33+ messages in thread
From: Jann Horn @ 2016-04-27  8:09 UTC (permalink / raw)
  To: Kees Cook
  Cc: Serge E. Hallyn, Michael Kerrisk-manpages, Eric W. Biederman,
	Serge E. Hallyn, Linux API, Linux Containers, LKML,
	Andy Lutomirski, Andrew G. Morgan

On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Kees Cook (keescook@chromium.org):
> >> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> >> >
> >> > This can only be set by root in his own namespace, and will
> >> > only be respected by namespaces with that same root kuid
> >> > mapped as root, or namespaces descended from it.
> >> >
> >> > This allows a simple setxattr to work, allows tar/untar to
> >> > work, and allows us to tar in one namespace and untar in
> >> > another while preserving the capability, without risking
> >> > leaking privilege into a parent namespace.
> >>
> >> The concept seems sensible to me. Various notes below...
> >>
> >> >
> >> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> >> > ---
> >> >  include/linux/capability.h      |    5 ++-
> >> >  include/uapi/linux/capability.h |   18 ++++++++
> >> >  include/uapi/linux/xattr.h      |    3 ++
> >> >  security/commoncap.c            |   91 +++++++++++++++++++++++++++++++++++++--
> >> >  4 files changed, 112 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> >> > index 00690ff..cf533ff 100644
> >> > --- a/include/linux/capability.h
> >> > +++ b/include/linux/capability.h
> >> > @@ -13,7 +13,7 @@
> >> >  #define _LINUX_CAPABILITY_H
> >> >
> >> >  #include <uapi/linux/capability.h>
> >> > -
> >> > +#include <linux/uidgid.h>
> >> >
> >> >  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >> >  #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
> >> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> >> >         kernel_cap_t inheritable;
> >> >  };
> >> >
> >> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> >> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> >> > +
> >> >  #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
> >> >  #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
> >> >
> >> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> >> > index 12c37a1..f0b4a66 100644
> >> > --- a/include/uapi/linux/capability.h
> >> > +++ b/include/uapi/linux/capability.h
> >> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> >> >  #define VFS_CAP_U32_2           2
> >> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >> >
> >> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> >> > +#define VFS_NS_CAP_REVISION     1
> >> > +
> >> >  #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> >> >  #define VFS_CAP_U32             VFS_CAP_U32_2
> >> >  #define VFS_CAP_REVISION       VFS_CAP_REVISION_2
> >> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> >> >         } data[VFS_CAP_U32];
> >> >  };
> >> >
> >> > +#define VFS_NS_CAP_EFFECTIVE    0x1
> >> > +/*
> >> > + * 32-bit hdr_info contains
> >> > + * 16 leftmost: reserved
> >> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> >> > + * last 8: version
> >> > + */
> >> > +struct vfs_ns_cap_data {
> >> > +       __le32 magic_etc;
> >> > +       struct {
> >> > +               __le32 permitted;    /* Little endian */
> >> > +               __le32 inheritable;  /* Little endian */
> >> > +       } data[VFS_CAP_U32];
> >> > +};
> >>
> >> This is identical to vfs_cap_data. Is there a reason not to re-use the
> >> existing one?
> >
> > Hm.  I used to have them completely different.  ATM the only difference
> > is what goes into the magic_etc, and that not very (different).  So
> > yeah it probably should be re-used.
> >
> >> > +
> >> >  #ifndef __KERNEL__
> >> >
> >> >  /*
> >> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> >> > index 1590c49..67c80ab 100644
> >> > --- a/include/uapi/linux/xattr.h
> >> > +++ b/include/uapi/linux/xattr.h
> >> > @@ -68,6 +68,9 @@
> >> >  #define XATTR_CAPS_SUFFIX "capability"
> >> >  #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >> >
> >> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> >> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> >> > +
> >> >  #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
> >> >  #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> >> >  #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
> >>
> >> Are these documented anywhere in Documentation/ or in man pages? This
> >> seems like it'd need a man-page update too.
> >
> > Yeah, if we decide we're ok with this strategy I'll update those.
> >
> >> > diff --git a/security/commoncap.c b/security/commoncap.c
> >> > index 48071ed..8f3f34a 100644
> >> > --- a/security/commoncap.c
> >> > +++ b/security/commoncap.c
> >> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> >         if (!inode->i_op->getxattr)
> >> >                return 0;
> >> >
> >> > +       error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> >> > +       if (error > 0)
> >> > +               return 1;
> >> > +
> >> >         error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> >> >         if (error <= 0)
> >> >                 return 0;
> >>
> >> I think this might be more readable if the getxattr calls were
> >> standardized (one returns 1, the other returns 0, with inverted tests
> >> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> >> return 0.
> >
> > Hm.  Yeah I can see where that would be confusing.  I can change the flow
> > to make the checks the same.
> >
> >> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> >  int cap_inode_killpriv(struct dentry *dentry)
> >> >  {
> >> >         struct inode *inode = d_backing_inode(dentry);
> >> > +       int ret1, ret2;
> >> >
> >> >         if (!inode->i_op->removexattr)
> >> >                return 0;
> >> >
> >> > -       return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > +       ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > +       ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> >> > +
> >> > +       if (ret1 != 0)
> >> > +               return ret1;
> >> > +       return ret2;
> >> >  }
> >> >
> >> >  /*
> >> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >> >         return 0;
> >> >  }
> >> >
> >> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> >> > +{
> >> > +       struct inode *inode = d_backing_inode(dentry);
> >> > +       unsigned i;
> >> > +       u32 magic_etc;
> >> > +       ssize_t size;
> >> > +       struct vfs_ns_cap_data nscap;
> >> > +       bool foundroot = false;
> >> > +       struct user_namespace *ns;
> >> > +
> >> > +       memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >> > +
> >> > +       if (!inode || !inode->i_op->getxattr)
> >> > +               return -ENODATA;
> >> > +
> >> > +       /* verify that current or ancestor userns root owns this file */
> >> > +       for (ns = current_user_ns(); ; ns = ns->parent) {
> >> > +               if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> >> > +                       foundroot = true;
> >> > +                       break;
> >> > +               }
> >> > +               if (ns == &init_user_ns)
> >> > +                       break;
> >> > +       }
> >> > +       if (!foundroot)
> >> > +               return -ENODATA;
> >> > +
> >> > +       size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> >> > +                       &nscap, sizeof(nscap));
> >> > +       if (size == -ENODATA || size == -EOPNOTSUPP)
> >> > +               /* no data, that's ok */
> >> > +               return -ENODATA;
> >> > +       if (size < 0)
> >> > +               return size;
> >> > +       if (size != sizeof(nscap))
> >> > +               return -EINVAL;
> >> > +
> >> > +       magic_etc = le32_to_cpu(nscap.magic_etc);
> >> > +
> >> > +       if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> >> > +               return -EINVAL;
> >> > +
> >> > +       cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> >> > +       if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> >> > +               cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> >> > +       /* copy the entry */
> >> > +       CAP_FOR_EACH_U32(i) {
> >> > +               if (i >= VFS_CAP_U32_2)
> >> > +                       break;
> >> > +               cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> >> > +               cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> >> > +       }
> >> > +
> >> > +       cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +       cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +
> >> > +       return 0;
> >> > +}
> >> > +
> >> >  /*
> >> >   * Attempt to get the on-exec apply capability sets for an executable file from
> >> >   * its xattrs and, if present, apply them to the proposed credentials being
> >> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >> >         if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> >> >                 return 0;
> >> >
> >> > -       rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > +       rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > +       if (rc == -ENODATA)
> >> > +               rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >>
> >> So nscaps overrides a "regular" file cap? That might seem worth
> >> mentioning in the change log or somewhere.
> >
> > If it applies, yes.  Now maybe that's not what we want.  Maybe so long as
> > a regular one exists (which must have been set by the init_user_ns admin)
> > we should use it?  I think that makes sense, what do you think?
> 
> I struggle to see why the non-ns fscap should be honored... this is
> effectively fscap stacking, and there's no way that I see for the cap
> to leak "up" to init_user_ns. It only leaks up to nested user_nses...
> but they need to already be priv-equiv. Hmmm.
> 
> I'm CC'ing Jann who like these mind-benders... :)

Thanks. :D
I think it's safe, and it might make sense for userspace when e.g.
bind-mounting /bin into a container or so.
Theoretically, it could be an issue if a setcap binary from the init
namespace behaves insecurely when executed in another namespace (because
it is less privileged than expected), in the worst case making it
possible for someone to gain capabilities inside the namespace - but
as long as userspace behaves sanely and does proper error checking, I
think this shouldn't happen.


> > (In either case agreed it should be clearly documented)
> >
> >> >         if (rc < 0) {
> >> >                 if (rc == -EINVAL)
> >> > -                       printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> >> > -                               __func__, rc, bprm->filename);
> >> > +                       printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> >> > +                                       bprm->filename);
> >> >                 else if (rc == -ENODATA)
> >> >                         rc = 0;
> >> >                 goto out;
> >> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >> >                 return 0;
> >> >         }
> >> >
> >> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > +                       return -EPERM;
> >> > +               return 0;
> >> > +       }
> >> > +
> >> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> >             !capable(CAP_SYS_ADMIN))
> >> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >> >                 return 0;
> >> >         }
> >> >
> >> > +       if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > +               if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > +                       return -EPERM;
> >> > +               return 0;
> >> > +       }
> >> > +
> >>
> >> This looks like userspace must knowingly be aware that it is in a
> >> namespace and to DTRT instead of it being translated by the kernel
> >> when setxattr is called under !init_user_ns?
> >
> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
> > shows you are in init_user_ns then it uses security.capability, otherwise
> > it uses security.nscapability.
> >
> > I've occasionally considered having the xattr code do the quiet
> > substitution if need be.
> >
> > In fact, much of this structure comes from when I was still trying to
> > do multiple values per xattr.  Given what we're doing here, we could
> > keep the xattr contents exactly the same, just changing the name.
> > So userspace could just get and set security.capability;  if you are
> > in a non-init user_ns, if security.capability is set then you cannot
> > set it;  if security.capability is not set, then the kernel writes
> > security.nscapability instead and returns success.
> >
> > I don't like magic, but this might be just straightforward enough
> > to not be offensive.  Thoughts?
> 
> Yeah, I think it might be better to have the magic in this case, since
> it seems weird to just reject setxattr if a tool didn't realize it was
> in a namespace. I'm not sure -- it is also nice to have an explicit
> API here.
> 
> I would defer to Eric or Michael on that. I keep going back and forth,
> though I suspect it's probably best to do what you already have
> (explicit API).
> 
> -Kees
> 
> >
> >> >         if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> >                      sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> >             !capable(CAP_SYS_ADMIN))
> >> > --
> >> > 1.7.9.5
> >> >
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Chrome OS & Brillo Security
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@lists.linux-foundation.org
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
> 
> 
> 
> -- 
> Kees Cook
> Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-04-26 22:39       ` Kees Cook
  2016-04-27  4:39         ` Serge E. Hallyn
  2016-04-27  8:09         ` Jann Horn
@ 2016-05-02  3:54         ` Serge E. Hallyn
  2016-05-02 18:31           ` Michael Kerrisk (man-pages)
  2016-05-02 21:31           ` Eric W. Biederman
  2 siblings, 2 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-02  3:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: Serge E. Hallyn, Michael Kerrisk-manpages, Eric W. Biederman,
	Jann Horn, Serge E. Hallyn, Linux API, Linux Containers, LKML,
	Andy Lutomirski, Andrew G. Morgan

On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Kees Cook (keescook@chromium.org):
> >> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
...
> >> This looks like userspace must knowingly be aware that it is in a
> >> namespace and to DTRT instead of it being translated by the kernel
> >> when setxattr is called under !init_user_ns?
> >
> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
> > shows you are in init_user_ns then it uses security.capability, otherwise
> > it uses security.nscapability.
> >
> > I've occasionally considered having the xattr code do the quiet
> > substitution if need be.
> >
> > In fact, much of this structure comes from when I was still trying to
> > do multiple values per xattr.  Given what we're doing here, we could
> > keep the xattr contents exactly the same, just changing the name.
> > So userspace could just get and set security.capability;  if you are
> > in a non-init user_ns, if security.capability is set then you cannot
> > set it;  if security.capability is not set, then the kernel writes
> > security.nscapability instead and returns success.
> >
> > I don't like magic, but this might be just straightforward enough
> > to not be offensive.  Thoughts?
> 
> Yeah, I think it might be better to have the magic in this case, since
> it seems weird to just reject setxattr if a tool didn't realize it was
> in a namespace. I'm not sure -- it is also nice to have an explicit
> API here.
> 
> I would defer to Eric or Michael on that. I keep going back and forth,
> though I suspect it's probably best to do what you already have
> (explicit API).

Michael, Eric, what do you think?  The choice we're making here is
whether we should

1. Keep a nice simple separate pair of xattrs, the pre-existing
security.capability which can only be written from init_user_ns,
and the new (in this patch) security.nscapability which you can
write to any file where you are privileged wrt the file.

2. Make security.capability somewhat 'magic' - if someone in a
non-initial user ns tries to write it and has privilege wrt the
file, then the kernel silently writes security.nscapability instead.

The biggest drawback of (1) would be any tar-like program trying
to restore a file which had security.capability, needing to know
to detect its userns and write the security.nscapability instead.
The drawback of (2) is ~\o/~ magic.

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-02  3:54         ` Serge E. Hallyn
@ 2016-05-02 18:31           ` Michael Kerrisk (man-pages)
  2016-05-02 21:31           ` Eric W. Biederman
  1 sibling, 0 replies; 33+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-05-02 18:31 UTC (permalink / raw)
  To: Serge E. Hallyn, Kees Cook
  Cc: mtk.manpages, Eric W. Biederman, Jann Horn, Serge E. Hallyn,
	Linux API, Linux Containers, LKML, Andy Lutomirski,
	Andrew G. Morgan

On 05/02/2016 05:54 AM, Serge E. Hallyn wrote:
> On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
>> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>> Quoting Kees Cook (keescook@chromium.org):
>>>> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
>>>>> From: Serge Hallyn <serge.hallyn@ubuntu.com>
> ...
>>>> This looks like userspace must knowingly be aware that it is in a
>>>> namespace and to DTRT instead of it being translated by the kernel
>>>> when setxattr is called under !init_user_ns?
>>>
>>> Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
>>> shows you are in init_user_ns then it uses security.capability, otherwise
>>> it uses security.nscapability.
>>>
>>> I've occasionally considered having the xattr code do the quiet
>>> substitution if need be.
>>>
>>> In fact, much of this structure comes from when I was still trying to
>>> do multiple values per xattr.  Given what we're doing here, we could
>>> keep the xattr contents exactly the same, just changing the name.
>>> So userspace could just get and set security.capability;  if you are
>>> in a non-init user_ns, if security.capability is set then you cannot
>>> set it;  if security.capability is not set, then the kernel writes
>>> security.nscapability instead and returns success.
>>>
>>> I don't like magic, but this might be just straightforward enough
>>> to not be offensive.  Thoughts?
>>
>> Yeah, I think it might be better to have the magic in this case, since
>> it seems weird to just reject setxattr if a tool didn't realize it was
>> in a namespace. I'm not sure -- it is also nice to have an explicit
>> API here.
>>
>> I would defer to Eric or Michael on that. I keep going back and forth,
>> though I suspect it's probably best to do what you already have
>> (explicit API).
> 
> Michael, Eric, what do you think?  The choice we're making here is
> whether we should
> 
> 1. Keep a nice simple separate pair of xattrs, the pre-existing
> security.capability which can only be written from init_user_ns,
> and the new (in this patch) security.nscapability which you can
> write to any file where you are privileged wrt the file.
> 
> 2. Make security.capability somewhat 'magic' - if someone in a
> non-initial user ns tries to write it and has privilege wrt the
> file, then the kernel silently writes security.nscapability instead.
> 
> The biggest drawback of (1) would be any tar-like program trying
> to restore a file which had security.capability, needing to know
> to detect its userns and write the security.nscapability instead.
> The drawback of (2) is ~\o/~ magic.

I have only (minor) thoughts from the interface perspective.
(1) Sounds the source of possibly unpleasant surprises.
(2) Is a little surprising, but less so if it's well documented,
and it saves us the surprises of (1). So, (2) sounds better.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-02  3:54         ` Serge E. Hallyn
  2016-05-02 18:31           ` Michael Kerrisk (man-pages)
@ 2016-05-02 21:31           ` Eric W. Biederman
       [not found]             ` <CALQRfL7mfpyudWs4Z8W5Zi8CTG-9O0OvrCnRU7pk0MXtsLBd0A@mail.gmail.com>
  1 sibling, 1 reply; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-02 21:31 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kees Cook, Michael Kerrisk-manpages, Jann Horn, Serge E. Hallyn,
	Linux API, Linux Containers, LKML, Andy Lutomirski,
	Andrew G. Morgan

"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
>> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> > Quoting Kees Cook (keescook@chromium.org):
>> >> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
>> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> ...
>> >> This looks like userspace must knowingly be aware that it is in a
>> >> namespace and to DTRT instead of it being translated by the kernel
>> >> when setxattr is called under !init_user_ns?
>> >
>> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If that
>> > shows you are in init_user_ns then it uses security.capability, otherwise
>> > it uses security.nscapability.
>> >
>> > I've occasionally considered having the xattr code do the quiet
>> > substitution if need be.
>> >
>> > In fact, much of this structure comes from when I was still trying to
>> > do multiple values per xattr.  Given what we're doing here, we could
>> > keep the xattr contents exactly the same, just changing the name.
>> > So userspace could just get and set security.capability;  if you are
>> > in a non-init user_ns, if security.capability is set then you cannot
>> > set it;  if security.capability is not set, then the kernel writes
>> > security.nscapability instead and returns success.
>> >
>> > I don't like magic, but this might be just straightforward enough
>> > to not be offensive.  Thoughts?
>> 
>> Yeah, I think it might be better to have the magic in this case, since
>> it seems weird to just reject setxattr if a tool didn't realize it was
>> in a namespace. I'm not sure -- it is also nice to have an explicit
>> API here.
>> 
>> I would defer to Eric or Michael on that. I keep going back and forth,
>> though I suspect it's probably best to do what you already have
>> (explicit API).
>
> Michael, Eric, what do you think?  The choice we're making here is
> whether we should
>
> 1. Keep a nice simple separate pair of xattrs, the pre-existing
> security.capability which can only be written from init_user_ns,
> and the new (in this patch) security.nscapability which you can
> write to any file where you are privileged wrt the file.
>
> 2. Make security.capability somewhat 'magic' - if someone in a
> non-initial user ns tries to write it and has privilege wrt the
> file, then the kernel silently writes security.nscapability instead.
>
> The biggest drawback of (1) would be any tar-like program trying
> to restore a file which had security.capability, needing to know
> to detect its userns and write the security.nscapability instead.
> The drawback of (2) is ~\o/~ magic.

Apologies for not having followed this more closely before.

I don't like either option.  I think we will be in much better shape if
we upgrade the capability xattr.  It seems totally wrong or at least
confusing for a file to have both capability xattrs.

Just using security.capability allows us to confront any weird issues
with mixing both the old semantics and the new semantics.

We had previously discussioned extending the capbility a little and
adding a uid who needed to be the root uid in a user namespace, to be
valid.  Using the owner of the file seems simpler, and even a little
more transparent as this makes the security.capability xattr a limited
form of setuid (which it semantically is).

So I believe the new semantics in general are an improvement.


Given the expected use case let me ask as simple question: Are there any
known cases where the owner of a setcap exectuable is not root?

I expect the pile of setcap exectuables is small enough we can go
through the top distros and look at all of the setcap executlables.


If there is not a need to support setcap executables owned by non-root,
I suspect the right play is to just change the semantics to always treat
the security.capability attribute this way.

If there is a need to support setcap exectualbes owned by non-root,
then the current implementation is most likely unacceptable.  As that
problem case can not work in a container.



My guess is that we can just reinterpret the current security.capable to
only be valid when root owns the file in the initial user namespace.  At
which point backwards compatibility becomes trivial as the
security.capable does not change, just the rules for setting it, and
interpreting it.


We should also ensure that the gid of the file is mapped into the user
namespace where the uid is the root of the user namespace.  So that we
are effectively testing capable_wrtuid_and_gid on execute as well a
read/write of the the xattr.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
       [not found]             ` <CALQRfL7mfpyudWs4Z8W5Zi8CTG-9O0OvrCnRU7pk0MXtsLBd0A@mail.gmail.com>
@ 2016-05-03  4:50               ` Eric W. Biederman
  2016-05-10 19:00                 ` Serge E. Hallyn
  2016-05-03  5:19               ` Serge E. Hallyn
  1 sibling, 1 reply; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-03  4:50 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Kees Cook, Serge E. Hallyn, Jann Horn, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers,
	LKML

"Andrew G. Morgan" <morgan@kernel.org> writes:

> On 2 May 2016 6:04 p.m., "Eric W. Biederman" <ebiederm@xmission.com>
> wrote:
>>
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>
>> > On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
>> >> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn
> <serge@hallyn.com> wrote:
>> >> > Quoting Kees Cook (keescook@chromium.org):
>> >> >> On Fri, Apr 22, 2016 at 10:26 AM, <serge.hallyn@ubuntu.com>
> wrote:
>> >> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
>> > ...
>> >> >> This looks like userspace must knowingly be aware that it is
> in a
>> >> >> namespace and to DTRT instead of it being translated by the
> kernel
>> >> >> when setxattr is called under !init_user_ns?
>> >> >
>> >> > Yes - my libcap2 patch checks /proc/self/uid_map to decide
> that. If that
>> >> > shows you are in init_user_ns then it uses security.capability,
> otherwise
>> >> > it uses security.nscapability.
>> >> >
>> >> > I've occasionally considered having the xattr code do the quiet
>> >> > substitution if need be.
>> >> >
>> >> > In fact, much of this structure comes from when I was still
> trying to
>> >> > do multiple values per xattr. Given what we're doing here, we
> could
>> >> > keep the xattr contents exactly the same, just changing the
> name.
>> >> > So userspace could just get and set security.capability; if you
> are
>> >> > in a non-init user_ns, if security.capability is set then you
> cannot
>> >> > set it; if security.capability is not set, then the kernel
> writes
>> >> > security.nscapability instead and returns success.
>> >> >
>> >> > I don't like magic, but this might be just straightforward
> enough
>> >> > to not be offensive. Thoughts?
>> >>
>> >> Yeah, I think it might be better to have the magic in this case,
> since
>> >> it seems weird to just reject setxattr if a tool didn't realize
> it was
>> >> in a namespace. I'm not sure -- it is also nice to have an
> explicit
>> >> API here.
>> >>
>> >> I would defer to Eric or Michael on that. I keep going back and
> forth,
>> >> though I suspect it's probably best to do what you already have
>> >> (explicit API).
>> >
>> > Michael, Eric, what do you think? The choice we're making here is
>> > whether we should
>> >
>> > 1. Keep a nice simple separate pair of xattrs, the pre-existing
>> > security.capability which can only be written from init_user_ns,
>> > and the new (in this patch) security.nscapability which you can
>> > write to any file where you are privileged wrt the file.
>> >
>> > 2. Make security.capability somewhat 'magic' - if someone in a
>> > non-initial user ns tries to write it and has privilege wrt the
>> > file, then the kernel silently writes security.nscapability
> instead.
>> >
>> > The biggest drawback of (1) would be any tar-like program trying
>> > to restore a file which had security.capability, needing to know
>> > to detect its userns and write the security.nscapability instead.
>> > The drawback of (2) is ~\o/~ magic.
>>
>> Apologies for not having followed this more closely before.
>>
>> I don't like either option. I think we will be in much better shape
> if
>> we upgrade the capability xattr. It seems totally wrong or at least
>> confusing for a file to have both capability xattrs.
>>
>> Just using security.capability allows us to confront any weird
> issues
>> with mixing both the old semantics and the new semantics.
>>
>> We had previously discussioned extending the capbility a little and
>> adding a uid who needed to be the root uid in a user namespace, to
> be
>> valid. Using the owner of the file seems simpler, and even a little
>> more transparent as this makes the security.capability xattr a
> limited
>> form of setuid (which it semantically is).
>>
>> So I believe the new semantics in general are an improvement.
>>
>>
>> Given the expected use case let me ask as simple question: Are there
> any
>> known cases where the owner of a setcap exectuable is not root?
>>
>> I expect the pile of setcap exectuables is small enough we can go
>> through the top distros and look at all of the setcap executlables.
>>
>>
>> If there is not a need to support setcap executables owned by
> non-root,
>> I suspect the right play is to just change the semantics to always
> treat
>> the security.capability attribute this way.
>>
>
> I guess I'm confused how we have strayed so far that this isn't an
> obvious requirement. Uid=0 as being the root of privilege was the
> basic problem that capabilities were designed to change.

uid==0 as the owner of a file is slightly different from uid==0 of a
running process.  Last I checked if it is installed as part of a
distribution the programs are owned by root by default.

> Uid is an acl concept. Capabilities are supposed to be independent of
> that.

I don't have a clue what you mean.  Posix capabilities on executables
are part of discretionary access control.  Whatever their rules posix
capabilities are just watered down versions of the permissions of
a setuid root exectuable.  I don't think anyone has ever actually run a
system with setuid root exectuables not being special.  If you are
thinking of what any other system call capabilities unix calls those are
file descriptors.

I don't think it is necessarily wrong that files that hold exectuable
programs need to be owned by a user that is trusted to install files
system wide.  So far that user in my limited sampling that user is
always root.  Given that installing a program like that is fundamentally
a very privileged role we may not be able to break it up successfully,
so root as the owner of program files seems to make a lot of sense.

How would you design a system wide program installer in a root less
system?

Does anyone know any linux based system that uses file capabilities on
executables without setting the executable to be owned by root?  One
example would be all that it takes to shut down this marvelous
simplification that I see.


I strongly suspect the reality is that all that exists are watered down
setuid root exectuables.  If that is indeed the case we can safely let
file caps only be valid if root owns the file.  That would be convinient
at it is much simpler to understand and implement and audit.


I really don't care either way except that I like simpler code, and I
like not breaking userspace.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
       [not found]             ` <CALQRfL7mfpyudWs4Z8W5Zi8CTG-9O0OvrCnRU7pk0MXtsLBd0A@mail.gmail.com>
  2016-05-03  4:50               ` Eric W. Biederman
@ 2016-05-03  5:19               ` Serge E. Hallyn
  2016-05-03  5:54                 ` Eric W. Biederman
  1 sibling, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-03  5:19 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Eric W. Biederman, Kees Cook, Serge E. Hallyn, Jann Horn,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

Quoting Andrew G. Morgan (morgan@kernel.org):
> On 2 May 2016 6:04 p.m., "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> >
> > > On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
> > >> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <serge@hallyn.com>
> wrote:
> > >> > Quoting Kees Cook (keescook@chromium.org):
> > >> >> On Fri, Apr 22, 2016 at 10:26 AM,  <serge.hallyn@ubuntu.com> wrote:
> > >> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> > > ...
> > >> >> This looks like userspace must knowingly be aware that it is in a
> > >> >> namespace and to DTRT instead of it being translated by the kernel
> > >> >> when setxattr is called under !init_user_ns?
> > >> >
> > >> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that.  If
> that
> > >> > shows you are in init_user_ns then it uses security.capability,
> otherwise
> > >> > it uses security.nscapability.
> > >> >
> > >> > I've occasionally considered having the xattr code do the quiet
> > >> > substitution if need be.
> > >> >
> > >> > In fact, much of this structure comes from when I was still trying to
> > >> > do multiple values per xattr.  Given what we're doing here, we could
> > >> > keep the xattr contents exactly the same, just changing the name.
> > >> > So userspace could just get and set security.capability;  if you are
> > >> > in a non-init user_ns, if security.capability is set then you cannot
> > >> > set it;  if security.capability is not set, then the kernel writes
> > >> > security.nscapability instead and returns success.
> > >> >
> > >> > I don't like magic, but this might be just straightforward enough
> > >> > to not be offensive.  Thoughts?
> > >>
> > >> Yeah, I think it might be better to have the magic in this case, since
> > >> it seems weird to just reject setxattr if a tool didn't realize it was
> > >> in a namespace. I'm not sure -- it is also nice to have an explicit
> > >> API here.
> > >>
> > >> I would defer to Eric or Michael on that. I keep going back and forth,
> > >> though I suspect it's probably best to do what you already have
> > >> (explicit API).
> > >
> > > Michael, Eric, what do you think?  The choice we're making here is
> > > whether we should
> > >
> > > 1. Keep a nice simple separate pair of xattrs, the pre-existing
> > > security.capability which can only be written from init_user_ns,
> > > and the new (in this patch) security.nscapability which you can
> > > write to any file where you are privileged wrt the file.
> > >
> > > 2. Make security.capability somewhat 'magic' - if someone in a
> > > non-initial user ns tries to write it and has privilege wrt the
> > > file, then the kernel silently writes security.nscapability instead.
> > >
> > > The biggest drawback of (1) would be any tar-like program trying
> > > to restore a file which had security.capability, needing to know
> > > to detect its userns and write the security.nscapability instead.
> > > The drawback of (2) is ~\o/~ magic.
> >
> > Apologies for not having followed this more closely before.
> >
> > I don't like either option.  I think we will be in much better shape if
> > we upgrade the capability xattr.  It seems totally wrong or at least
> > confusing for a file to have both capability xattrs.
> >
> > Just using security.capability allows us to confront any weird issues
> > with mixing both the old semantics and the new semantics.
> >
> > We had previously discussioned extending the capbility a little and
> > adding a uid who needed to be the root uid in a user namespace, to be
> > valid.  Using the owner of the file seems simpler, and even a little
> > more transparent as this makes the security.capability xattr a limited
> > form of setuid (which it semantically is).
> >
> > So I believe the new semantics in general are an improvement.
> >
> >
> > Given the expected use case let me ask as simple question: Are there any
> > known cases where the owner of a setcap exectuable is not root?
> >
> > I expect the pile of setcap exectuables is small enough we can go
> > through the top distros and look at all of the setcap executlables.
> >
> >
> > If there is not a need to support setcap executables owned by non-root,
> > I suspect the right play is to just change the semantics to always treat
> > the security.capability attribute this way.
> >
> 
> I guess I'm confused how we have strayed so far that this isn't an obvious
> requirement. Uid=0 as being the root of privilege was the basic problem
> that capabilities were designed to change.

The task executing the file can be any uid mapped into the namespace.  The
file only has to be owned by the root of the user_ns.  Which I agree is
unfortunate.  We can work around it by putting the root uid into the xattr
itself (which still isn't orthogonal but allows the file to at least by
owned by non-root), but the problem then is that a task needs to know its
global root k_uid just to write the xattr.

> Uid is an acl concept. Capabilities are supposed to be independent of that.
> 
> If we want to support NS file capabilities I would look at replacing the
> xattr syscall with a dedicated file capabilities modification syscall. Then

That was one ofthe possibilities I'd mentioned in my earlier proposal,
fwiw.  The problem is if we want tar to still work unmodified then
simple xattr operations still have to work.

Maybe there's workable semantics there though.  Worth thinking about.

> namespace partitioning and file capabilities can be managed in the kernel
> with respect to the prevailing namespace, and not by some hybrid userspace
> kernel convention.
> 
> Cheers
> 
> Andrew
> 
> > If there is a need to support setcap exectualbes owned by non-root,
> > then the current implementation is most likely unacceptable.  As that
> > problem case can not work in a container.
> >
> >
> >
> > My guess is that we can just reinterpret the current security.capable to
> > only be valid when root owns the file in the initial user namespace.  At
> > which point backwards compatibility becomes trivial as the
> > security.capable does not change, just the rules for setting it, and
> > interpreting it.
> >
> >
> > We should also ensure that the gid of the file is mapped into the user
> > namespace where the uid is the root of the user namespace.  So that we
> > are effectively testing capable_wrtuid_and_gid on execute as well a
> > read/write of the the xattr.
> >
> > Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-03  5:19               ` Serge E. Hallyn
@ 2016-05-03  5:54                 ` Eric W. Biederman
  2016-05-03 14:25                   ` Serge E. Hallyn
  2016-05-07 23:10                   ` Jann Horn
  0 siblings, 2 replies; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-03  5:54 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andrew G. Morgan, Kees Cook, Jann Horn, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers,
	LKML

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Andrew G. Morgan (morgan@kernel.org):
>> 
>> I guess I'm confused how we have strayed so far that this isn't an obvious
>> requirement. Uid=0 as being the root of privilege was the basic problem
>> that capabilities were designed to change.
>
> The task executing the file can be any uid mapped into the namespace.  The
> file only has to be owned by the root of the user_ns.  Which I agree is
> unfortunate.  We can work around it by putting the root uid into the xattr
> itself (which still isn't orthogonal but allows the file to at least by
> owned by non-root), but the problem then is that a task needs to know its
> global root k_uid just to write the xattr.

The root kuid is just make_kuids(user_ns, 0) so it is easy to find.

It might be a hair better to use the userns->owner instead of the root
uid.  That would allow user namespaces without a mapped root to still
use file capabilities.

>> Uid is an acl concept. Capabilities are supposed to be independent of that.
>> 
>> If we want to support NS file capabilities I would look at replacing the
>> xattr syscall with a dedicated file capabilities modification syscall. Then
>
> That was one ofthe possibilities I'd mentioned in my earlier proposal,
> fwiw.  The problem is if we want tar to still work unmodified then
> simple xattr operations still have to work.
>
> Maybe there's workable semantics there though.  Worth thinking about.

If the problem is compatibilty please look at
posix_acl_fix_xattr_from_user.  With something similar for the
security.capability attribute we can perform whatever transformation
makes sense.  I admit adding 4 bytes is a bit of a pain in that context
but not a big one.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-03  5:54                 ` Eric W. Biederman
@ 2016-05-03 14:25                   ` Serge E. Hallyn
  2016-05-10 19:03                     ` Serge E. Hallyn
  2016-05-07 23:10                   ` Jann Horn
  1 sibling, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-03 14:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Andrew G. Morgan, Kees Cook, Jann Horn,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Andrew G. Morgan (morgan@kernel.org):
> >> 
> >> I guess I'm confused how we have strayed so far that this isn't an obvious
> >> requirement. Uid=0 as being the root of privilege was the basic problem
> >> that capabilities were designed to change.
> >
> > The task executing the file can be any uid mapped into the namespace.  The
> > file only has to be owned by the root of the user_ns.  Which I agree is
> > unfortunate.  We can work around it by putting the root uid into the xattr
> > itself (which still isn't orthogonal but allows the file to at least by
> > owned by non-root), but the problem then is that a task needs to know its
> > global root k_uid just to write the xattr.
> 
> The root kuid is just make_kuids(user_ns, 0) so it is easy to find.
>
> It might be a hair better to use the userns->owner instead of the root
> uid.  That would allow user namespaces without a mapped root to still
> use file capabilities.

That's all fine if the kernel does it for us magically.  Which is what we're
talking about below.  Above I was talking about userspace putting it into
the xattr.

> >> Uid is an acl concept. Capabilities are supposed to be independent of that.
> >> 
> >> If we want to support NS file capabilities I would look at replacing the
> >> xattr syscall with a dedicated file capabilities modification syscall. Then
> >
> > That was one ofthe possibilities I'd mentioned in my earlier proposal,
> > fwiw.  The problem is if we want tar to still work unmodified then
> > simple xattr operations still have to work.
> >
> > Maybe there's workable semantics there though.  Worth thinking about.
> 
> If the problem is compatibilty please look at
> posix_acl_fix_xattr_from_user.  With something similar for the

All right.  Excellent.  I simply didn't think something like that would
be acceptable.  I tend to think of xattrs as just out of band file contents,
but generally under user control.  I guess that's not right.

> security.capability attribute we can perform whatever transformation
> makes sense.  I admit adding 4 bytes is a bit of a pain in that context
> but not a big one.

If we can do all the magic in the kernel behind the scenes, then I
absolutely do not mind adding a new security.capability version with 4
more bytes.  Userspace can just write the old xattr format with the new
version number, kernel fills in the userns owner kuid.  It's what I
originally wanted to do, but didn't think was acceptable.

Sounds great!

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-03  5:54                 ` Eric W. Biederman
  2016-05-03 14:25                   ` Serge E. Hallyn
@ 2016-05-07 23:10                   ` Jann Horn
  2016-05-11 21:02                     ` Serge E. Hallyn
  1 sibling, 1 reply; 33+ messages in thread
From: Jann Horn @ 2016-05-07 23:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Andrew G. Morgan, Kees Cook,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

[-- Attachment #1: Type: text/plain, Size: 1667 bytes --]

On Tue, May 03, 2016 at 12:54:40AM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Andrew G. Morgan (morgan@kernel.org):
> >> 
> >> I guess I'm confused how we have strayed so far that this isn't an obvious
> >> requirement. Uid=0 as being the root of privilege was the basic problem
> >> that capabilities were designed to change.
> >
> > The task executing the file can be any uid mapped into the namespace.  The
> > file only has to be owned by the root of the user_ns.  Which I agree is
> > unfortunate.  We can work around it by putting the root uid into the xattr
> > itself (which still isn't orthogonal but allows the file to at least by
> > owned by non-root), but the problem then is that a task needs to know its
> > global root k_uid just to write the xattr.
> 
> The root kuid is just make_kuids(user_ns, 0) so it is easy to find.
> 
> It might be a hair better to use the userns->owner instead of the root
> uid.  That would allow user namespaces without a mapped root to still
> use file capabilities.

Please don't do that. A user might want to create multiple containers with
isolated security properties, and in that case, it would be bad if binaries
that are capable in one container are also automatically valid in the user's
other containers.
Also, this would mean that in an owner!=root scenario, container root can't
setcap executables and needs to ask the administrator of the surrounding system
to do it.
(Of course, this could be worked around using a dummy userns layer between the
init ns and the container, but I don't like seeing new reasons for such a hack.)

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-03  4:50               ` Eric W. Biederman
@ 2016-05-10 19:00                 ` Serge E. Hallyn
  0 siblings, 0 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-10 19:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew G. Morgan, Kees Cook, Serge E. Hallyn, Jann Horn,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Andrew G. Morgan" <morgan@kernel.org> writes:
> 
> > On 2 May 2016 6:04 p.m., "Eric W. Biederman" <ebiederm@xmission.com>
> > wrote:
> >>
> >> "Serge E. Hallyn" <serge@hallyn.com> writes:
> >>
> >> > On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
> >> >> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn
> > <serge@hallyn.com> wrote:
> >> >> > Quoting Kees Cook (keescook@chromium.org):
> >> >> >> On Fri, Apr 22, 2016 at 10:26 AM, <serge.hallyn@ubuntu.com>
> > wrote:
> >> >> >> > From: Serge Hallyn <serge.hallyn@ubuntu.com>
> >> > ...
> >> >> >> This looks like userspace must knowingly be aware that it is
> > in a
> >> >> >> namespace and to DTRT instead of it being translated by the
> > kernel
> >> >> >> when setxattr is called under !init_user_ns?
> >> >> >
> >> >> > Yes - my libcap2 patch checks /proc/self/uid_map to decide
> > that. If that
> >> >> > shows you are in init_user_ns then it uses security.capability,
> > otherwise
> >> >> > it uses security.nscapability.
> >> >> >
> >> >> > I've occasionally considered having the xattr code do the quiet
> >> >> > substitution if need be.
> >> >> >
> >> >> > In fact, much of this structure comes from when I was still
> > trying to
> >> >> > do multiple values per xattr. Given what we're doing here, we
> > could
> >> >> > keep the xattr contents exactly the same, just changing the
> > name.
> >> >> > So userspace could just get and set security.capability; if you
> > are
> >> >> > in a non-init user_ns, if security.capability is set then you
> > cannot
> >> >> > set it; if security.capability is not set, then the kernel
> > writes
> >> >> > security.nscapability instead and returns success.
> >> >> >
> >> >> > I don't like magic, but this might be just straightforward
> > enough
> >> >> > to not be offensive. Thoughts?
> >> >>
> >> >> Yeah, I think it might be better to have the magic in this case,
> > since
> >> >> it seems weird to just reject setxattr if a tool didn't realize
> > it was
> >> >> in a namespace. I'm not sure -- it is also nice to have an
> > explicit
> >> >> API here.
> >> >>
> >> >> I would defer to Eric or Michael on that. I keep going back and
> > forth,
> >> >> though I suspect it's probably best to do what you already have
> >> >> (explicit API).
> >> >
> >> > Michael, Eric, what do you think? The choice we're making here is
> >> > whether we should
> >> >
> >> > 1. Keep a nice simple separate pair of xattrs, the pre-existing
> >> > security.capability which can only be written from init_user_ns,
> >> > and the new (in this patch) security.nscapability which you can
> >> > write to any file where you are privileged wrt the file.
> >> >
> >> > 2. Make security.capability somewhat 'magic' - if someone in a
> >> > non-initial user ns tries to write it and has privilege wrt the
> >> > file, then the kernel silently writes security.nscapability
> > instead.
> >> >
> >> > The biggest drawback of (1) would be any tar-like program trying
> >> > to restore a file which had security.capability, needing to know
> >> > to detect its userns and write the security.nscapability instead.
> >> > The drawback of (2) is ~\o/~ magic.
> >>
> >> Apologies for not having followed this more closely before.
> >>
> >> I don't like either option. I think we will be in much better shape
> > if
> >> we upgrade the capability xattr. It seems totally wrong or at least
> >> confusing for a file to have both capability xattrs.
> >>
> >> Just using security.capability allows us to confront any weird
> > issues
> >> with mixing both the old semantics and the new semantics.
> >>
> >> We had previously discussioned extending the capbility a little and
> >> adding a uid who needed to be the root uid in a user namespace, to
> > be
> >> valid. Using the owner of the file seems simpler, and even a little
> >> more transparent as this makes the security.capability xattr a
> > limited
> >> form of setuid (which it semantically is).
> >>
> >> So I believe the new semantics in general are an improvement.
> >>
> >>
> >> Given the expected use case let me ask as simple question: Are there
> > any
> >> known cases where the owner of a setcap exectuable is not root?
> >>
> >> I expect the pile of setcap exectuables is small enough we can go
> >> through the top distros and look at all of the setcap executlables.
> >>
> >>
> >> If there is not a need to support setcap executables owned by
> > non-root,
> >> I suspect the right play is to just change the semantics to always
> > treat
> >> the security.capability attribute this way.
> >>
> >
> > I guess I'm confused how we have strayed so far that this isn't an
> > obvious requirement. Uid=0 as being the root of privilege was the
> > basic problem that capabilities were designed to change.
> 
> uid==0 as the owner of a file is slightly different from uid==0 of a
> running process.  Last I checked if it is installed as part of a
> distribution the programs are owned by root by default.

Note that this does mean that a user namespace without a mapping for a
uid 0 cannot use file capabilities.

But I'm not sure there is a way around that.  Even if we store the
userns identifier in an xattr instead of using the file owner, we still
need to uniquely identify the namespace somehow, and as Jann pointed
out using the namespace creator uid is non-ideal as it means all namespaces
even with disjoint uid mappings can mess with each other.

> > Uid is an acl concept. Capabilities are supposed to be independent of
> > that.
> 
> I don't have a clue what you mean.  Posix capabilities on executables
> are part of discretionary access control.  Whatever their rules posix
> capabilities are just watered down versions of the permissions of
> a setuid root exectuable.  I don't think anyone has ever actually run a
> system with setuid root exectuables not being special.  If you are

I actually suspect Andrew does, and I've done it though only as an
experiment.  (Oh - but I may mean something different than what you
mean, see below)

> thinking of what any other system call capabilities unix calls those are
> file descriptors.
> 
> I don't think it is necessarily wrong that files that hold exectuable
> programs need to be owned by a user that is trusted to install files
> system wide.  So far that user in my limited sampling that user is
> always root.  Given that installing a program like that is fundamentally
> a very privileged role we may not be able to break it up successfully,
> so root as the owner of program files seems to make a lot of sense.
> 
> How would you design a system wide program installer in a root less
> system?

The root id is still special, in the same sense as it is in plan 9 - it
is the uid which "owns" the hardware.  In linux setuid always still
works - it just can be configured to not raise/drop privileges on
setuid.  (You know all this, but someone reading along may have
forgotten)

The question then is what it should take to set and to use a file
capability.

So what we have right now is afaics a technical roadblock - in that I
really can't see a way to safely allow filecaps in a userns without a
uid 0 mapping.  It's unfortunate, and it's not a goal, it's an implementation
detail.  I personally think it is a huge improvement over what we have.
The question is does doing it this way now prevent us from doing it
the right way later if we can find a right way?

> Does anyone know any linux based system that uses file capabilities on
> executables without setting the executable to be owned by root?  One
> example would be all that it takes to shut down this marvelous
> simplification that I see.
> 
> 
> I strongly suspect the reality is that all that exists are watered down
> setuid root exectuables.  If that is indeed the case we can safely let
> file caps only be valid if root owns the file.  That would be convinient
> at it is much simpler to understand and implement and audit.
> 
> 
> I really don't care either way except that I like simpler code, and I
> like not breaking userspace.
> 
> Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-03 14:25                   ` Serge E. Hallyn
@ 2016-05-10 19:03                     ` Serge E. Hallyn
  0 siblings, 0 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-10 19:03 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Andrew G. Morgan, Kees Cook, Jann Horn,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

Quoting Serge E. Hallyn (serge@hallyn.com):
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> > 
> > > Quoting Andrew G. Morgan (morgan@kernel.org):
> > >> 
> > >> I guess I'm confused how we have strayed so far that this isn't an obvious
> > >> requirement. Uid=0 as being the root of privilege was the basic problem
> > >> that capabilities were designed to change.
> > >
> > > The task executing the file can be any uid mapped into the namespace.  The
> > > file only has to be owned by the root of the user_ns.  Which I agree is
> > > unfortunate.  We can work around it by putting the root uid into the xattr
> > > itself (which still isn't orthogonal but allows the file to at least by
> > > owned by non-root), but the problem then is that a task needs to know its
> > > global root k_uid just to write the xattr.
> > 
> > The root kuid is just make_kuids(user_ns, 0) so it is easy to find.
> >
> > It might be a hair better to use the userns->owner instead of the root
> > uid.  That would allow user namespaces without a mapped root to still
> > use file capabilities.
> 
> That's all fine if the kernel does it for us magically.  Which is what we're
> talking about below.  Above I was talking about userspace putting it into
> the xattr.
> 
> > >> Uid is an acl concept. Capabilities are supposed to be independent of that.
> > >> 
> > >> If we want to support NS file capabilities I would look at replacing the
> > >> xattr syscall with a dedicated file capabilities modification syscall. Then
> > >
> > > That was one ofthe possibilities I'd mentioned in my earlier proposal,
> > > fwiw.  The problem is if we want tar to still work unmodified then
> > > simple xattr operations still have to work.
> > >
> > > Maybe there's workable semantics there though.  Worth thinking about.
> > 
> > If the problem is compatibilty please look at
> > posix_acl_fix_xattr_from_user.  With something similar for the
> 
> All right.  Excellent.  I simply didn't think something like that would
> be acceptable.  I tend to think of xattrs as just out of band file contents,
> but generally under user control.  I guess that's not right.
> 
> > security.capability attribute we can perform whatever transformation
> > makes sense.  I admit adding 4 bytes is a bit of a pain in that context
> > but not a big one.
> 
> If we can do all the magic in the kernel behind the scenes, then I
> absolutely do not mind adding a new security.capability version with 4
> more bytes.  Userspace can just write the old xattr format with the new
> version number, kernel fills in the userns owner kuid.  It's what I
> originally wanted to do, but didn't think was acceptable.
> 
> Sounds great!

So I'm still mulling this over and still undecided as to whether we want to

1. leave the xattr as is and use a new pair of syscalls for setting/unsetting
filecaps.  This would truly let us hide the implementation detail of the
file having to be owned by root (apart from returning a perhaps-unexpected
EPERM when file isn't owned by uid 0, and documenting that as something that
can be changed later)

2. hide the magic in get/setxattr of security.capability.  And if we do
that, then whether to hide the security.nscapability (or newer-version
security.capbility if that's what we do).  probably not hide it...

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-07 23:10                   ` Jann Horn
@ 2016-05-11 21:02                     ` Serge E. Hallyn
  2016-05-16 21:15                       ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-11 21:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric W. Biederman, Serge E. Hallyn, Andrew G. Morgan, Kees Cook,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, LKML

Quoting Jann Horn (jann@thejh.net):
> On Tue, May 03, 2016 at 12:54:40AM -0500, Eric W. Biederman wrote:
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> > 
> > > Quoting Andrew G. Morgan (morgan@kernel.org):
> > >> 
> > >> I guess I'm confused how we have strayed so far that this isn't an obvious
> > >> requirement. Uid=0 as being the root of privilege was the basic problem
> > >> that capabilities were designed to change.
> > >
> > > The task executing the file can be any uid mapped into the namespace.  The
> > > file only has to be owned by the root of the user_ns.  Which I agree is
> > > unfortunate.  We can work around it by putting the root uid into the xattr
> > > itself (which still isn't orthogonal but allows the file to at least by
> > > owned by non-root), but the problem then is that a task needs to know its
> > > global root k_uid just to write the xattr.
> > 
> > The root kuid is just make_kuids(user_ns, 0) so it is easy to find.
> > 
> > It might be a hair better to use the userns->owner instead of the root
> > uid.  That would allow user namespaces without a mapped root to still
> > use file capabilities.
> 
> Please don't do that. A user might want to create multiple containers with
> isolated security properties, and in that case, it would be bad if binaries
> that are capable in one container are also automatically valid in the user's
> other containers.

But no, if the namespaces both created by uid 1000 have disjoint uid mappings,
say 100000-165535 and 200000-265536, then a file capability on a file owned
by 200000 would not be active when the exec()ing task has only 100000-165535
mapped.

If the uid mappings are not completely disjoint, then you cannot assume that
the user wanted the mappings to be disjoint.  In particular, while setting up
a rootfs for a container I'll frequently use small ad-hoc namespaces to chown
files.  For instance, my intended final container mapping may be 65536 uids
starting at 100000, but to chown a file to uid 5 in the container I may create
a ns with kuid 1000 as uid 0 and 100005 as uid 1.  Here the root uid doing the
writing is not even mapped into the final container namespace.

So a new approach might be to (a) note the kuid which created the
namespace in the xattr (magically written by the kernel at xattr write
time), then say that for the file capability to take effect, two things
must hold:

1. the kuid noted in the xattr must match the kuid which created the calling
task's user_ns (or any ancestor creator)

2. the file uid must map into the calling task's namespace

To write the filecap,

1. the task must be privileged over the uid which owns the file (in the
sense of capable_wrt_inode_uidgid)

2. the task must be privileged over his own user_ns

As Eric said, this should address Andrew Morgan's concern about requiring that
the file be owned by uid 0 in the namespace.

There's a problem though.  The above suffices to prevent an unprivileged user
in a user_ns from unsharing a user_ns to write a file capability and exploit
that capability in the ns where he is unprivileged.  With one exception, which
is the case where the unprivileged user is mapped to the same kuid which
created the namespace.  So if uid 1000 on the host creates a namespace
where uid 1000 maps to 1000 in the namespace, then 1000 in the namespace
can create a new user_ns, write the xattr, and exploit it from the
parent namespace.  This is not an uncommon case.  I'm not sure what to do about
it.

> Also, this would mean that in an owner!=root scenario, container root can't
> setcap executables and needs to ask the administrator of the surrounding system
> to do it.
> (Of course, this could be worked around using a dummy userns layer between the
> init ns and the container, but I don't like seeing new reasons for such a hack.)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-11 21:02                     ` Serge E. Hallyn
@ 2016-05-16 21:15                       ` Serge E. Hallyn
  2016-05-16 21:48                         ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-16 21:15 UTC (permalink / raw)
  To: LKML
  Cc: Jann Horn, Eric W. Biederman, Seth Forshee, Andrew G. Morgan,
	Kees Cook, Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers, serge

Quoting Serge E. Hallyn (serge@hallyn.com):
...
> There's a problem though.  The above suffices to prevent an unprivileged user
> in a user_ns from unsharing a user_ns to write a file capability and exploit
> that capability in the ns where he is unprivileged.  With one exception, which
> is the case where the unprivileged user is mapped to the same kuid which
> created the namespace.  So if uid 1000 on the host creates a namespace
> where uid 1000 maps to 1000 in the namespace, then 1000 in the namespace
> can create a new user_ns, write the xattr, and exploit it from the
> parent namespace.  This is not an uncommon case.  I'm not sure what to do about
> it.

Ok I think I've convinced myself that requiring a kuid 0 in the container
and storing that in the security.nscapability is best solution.  The DAC
objection is imo not really valid - we don't have to give uid 0 in the
container any special privilege, we just require that the ns have a uid 0
mapping.  I have not been able to think of any other reliable way to verify
that the writer of the capability is authorized to grant privilege to the
file when executed by current.

I'm going to proceed with another POC based on the following design:

1. no new syscalls at the moment.  You can choose to set/query
security.nscapability, but can also just set security.capability from
a user_ns and have the kernel transparently set a security.nscapability
entry for you.

2. For now just a single security.nscapability entry, but in a format
that turning it into an array will be a trivial change

3. When running file foo which has a security.nscapability for kuid 100000,
then any namespace where kuid 100000 is root - or which has an ancestor ns where
that is the case - will run the file with the listed capabilities.

4. When doing getxattr of security.capability from a user_ns, if there is a
security.capability entry, that will be returned;  else if there is a valid
security.nscapability for your ns, that will be returned.

5. when doing a setxattr of security.capability from a user_ns, if there is
a security.nscapability entry, you get EBUSY;  else a security.nscapability 
with your root kuid will be written provided that (a) you are privileged
over your namespace, (b) you are privileged over your root uid, (c) the
file owner maps into your namespace.

6. when doing a getxattr of security.nscapability, the entry will be shown
with kuid mapped into your namespace or -1 if the uid does not map into
your ns.

7. when doing a setxattr of security.nscapability, if an entry exists, you
get -EBUSY;  if you are not privileged over your ns, your root uid, and
the file owner, then you get -EPERM;  the xattr includes a uid field, which
must be either 0 or a value valid in your ns.  The value will be converted
to a kuid and stored on disk.  (Seth, I'm not sure offhand how that should
mesh with your patches, we can talk about it after I send the next patch,
which I'm quite certain will handle it wrongly)

8. If a security.capability exists, it will override any security.nscapability
at execve() (so, inverse of my previous two patches).

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] simplified security.nscapability xattr
  2016-05-16 21:15                       ` Serge E. Hallyn
@ 2016-05-16 21:48                         ` Serge E. Hallyn
  2016-05-18 21:57                           ` [PATCH RFC] user-namespaced file capabilities - now with more magic Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-16 21:48 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Jann Horn, Eric W. Biederman, Seth Forshee,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

On Mon, May 16, 2016 at 04:15:23PM -0500, Serge E. Hallyn wrote:
> Quoting Serge E. Hallyn (serge@hallyn.com):
> ...
> > There's a problem though.  The above suffices to prevent an unprivileged user
> > in a user_ns from unsharing a user_ns to write a file capability and exploit
> > that capability in the ns where he is unprivileged.  With one exception, which
> > is the case where the unprivileged user is mapped to the same kuid which
> > created the namespace.  So if uid 1000 on the host creates a namespace
> > where uid 1000 maps to 1000 in the namespace, then 1000 in the namespace
> > can create a new user_ns, write the xattr, and exploit it from the
> > parent namespace.  This is not an uncommon case.  I'm not sure what to do about
> > it.
> 
> Ok I think I've convinced myself that requiring a kuid 0 in the container
> and storing that in the security.nscapability is best solution.  The DAC
> objection is imo not really valid - we don't have to give uid 0 in the
> container any special privilege, we just require that the ns have a uid 0
> mapping.  I have not been able to think of any other reliable way to verify
> that the writer of the capability is authorized to grant privilege to the
> file when executed by current.
> 
> I'm going to proceed with another POC based on the following design:
> 
> 1. no new syscalls at the moment.  You can choose to set/query
> security.nscapability, but can also just set security.capability from
> a user_ns and have the kernel transparently set a security.nscapability
> entry for you.
> 
> 2. For now just a single security.nscapability entry, but in a format
> that turning it into an array will be a trivial change
> 
> 3. When running file foo which has a security.nscapability for kuid 100000,
> then any namespace where kuid 100000 is root - or which has an ancestor ns where
> that is the case - will run the file with the listed capabilities.
> 
> 4. When doing getxattr of security.capability from a user_ns, if there is a
> security.capability entry, that will be returned;  else if there is a valid
> security.nscapability for your ns, that will be returned.
> 
> 5. when doing a setxattr of security.capability from a user_ns, if there is
> a security.nscapability entry, you get EBUSY;  else a security.nscapability 
> with your root kuid will be written provided that (a) you are privileged
> over your namespace, (b) you are privileged over your root uid, (c) the
> file owner maps into your namespace.

Stéphane pointed out this isn't quite right.  The EBUSY will happen if
a security.nscapability is defined with a kuid over which the writer is
not privileged - else it will overwrite.  It will also happen if
security.capbility is set.

> 6. when doing a getxattr of security.nscapability, the entry will be shown
> with kuid mapped into your namespace or -1 if the uid does not map into
> your ns.
> 
> 7. when doing a setxattr of security.nscapability, if an entry exists, you
> get -EBUSY;  if you are not privileged over your ns, your root uid, and
> the file owner, then you get -EPERM;  the xattr includes a uid field, which
> must be either 0 or a value valid in your ns.  The value will be converted
> to a kuid and stored on disk.  (Seth, I'm not sure offhand how that should
> mesh with your patches, we can talk about it after I send the next patch,
> which I'm quite certain will handle it wrongly)
> 
> 8. If a security.capability exists, it will override any security.nscapability
> at execve() (so, inverse of my previous two patches).
> 
> -serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-16 21:48                         ` Serge E. Hallyn
@ 2016-05-18 21:57                           ` Serge E. Hallyn
  2016-05-19 20:53                             ` Mimi Zohar
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-18 21:57 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Jann Horn, Eric W. Biederman, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

This patch introduces a new security.nscapability xattr.  It
is mostly like security.capability, but also lists a 'rootid'.
This is the uid_t (in init_user_ns) of the root id (uid 0 in a
namespace) in whose namespaces the file capabilities may take
effect.

A privileged (cap_setfcap) process in the initial user ns may
set and read this xattr directly.  However, its real intent is
to be used as a transparent fallback in user namespaces.

Root in a user ns cannot be trusted to write security.capability
xattrs, because any user on the host could map his own uid to root
in a namespace, write the xattr, and execute the file with privilege
on the host.

With this patch, when root in a user ns asks to write security.capability,
the kernel will transparently write a security.nscapability xattr
instead, filling in the kuid of the calling user's root uid.  Subsequently,
any task executing the file which has the noted k_uid as its root uid,
or which is in a descendent user_ns of such a user_ns, will run the
file with capabilities.

When reading the security.capability xattr from a non-init user_ns, a valid
security.nscapability will be shown if it exists.  Such a task is not
allowed to read security.nscapability.  This could be accomodated, however
it requires the kernel to convert the kuid_t to a valid uid in the reader's
user_ns.  So for now it's simply not supported.

Only a single security.nscapability xattr may be written.  This patch
could be expanded to allow a list of capabilities and rootids, however
I do not believe that to be a worthwhile use case.

This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.

Note - listxattr is not being handled here.  So results of that can be
inconsistent with get/setxattr.  Fixing that will require yet more
deceit in fs/xattr.c.

Note2 - it may be less sneaky to hide all the magic behind the
security.nscapability xattr.  So userspace would need to know to
use that xattr name when needed, but with the same format as
security.capability.  The kuid_t rootid would be filled in by the
kernel.  That's a middle ground between my last patch and this one.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 fs/xattr.c                      |  18 ++-
 include/linux/capability.h      |   8 +-
 include/uapi/linux/capability.h |  19 +++
 include/uapi/linux/xattr.h      |   3 +
 security/commoncap.c            | 253 ++++++++++++++++++++++++++++++++++++++--
 5 files changed, 291 insertions(+), 10 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 4861322..5c0e7ae 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 {
 	struct inode *inode = dentry->d_inode;
 	int error = -EOPNOTSUPP;
+	void *wvalue = NULL;
+	size_t wsize = 0;
 	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
 				   XATTR_SECURITY_PREFIX_LEN);
 
-	if (issec)
+	if (issec) {
 		inode->i_flags &= ~S_NOSEC;
+		/* if root in a non-init user_ns tries to set
+		 * security.capability, write a security.nscapability
+		 * in its place */
+		if (!strcmp(name, "security.capability") &&
+				current_user_ns() != &init_user_ns) {
+			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
+			if (!wvalue)
+				return -EPERM;
+			value = wvalue;
+			size = wsize;
+			name = "security.nscapability";
+		}
+	}
 	if (inode->i_op->setxattr) {
 		error = inode->i_op->setxattr(dentry, name, value, size, flags);
 		if (!error) {
@@ -114,6 +129,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
 			fsnotify_xattr(dentry);
 	}
 
+	kfree(wvalue);
 	return error;
 }
 
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 00690ff..9376146 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
 #define _LINUX_CAPABILITY_H
 
 #include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
 
 #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
 #define _KERNEL_CAPABILITY_U32S    _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
 	kernel_cap_t inheritable;
 };
 
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
 #define _USER_CAP_HEADER_SIZE  (sizeof(struct __user_cap_header_struct))
 #define _KERNEL_CAP_T_SIZE     (sizeof(kernel_cap_t))
 
@@ -240,4 +243,7 @@ extern bool file_ns_capable(const struct file *file, struct user_namespace *ns,
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
+extern void cap_setxattr_make_nscap(struct dentry *dentry, const void *value,
+		size_t size, void **wvalue, size_t *wsize);
+
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..1f7e4c6 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION     1
+
 #define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
 #define VFS_CAP_U32             VFS_CAP_U32_2
 #define VFS_CAP_REVISION	VFS_CAP_REVISION_2
@@ -74,6 +77,22 @@ struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+#define VFS_NS_CAP_EFFECTIVE    0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+       __le32 magic_etc;
+       __le32 rootid;
+       struct {
+               __le32 permitted;    /* Little endian */
+               __le32 inheritable;  /* Little endian */
+       } data[VFS_CAP_U32];
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
 #define XATTR_POSIX_ACL_ACCESS  "posix_acl_access"
 #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
 #define XATTR_POSIX_ACL_DEFAULT  "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 48071ed..53161bc 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -310,13 +310,18 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 	struct inode *inode = d_backing_inode(dentry);
 	int error;
 
-	if (!inode->i_op->getxattr)
+	if (!inode || !inode->i_op->getxattr)
 	       return 0;
 
+	error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+	if (error > 0)
+		return 1;
+
 	error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
-	if (error <= 0)
-		return 0;
-	return 1;
+	if (error > 0)
+		return 1;
+
+	return 0;
 }
 
 /**
@@ -330,11 +335,155 @@ int cap_inode_need_killpriv(struct dentry *dentry)
 int cap_inode_killpriv(struct dentry *dentry)
 {
 	struct inode *inode = d_backing_inode(dentry);
+	int ret1, ret2;
 
-	if (!inode->i_op->removexattr)
+	if (!inode || !inode->i_op->removexattr)
 	       return 0;
 
-	return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+	ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+	if (ret1 != 0 && ret1 != -ENODATA)
+		return ret1;
+	if (ret2 == -ENODATA)
+		return 0;
+	return ret2;
+}
+
+/*
+ * getsecurity: We are called if reading of security.capable
+ * failed.  Since that does not exist, check whether the
+ * security.nscapability exists.  If it does, convert the kuid
+ * into the caller's context and return it
+ */
+static int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer, bool alloc)
+{
+	int error, ret;
+	struct user_namespace *ns;
+	kuid_t kroot;
+	uid_t root;
+	char *tmpbuf = NULL;
+	bool foundroot = false;
+	struct vfs_ns_cap_data *nscap;
+	struct dentry *dentry;
+
+	if (!inode->i_op->getxattr)
+		return -EOPNOTSUPP;
+
+	/* TODO - do we want to return the capability with the rootid converted
+	 * if this is security.nscapability?  It's not critical, so just say
+	 * no for now. */
+	if (strcmp(name, "nscapability") == 0 && current_user_ns() != &init_user_ns)
+		return -EPERM;
+
+	if (strcmp(name, "capability") != 0)
+		return -EOPNOTSUPP;
+
+	dentry = d_find_alias(inode);
+	if (!dentry)
+		return -EINVAL;
+
+	ret = vfs_getxattr_alloc(dentry, "security.nscapability",
+			&tmpbuf, 0, GFP_NOFS);
+
+	if (ret != sizeof(struct vfs_ns_cap_data)) {
+		kfree(tmpbuf);
+		return -EOPNOTSUPP;
+	}
+
+	/* verify the uid maps to a ancestor root uid, if so convert
+	   this to a valid security.capability */
+	nscap = (struct vfs_ns_cap_data *) tmpbuf;
+	root = le32_to_cpu(nscap->rootid);
+	kroot = make_kuid(&init_user_ns, root);
+	for (ns = current_user_ns(); ; ns = ns->parent) {
+		if (from_kuid(ns, kroot) == 0) {
+			foundroot = true;
+			break;
+		}
+		if (ns == &init_user_ns)
+			break;
+	}
+	if (!foundroot) {
+		kfree(tmpbuf);
+		return -EOPNOTSUPP;
+	}
+
+	error = sizeof(struct vfs_cap_data);
+	if (alloc) {
+		*buffer = kmalloc(sizeof(struct vfs_cap_data), GFP_ATOMIC);
+		if (*buffer) {
+			struct vfs_ns_cap_data *cap = *buffer;
+			__le32 nsmagic, magic;
+			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+			nsmagic = le32_to_cpu(nscap->magic_etc);
+			magic = VFS_CAP_REVISION;
+			if (NS_CAPS_FLAGS(nsmagic) & VFS_NS_CAP_EFFECTIVE)
+				magic |= VFS_CAP_FLAGS_EFFECTIVE;
+			cap->magic_etc = cpu_to_le32(magic);
+		}
+	}
+	kfree(tmpbuf);
+	return error;
+}
+
+static bool dentry_has_nonns_capability(struct dentry *dentry)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	int error;
+
+	if (!inode || !inode->i_op->getxattr)
+		return false;
+	error = inode->i_op->getxattr(dentry, "security.capability", NULL, 0);
+	if (error >= 0)
+		return true;
+	return false;
+}
+
+/*
+ * Use requested a write of security.capability but is in a non-init
+ * userns.  So we construct and write a security.nscapability.
+ *
+ * If all is ok, wvalue has an allocated new value.  Otherwise, wvalue
+ * is NULL.
+ */
+void cap_setxattr_make_nscap(struct dentry *dentry, const void *value, size_t size,
+				    void **wvalue, size_t *wsize)
+{
+	struct vfs_ns_cap_data nscap;
+	const struct vfs_cap_data *cap = value;
+	__u32 magic, nsmagic;
+	struct user_namespace *ns = current_user_ns();
+	struct inode *inode = d_backing_inode(dentry);
+	kuid_t rootid;
+
+	if (!value || size != sizeof(struct vfs_cap_data))
+		return;
+	if (!inode || !capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+		return;
+
+	/* refuse if security.capability exists */
+	if (dentry_has_nonns_capability(dentry))
+		return;
+
+	rootid = make_kuid(ns, 0);
+	if (!uid_valid(rootid))
+		return;
+
+	nscap.rootid = cpu_to_le32(from_kuid(&init_user_ns, rootid));
+	nsmagic = VFS_NS_CAP_REVISION;
+	magic = le32_to_cpu(cap->magic_etc);
+	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
+		nsmagic |= VFS_NS_CAP_REVISION << 8;
+	nscap.magic_etc = cpu_to_le32(nsmagic);
+	memcpy(&nscap.data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+
+	*wsize = sizeof(struct vfs_ns_cap_data);
+	*wvalue = kmalloc(*wsize, GFP_ATOMIC);
+	if (!*wvalue)
+		return;
+	memcpy(*wvalue, &nscap, *wsize);
+	return;
 }
 
 /*
@@ -438,6 +587,68 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	return 0;
 }
 
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+	struct inode *inode = d_backing_inode(dentry);
+	unsigned i;
+	u32 magic_etc;
+	ssize_t size;
+	struct vfs_ns_cap_data nscap;
+	bool foundroot = false;
+	kuid_t kroot;
+	uid_t root;
+	struct user_namespace *ns;
+
+	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+	if (!inode || !inode->i_op->getxattr)
+		return -ENODATA;
+
+	size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+			&nscap, sizeof(nscap));
+	if (size == -ENODATA || size == -EOPNOTSUPP)
+		/* no data, that's ok */
+		return -ENODATA;
+	if (size < 0)
+		return size;
+	if (size != sizeof(nscap))
+		return -EINVAL;
+
+	root = le32_to_cpu(nscap.rootid);
+	kroot = make_kuid(&init_user_ns, root);
+	for (ns = current_user_ns(); ; ns = ns->parent) {
+		if (from_kuid(ns, kroot) == 0) {
+			foundroot = true;
+			break;
+		}
+		if (ns == &init_user_ns)
+			break;
+	}
+	if (!foundroot)
+		return -ENODATA;
+
+	magic_etc = le32_to_cpu(nscap.magic_etc);
+
+	if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
+		return -EINVAL;
+
+	cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+	if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
+		cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+	/* copy the entry */
+	CAP_FOR_EACH_U32(i) {
+		if (i >= VFS_CAP_U32_2)
+			break;
+		cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
+	}
+
+	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+	cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+	return 0;
+}
+
 /*
  * Attempt to get the on-exec apply capability sets for an executable file from
  * its xattrs and, if present, apply them to the proposed credentials being
@@ -457,10 +668,12 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 		return 0;
 
 	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+	if (rc == -ENODATA)
+		rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -657,6 +870,18 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
 	if (!strcmp(name, XATTR_NAME_CAPS)) {
+		if (current_user_ns() == &init_user_ns && !capable(CAP_SETFCAP))
+			return -EPERM;
+		/* for non-init userns we'll check permission later in
+		 * cap_setxattr_make_nscap() */
+		return 0;
+	}
+
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		/* only initial userns is allowed to set security.nscapability
+		 * directly.  We could be more flexible, but would need to
+		 * convert the rootid to target ns.  Defer.
+		 */
 		if (!capable(CAP_SETFCAP))
 			return -EPERM;
 		return 0;
@@ -682,12 +907,23 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+	struct inode *inode = d_backing_inode(dentry);
 	if (!strcmp(name, XATTR_NAME_CAPS)) {
 		if (!capable(CAP_SETFCAP))
 			return -EPERM;
 		return 0;
 	}
 
+	if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+		/* do allow root to clear this capability out if they really
+		 * want to */
+		if (!inode)
+			return -EINVAL;
+		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+			return -EPERM;
+		return 0;
+	}
+
 	if (!strncmp(name, XATTR_SECURITY_PREFIX,
 		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
 	    !capable(CAP_SYS_ADMIN))
@@ -1078,6 +1314,7 @@ struct security_hook_list capability_hooks[] = {
 	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
 	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
 	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
+	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
 	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
 	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
 	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-18 21:57                           ` [PATCH RFC] user-namespaced file capabilities - now with more magic Serge E. Hallyn
@ 2016-05-19 20:53                             ` Mimi Zohar
  2016-05-20  3:40                               ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Mimi Zohar @ 2016-05-19 20:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Jann Horn, Eric W. Biederman, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> This patch introduces a new security.nscapability xattr.  It
> is mostly like security.capability, but also lists a 'rootid'.
> This is the uid_t (in init_user_ns) of the root id (uid 0 in a
> namespace) in whose namespaces the file capabilities may take
> effect.
> 
> A privileged (cap_setfcap) process in the initial user ns may
> set and read this xattr directly.  However, its real intent is
> to be used as a transparent fallback in user namespaces.
> 
> Root in a user ns cannot be trusted to write security.capability
> xattrs, because any user on the host could map his own uid to root
> in a namespace, write the xattr, and execute the file with privilege
> on the host.
> 
> With this patch, when root in a user ns asks to write security.capability,
> the kernel will transparently write a security.nscapability xattr
> instead, filling in the kuid of the calling user's root uid.  Subsequently,
> any task executing the file which has the noted k_uid as its root uid,
> or which is in a descendent user_ns of such a user_ns, will run the
> file with capabilities.
> 
> When reading the security.capability xattr from a non-init user_ns, a valid
> security.nscapability will be shown if it exists.  Such a task is not
> allowed to read security.nscapability.  This could be accomodated, however

Add the word "directly" as "to read security.nscapability directly".

> it requires the kernel to convert the kuid_t to a valid uid in the reader's
> user_ns.  So for now it's simply not supported.

I really like the idea that the kernel transparently replaces
nscapability for capability.

> Only a single security.nscapability xattr may be written.  This patch
> could be expanded to allow a list of capabilities and rootids, however
> I do not believe that to be a worthwhile use case.

Ok

> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.
> 
> Note - listxattr is not being handled here.  So results of that can be
> inconsistent with get/setxattr.  Fixing that will require yet more
> deceit in fs/xattr.c.
> 
> Note2 - it may be less sneaky to hide all the magic behind the
> security.nscapability xattr.  So userspace would need to know to
> use that xattr name when needed, but with the same format as
> security.capability.  The kuid_t rootid would be filled in by the
> kernel.  That's a middle ground between my last patch and this one.

The less userspace needs to differentiate between running in a namespace
and not, the better.

Note3 - capability is currently protected by EVM, when enabled.  Should
ns_capability also be a protected xattr?

> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> ---
>  fs/xattr.c                      |  18 ++-
>  include/linux/capability.h      |   8 +-
>  include/uapi/linux/capability.h |  19 +++
>  include/uapi/linux/xattr.h      |   3 +
>  security/commoncap.c            | 253 ++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 291 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 4861322..5c0e7ae 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>  {
>  	struct inode *inode = dentry->d_inode;
>  	int error = -EOPNOTSUPP;
> +	void *wvalue = NULL;
> +	size_t wsize = 0;
>  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>  				   XATTR_SECURITY_PREFIX_LEN);
> 
> -	if (issec)
> +	if (issec) {
>  		inode->i_flags &= ~S_NOSEC;
> +		/* if root in a non-init user_ns tries to set
> +		 * security.capability, write a security.nscapability
> +		 * in its place */
> +		if (!strcmp(name, "security.capability") &&
> +				current_user_ns() != &init_user_ns) {
> +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> +			if (!wvalue)
> +				return -EPERM;
> +			value = wvalue;
> +			size = wsize;
> +			name = "security.nscapability";
> +		}

The call to capable_wrt_inode_uidgid() is hidden behind
cap_setxattr_make_nscap().  Does it make sense to call it here instead,
before the security.capability test?  This would lay the foundation for
doing something similar for IMA.

(Will continue reviewing ...)

Mimi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-19 20:53                             ` Mimi Zohar
@ 2016-05-20  3:40                               ` Serge E. Hallyn
  2016-05-20 11:19                                 ` Mimi Zohar
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-20  3:40 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Serge E. Hallyn, LKML, Jann Horn, Eric W. Biederman,
	Seth Forshee, LSM, Andrew G. Morgan, Kees Cook,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers

Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> > This patch introduces a new security.nscapability xattr.  It
> > is mostly like security.capability, but also lists a 'rootid'.
> > This is the uid_t (in init_user_ns) of the root id (uid 0 in a
> > namespace) in whose namespaces the file capabilities may take
> > effect.
> > 
> > A privileged (cap_setfcap) process in the initial user ns may
> > set and read this xattr directly.  However, its real intent is
> > to be used as a transparent fallback in user namespaces.
> > 
> > Root in a user ns cannot be trusted to write security.capability
> > xattrs, because any user on the host could map his own uid to root
> > in a namespace, write the xattr, and execute the file with privilege
> > on the host.
> > 
> > With this patch, when root in a user ns asks to write security.capability,
> > the kernel will transparently write a security.nscapability xattr
> > instead, filling in the kuid of the calling user's root uid.  Subsequently,
> > any task executing the file which has the noted k_uid as its root uid,
> > or which is in a descendent user_ns of such a user_ns, will run the
> > file with capabilities.
> > 
> > When reading the security.capability xattr from a non-init user_ns, a valid
> > security.nscapability will be shown if it exists.  Such a task is not
> > allowed to read security.nscapability.  This could be accomodated, however
> 
> Add the word "directly" as "to read security.nscapability directly".

Updated in my git tree.

> > it requires the kernel to convert the kuid_t to a valid uid in the reader's
> > user_ns.  So for now it's simply not supported.
> 
> I really like the idea that the kernel transparently replaces
> nscapability for capability.
> 
> > Only a single security.nscapability xattr may be written.  This patch
> > could be expanded to allow a list of capabilities and rootids, however
> > I do not believe that to be a worthwhile use case.
> 
> Ok
> 
> > This allows a simple setxattr to work, allows tar/untar to
> > work, and allows us to tar in one namespace and untar in
> > another while preserving the capability, without risking
> > leaking privilege into a parent namespace.
> > 
> > Note - listxattr is not being handled here.  So results of that can be
> > inconsistent with get/setxattr.  Fixing that will require yet more
> > deceit in fs/xattr.c.
> > 
> > Note2 - it may be less sneaky to hide all the magic behind the
> > security.nscapability xattr.  So userspace would need to know to
> > use that xattr name when needed, but with the same format as
> > security.capability.  The kuid_t rootid would be filled in by the
> > kernel.  That's a middle ground between my last patch and this one.
> 
> The less userspace needs to differentiate between running in a namespace
> and not, the better.
> 
> Note3 - capability is currently protected by EVM, when enabled.  Should
> ns_capability also be a protected xattr?

Hm - that would protect it from offline attacks, but allow the container
to update it, right?  That sounds good.

> > Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> > ---
> >  fs/xattr.c                      |  18 ++-
> >  include/linux/capability.h      |   8 +-
> >  include/uapi/linux/capability.h |  19 +++
> >  include/uapi/linux/xattr.h      |   3 +
> >  security/commoncap.c            | 253 ++++++++++++++++++++++++++++++++++++++--
> >  5 files changed, 291 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 4861322..5c0e7ae 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >  {
> >  	struct inode *inode = dentry->d_inode;
> >  	int error = -EOPNOTSUPP;
> > +	void *wvalue = NULL;
> > +	size_t wsize = 0;
> >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >  				   XATTR_SECURITY_PREFIX_LEN);
> > 
> > -	if (issec)
> > +	if (issec) {
> >  		inode->i_flags &= ~S_NOSEC;
> > +		/* if root in a non-init user_ns tries to set
> > +		 * security.capability, write a security.nscapability
> > +		 * in its place */
> > +		if (!strcmp(name, "security.capability") &&
> > +				current_user_ns() != &init_user_ns) {
> > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> > +			if (!wvalue)
> > +				return -EPERM;
> > +			value = wvalue;
> > +			size = wsize;
> > +			name = "security.nscapability";
> > +		}
> 
> The call to capable_wrt_inode_uidgid() is hidden behind
> cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> before the security.capability test?  This would lay the foundation for
> doing something similar for IMA.

Might make sense to move that.  Though looking at it with fresh eyes I wonder
whether adding less code here at __vfs_setxattr_noperm(), i.e.

		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
			return -EPERM;

would be cleaner.

> (Will continue reviewing ...)

Awesome, thanks Mimi.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20  3:40                               ` Serge E. Hallyn
@ 2016-05-20 11:19                                 ` Mimi Zohar
  2016-05-20 18:28                                   ` Eric W. Biederman
  0 siblings, 1 reply; 33+ messages in thread
From: Mimi Zohar @ 2016-05-20 11:19 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Jann Horn, Eric W. Biederman, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:

> > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > index 4861322..5c0e7ae 100644
> > > --- a/fs/xattr.c
> > > +++ b/fs/xattr.c
> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> > >  {
> > >  	struct inode *inode = dentry->d_inode;
> > >  	int error = -EOPNOTSUPP;
> > > +	void *wvalue = NULL;
> > > +	size_t wsize = 0;
> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> > >  				   XATTR_SECURITY_PREFIX_LEN);
> > > 
> > > -	if (issec)
> > > +	if (issec) {
> > >  		inode->i_flags &= ~S_NOSEC;
> > > +		/* if root in a non-init user_ns tries to set
> > > +		 * security.capability, write a security.nscapability
> > > +		 * in its place */
> > > +		if (!strcmp(name, "security.capability") &&
> > > +				current_user_ns() != &init_user_ns) {
> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> > > +			if (!wvalue)
> > > +				return -EPERM;
> > > +			value = wvalue;
> > > +			size = wsize;
> > > +			name = "security.nscapability";
> > > +		}
> > 
> > The call to capable_wrt_inode_uidgid() is hidden behind
> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> > before the security.capability test?  This would lay the foundation for
> > doing something similar for IMA.
> 
> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> 
> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> 			return -EPERM;
> 
> would be cleaner.

Yes, it would be cleaner,  but I'm suggesting you do all the hard work
making it generic.  Then the rest of us can follow your lead.  Its more
likely that you'll get it right.  At a high level, it might look like:

               /* Permit root in a non-init user_ns to modify the security
                 * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
                 */
                if ((current_user_ns() != &init_user_ns) &&
                        capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {

			if  security..capability
				call capability  /* set nscapability? */

			else if security.ima 
				call ima 	/* set ns_ima? */
		}

Mimi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 11:19                                 ` Mimi Zohar
@ 2016-05-20 18:28                                   ` Eric W. Biederman
  2016-05-20 19:09                                     ` Mimi Zohar
  2016-05-20 19:26                                     ` Serge E. Hallyn
  0 siblings, 2 replies; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-20 18:28 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Serge E. Hallyn, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

Mimi Zohar <zohar@linux.vnet.ibm.com> writes:

> On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
>> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
>> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
>
>> > > diff --git a/fs/xattr.c b/fs/xattr.c
>> > > index 4861322..5c0e7ae 100644
>> > > --- a/fs/xattr.c
>> > > +++ b/fs/xattr.c
>> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> > >  {
>> > >  	struct inode *inode = dentry->d_inode;
>> > >  	int error = -EOPNOTSUPP;
>> > > +	void *wvalue = NULL;
>> > > +	size_t wsize = 0;
>> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> > >  				   XATTR_SECURITY_PREFIX_LEN);
>> > > 
>> > > -	if (issec)
>> > > +	if (issec) {
>> > >  		inode->i_flags &= ~S_NOSEC;
>> > > +		/* if root in a non-init user_ns tries to set
>> > > +		 * security.capability, write a security.nscapability
>> > > +		 * in its place */
>> > > +		if (!strcmp(name, "security.capability") &&
>> > > +				current_user_ns() != &init_user_ns) {
>> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
>> > > +			if (!wvalue)
>> > > +				return -EPERM;
>> > > +			value = wvalue;
>> > > +			size = wsize;
>> > > +			name = "security.nscapability";
>> > > +		}
>> > 
>> > The call to capable_wrt_inode_uidgid() is hidden behind
>> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
>> > before the security.capability test?  This would lay the foundation for
>> > doing something similar for IMA.
>> 
>> Might make sense to move that.  Though looking at it with fresh eyes I wonder
>> whether adding less code here at __vfs_setxattr_noperm(), i.e.
>> 
>> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
>> 			return -EPERM;
>> 
>> would be cleaner.
>
> Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> making it generic.  Then the rest of us can follow your lead.  Its more
> likely that you'll get it right.  At a high level, it might look like:
>
>                /* Permit root in a non-init user_ns to modify the security
>                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
>                  */
>                 if ((current_user_ns() != &init_user_ns) &&
>                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
>
> 			if  security..capability
> 				call capability  /* set nscapability? */
>
> 			else if security.ima 
> 				call ima 	/* set ns_ima? */
> 		}

Hmm.  I am confused about this part of the strategy.

I don't understand the capability vs nscapability distinction.  It seems
to add complexity without benefit.

If I am in a nested user namespace and I try to write a capability on a
file and it has nscapability I can't be allowed to (as a more privileged
user namespace already wrote it).

Not rewriting an existing attribute seems to be the only benefit I can
see to have both a capability and a nscapability attribute vs having
a new version of the capability attribute.

Am I missing something here?

Mimi as for generalizing the code for handling IMA I expect it makes
sense to refactor the code to have shared library functions (or whatever
it takes to generalize the code) when you are ready to implement this
kind of IMA attribute.  That way in the first implementation we can
concentrate on getting the code clean and the sematics correct.

That alone seems to be taking a while.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 18:28                                   ` Eric W. Biederman
@ 2016-05-20 19:09                                     ` Mimi Zohar
  2016-05-20 19:11                                       ` Eric W. Biederman
  2016-05-20 19:26                                     ` Serge E. Hallyn
  1 sibling, 1 reply; 33+ messages in thread
From: Mimi Zohar @ 2016-05-20 19:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

On Fri, 2016-05-20 at 13:28 -0500, Eric W. Biederman wrote:
> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
> 
> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> >
> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
> >> > > index 4861322..5c0e7ae 100644
> >> > > --- a/fs/xattr.c
> >> > > +++ b/fs/xattr.c
> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >> > >  {
> >> > >  	struct inode *inode = dentry->d_inode;
> >> > >  	int error = -EOPNOTSUPP;
> >> > > +	void *wvalue = NULL;
> >> > > +	size_t wsize = 0;
> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
> >> > > 
> >> > > -	if (issec)
> >> > > +	if (issec) {
> >> > >  		inode->i_flags &= ~S_NOSEC;
> >> > > +		/* if root in a non-init user_ns tries to set
> >> > > +		 * security.capability, write a security.nscapability
> >> > > +		 * in its place */
> >> > > +		if (!strcmp(name, "security.capability") &&
> >> > > +				current_user_ns() != &init_user_ns) {
> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> >> > > +			if (!wvalue)
> >> > > +				return -EPERM;
> >> > > +			value = wvalue;
> >> > > +			size = wsize;
> >> > > +			name = "security.nscapability";
> >> > > +		}
> >> > 
> >> > The call to capable_wrt_inode_uidgid() is hidden behind
> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> >> > before the security.capability test?  This would lay the foundation for
> >> > doing something similar for IMA.
> >> 
> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> >> 
> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> >> 			return -EPERM;
> >> 
> >> would be cleaner.
> >
> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> > making it generic.  Then the rest of us can follow your lead.  Its more
> > likely that you'll get it right.  At a high level, it might look like:
> >
> >                /* Permit root in a non-init user_ns to modify the security
> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
> >                  */
> >                 if ((current_user_ns() != &init_user_ns) &&
> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
> >
> > 			if  security..capability
> > 				call capability  /* set nscapability? */
> >
> > 			else if security.ima 
> > 				call ima 	/* set ns_ima? */
> > 		}
> 
> Hmm.  I am confused about this part of the strategy.
> 
> I don't understand the capability vs nscapability distinction.  It seems
> to add complexity without benefit.

Only real root can write security xattrs, which prevents root in a
namespace from writing the included security xattrs in a tar package
from being installed.  Nobody is suggesting changing this behavior.
Serge's solution is to allow an equivalent xattr to be written.  For
capabilities, it would be ns_capability.   Similarly, IMA would write
security.ns_ima.  (I'm not sure about SELinux or Smack.)

> If I am in a nested user namespace and I try to write a capability on a
> file and it has nscapability I can't be allowed to (as a more privileged
> user namespace already wrote it).
> 
> Not rewriting an existing attribute seems to be the only benefit I can
> see to have both a capability and a nscapability attribute vs having
> a new version of the capability attribute.
> 
> Am I missing something here?

Only real root in the namespace can write the equivalent security xattr.
I'm hoping this can be done without having to modify userspace apps.  (I
hope that answers your question.)

> Mimi as for generalizing the code for handling IMA I expect it makes
> sense to refactor the code to have shared library functions (or whatever
> it takes to generalize the code) when you are ready to implement this
> kind of IMA attribute.  That way in the first implementation we can
> concentrate on getting the code clean and the sematics correct.
> 
> That alone seems to be taking a while.

There should be a generic solution that works for security xattrs, not
just capabilities.

Mimi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 19:09                                     ` Mimi Zohar
@ 2016-05-20 19:11                                       ` Eric W. Biederman
  0 siblings, 0 replies; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-20 19:11 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Serge E. Hallyn, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

Mimi Zohar <zohar@linux.vnet.ibm.com> writes:

> On Fri, 2016-05-20 at 13:28 -0500, Eric W. Biederman wrote:
>> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
>> 
>> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
>> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
>> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
>> >
>> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
>> >> > > index 4861322..5c0e7ae 100644
>> >> > > --- a/fs/xattr.c
>> >> > > +++ b/fs/xattr.c
>> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >> > >  {
>> >> > >  	struct inode *inode = dentry->d_inode;
>> >> > >  	int error = -EOPNOTSUPP;
>> >> > > +	void *wvalue = NULL;
>> >> > > +	size_t wsize = 0;
>> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
>> >> > > 
>> >> > > -	if (issec)
>> >> > > +	if (issec) {
>> >> > >  		inode->i_flags &= ~S_NOSEC;
>> >> > > +		/* if root in a non-init user_ns tries to set
>> >> > > +		 * security.capability, write a security.nscapability
>> >> > > +		 * in its place */
>> >> > > +		if (!strcmp(name, "security.capability") &&
>> >> > > +				current_user_ns() != &init_user_ns) {
>> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
>> >> > > +			if (!wvalue)
>> >> > > +				return -EPERM;
>> >> > > +			value = wvalue;
>> >> > > +			size = wsize;
>> >> > > +			name = "security.nscapability";
>> >> > > +		}
>> >> > 
>> >> > The call to capable_wrt_inode_uidgid() is hidden behind
>> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
>> >> > before the security.capability test?  This would lay the foundation for
>> >> > doing something similar for IMA.
>> >> 
>> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
>> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
>> >> 
>> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
>> >> 			return -EPERM;
>> >> 
>> >> would be cleaner.
>> >
>> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
>> > making it generic.  Then the rest of us can follow your lead.  Its more
>> > likely that you'll get it right.  At a high level, it might look like:
>> >
>> >                /* Permit root in a non-init user_ns to modify the security
>> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
>> >                  */
>> >                 if ((current_user_ns() != &init_user_ns) &&
>> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
>> >
>> > 			if  security..capability
>> > 				call capability  /* set nscapability? */
>> >
>> > 			else if security.ima 
>> > 				call ima 	/* set ns_ima? */
>> > 		}
>> 
>> Hmm.  I am confused about this part of the strategy.
>> 
>> I don't understand the capability vs nscapability distinction.  It seems
>> to add complexity without benefit.
>
> Only real root can write security xattrs, which prevents root in a
> namespace from writing the included security xattrs in a tar package
> from being installed.  Nobody is suggesting changing this behavior.
> Serge's solution is to allow an equivalent xattr to be written.  For
> capabilities, it would be ns_capability.   Similarly, IMA would write
> security.ns_ima.  (I'm not sure about SELinux or Smack.)

nscapability is an xattr in the security namespace.  So root in a
namespace can write to a specific security xattr.

I am saying that at least at a quick examinination that having two
attributes that control the same things appears to be a mistake, as it
just makes it easy for them to get out of sync and generally be
confusing.

>> If I am in a nested user namespace and I try to write a capability on a
>> file and it has nscapability I can't be allowed to (as a more privileged
>> user namespace already wrote it).
>> 
>> Not rewriting an existing attribute seems to be the only benefit I can
>> see to have both a capability and a nscapability attribute vs having
>> a new version of the capability attribute.
>> 
>> Am I missing something here?
>
> Only real root in the namespace can write the equivalent security xattr.
> I'm hoping this can be done without having to modify userspace apps.  (I
> hope that answers your question.)

But it is possible and actually desirable to have user namespaces nested
in user namespaces nested in user namespaces.  Which means the existing
xattr must be examined before being written.

At which point I don't see a gain by having both a capability and an
nscapability attribute.

>> Mimi as for generalizing the code for handling IMA I expect it makes
>> sense to refactor the code to have shared library functions (or whatever
>> it takes to generalize the code) when you are ready to implement this
>> kind of IMA attribute.  That way in the first implementation we can
>> concentrate on getting the code clean and the sematics correct.
>> 
>> That alone seems to be taking a while.
>
> There should be a generic solution that works for security xattrs, not
> just capabilities.

However we don't have to start there.  We can think about it but we
probably should not start there.

We certainly need infrastructure to support this case.  Security modules
deliberately can be very unique so in general it is not possible to come
up with a design for all cases.  Especially as nesting of user
namespaces does not imply nesting of security module contexts.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 18:28                                   ` Eric W. Biederman
  2016-05-20 19:09                                     ` Mimi Zohar
@ 2016-05-20 19:26                                     ` Serge E. Hallyn
  2016-05-20 19:42                                       ` Eric W. Biederman
  1 sibling, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-20 19:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Mimi Zohar, Serge E. Hallyn, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
> 
> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> >
> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
> >> > > index 4861322..5c0e7ae 100644
> >> > > --- a/fs/xattr.c
> >> > > +++ b/fs/xattr.c
> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >> > >  {
> >> > >  	struct inode *inode = dentry->d_inode;
> >> > >  	int error = -EOPNOTSUPP;
> >> > > +	void *wvalue = NULL;
> >> > > +	size_t wsize = 0;
> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
> >> > > 
> >> > > -	if (issec)
> >> > > +	if (issec) {
> >> > >  		inode->i_flags &= ~S_NOSEC;
> >> > > +		/* if root in a non-init user_ns tries to set
> >> > > +		 * security.capability, write a security.nscapability
> >> > > +		 * in its place */
> >> > > +		if (!strcmp(name, "security.capability") &&
> >> > > +				current_user_ns() != &init_user_ns) {
> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> >> > > +			if (!wvalue)
> >> > > +				return -EPERM;
> >> > > +			value = wvalue;
> >> > > +			size = wsize;
> >> > > +			name = "security.nscapability";
> >> > > +		}
> >> > 
> >> > The call to capable_wrt_inode_uidgid() is hidden behind
> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> >> > before the security.capability test?  This would lay the foundation for
> >> > doing something similar for IMA.
> >> 
> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> >> 
> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> >> 			return -EPERM;
> >> 
> >> would be cleaner.
> >
> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> > making it generic.  Then the rest of us can follow your lead.  Its more
> > likely that you'll get it right.  At a high level, it might look like:
> >
> >                /* Permit root in a non-init user_ns to modify the security
> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
> >                  */
> >                 if ((current_user_ns() != &init_user_ns) &&
> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
> >
> > 			if  security..capability
> > 				call capability  /* set nscapability? */
> >
> > 			else if security.ima 
> > 				call ima 	/* set ns_ima? */
> > 		}
> 
> Hmm.  I am confused about this part of the strategy.
> 
> I don't understand the capability vs nscapability distinction.  It seems
> to add complexity without benefit.

...  Well, yes, we could simply make a new version of security.capability
xattr, and make rootid == 0 mean it was written by the init_user_ns.  Is
that what you mean?

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 19:26                                     ` Serge E. Hallyn
@ 2016-05-20 19:42                                       ` Eric W. Biederman
  2016-05-20 19:59                                         ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Eric W. Biederman @ 2016-05-20 19:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Mimi Zohar, LKML, Jann Horn, Seth Forshee, LSM, Andrew G. Morgan,
	Kees Cook, Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
>> 
>> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
>> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
>> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
>> >
>> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
>> >> > > index 4861322..5c0e7ae 100644
>> >> > > --- a/fs/xattr.c
>> >> > > +++ b/fs/xattr.c
>> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
>> >> > >  {
>> >> > >  	struct inode *inode = dentry->d_inode;
>> >> > >  	int error = -EOPNOTSUPP;
>> >> > > +	void *wvalue = NULL;
>> >> > > +	size_t wsize = 0;
>> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
>> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
>> >> > > 
>> >> > > -	if (issec)
>> >> > > +	if (issec) {
>> >> > >  		inode->i_flags &= ~S_NOSEC;
>> >> > > +		/* if root in a non-init user_ns tries to set
>> >> > > +		 * security.capability, write a security.nscapability
>> >> > > +		 * in its place */
>> >> > > +		if (!strcmp(name, "security.capability") &&
>> >> > > +				current_user_ns() != &init_user_ns) {
>> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
>> >> > > +			if (!wvalue)
>> >> > > +				return -EPERM;
>> >> > > +			value = wvalue;
>> >> > > +			size = wsize;
>> >> > > +			name = "security.nscapability";
>> >> > > +		}
>> >> > 
>> >> > The call to capable_wrt_inode_uidgid() is hidden behind
>> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
>> >> > before the security.capability test?  This would lay the foundation for
>> >> > doing something similar for IMA.
>> >> 
>> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
>> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
>> >> 
>> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
>> >> 			return -EPERM;
>> >> 
>> >> would be cleaner.
>> >
>> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
>> > making it generic.  Then the rest of us can follow your lead.  Its more
>> > likely that you'll get it right.  At a high level, it might look like:
>> >
>> >                /* Permit root in a non-init user_ns to modify the security
>> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
>> >                  */
>> >                 if ((current_user_ns() != &init_user_ns) &&
>> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
>> >
>> > 			if  security..capability
>> > 				call capability  /* set nscapability? */
>> >
>> > 			else if security.ima 
>> > 				call ima 	/* set ns_ima? */
>> > 		}
>> 
>> Hmm.  I am confused about this part of the strategy.
>> 
>> I don't understand the capability vs nscapability distinction.  It seems
>> to add complexity without benefit.
>
> ...  Well, yes, we could simply make a new version of security.capability
> xattr, and make rootid == 0 mean it was written by the init_user_ns.  Is
> that what you mean?

Yes.

That would seem to simplify the logic to ensure the policy we enforce is
consistent with what is on disk.

Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 19:42                                       ` Eric W. Biederman
@ 2016-05-20 19:59                                         ` Serge E. Hallyn
  2016-05-20 23:23                                           ` Mimi Zohar
  0 siblings, 1 reply; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-20 19:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Mimi Zohar, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
> >> 
> >> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> >> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> >> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> >> >
> >> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
> >> >> > > index 4861322..5c0e7ae 100644
> >> >> > > --- a/fs/xattr.c
> >> >> > > +++ b/fs/xattr.c
> >> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> >> >> > >  {
> >> >> > >  	struct inode *inode = dentry->d_inode;
> >> >> > >  	int error = -EOPNOTSUPP;
> >> >> > > +	void *wvalue = NULL;
> >> >> > > +	size_t wsize = 0;
> >> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> >> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
> >> >> > > 
> >> >> > > -	if (issec)
> >> >> > > +	if (issec) {
> >> >> > >  		inode->i_flags &= ~S_NOSEC;
> >> >> > > +		/* if root in a non-init user_ns tries to set
> >> >> > > +		 * security.capability, write a security.nscapability
> >> >> > > +		 * in its place */
> >> >> > > +		if (!strcmp(name, "security.capability") &&
> >> >> > > +				current_user_ns() != &init_user_ns) {
> >> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> >> >> > > +			if (!wvalue)
> >> >> > > +				return -EPERM;
> >> >> > > +			value = wvalue;
> >> >> > > +			size = wsize;
> >> >> > > +			name = "security.nscapability";
> >> >> > > +		}
> >> >> > 
> >> >> > The call to capable_wrt_inode_uidgid() is hidden behind
> >> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> >> >> > before the security.capability test?  This would lay the foundation for
> >> >> > doing something similar for IMA.
> >> >> 
> >> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> >> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> >> >> 
> >> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> >> >> 			return -EPERM;
> >> >> 
> >> >> would be cleaner.
> >> >
> >> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> >> > making it generic.  Then the rest of us can follow your lead.  Its more
> >> > likely that you'll get it right.  At a high level, it might look like:
> >> >
> >> >                /* Permit root in a non-init user_ns to modify the security
> >> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
> >> >                  */
> >> >                 if ((current_user_ns() != &init_user_ns) &&
> >> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
> >> >
> >> > 			if  security..capability
> >> > 				call capability  /* set nscapability? */
> >> >
> >> > 			else if security.ima 
> >> > 				call ima 	/* set ns_ima? */
> >> > 		}
> >> 
> >> Hmm.  I am confused about this part of the strategy.
> >> 
> >> I don't understand the capability vs nscapability distinction.  It seems
> >> to add complexity without benefit.
> >
> > ...  Well, yes, we could simply make a new version of security.capability
> > xattr, and make rootid == 0 mean it was written by the init_user_ns.  Is
> > that what you mean?
> 
> Yes.
> 
> That would seem to simplify the logic to ensure the policy we enforce is
> consistent with what is on disk.

I'll give that a shot.  I think the reason I did it this way was that I'm
still kind of stuck in the not-magic way of thinking about it.  But yeah
with the kernel magically writing inthe kuid there's probably no reason not
to.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 19:59                                         ` Serge E. Hallyn
@ 2016-05-20 23:23                                           ` Mimi Zohar
  2016-05-20 23:32                                             ` Serge E. Hallyn
  0 siblings, 1 reply; 33+ messages in thread
From: Mimi Zohar @ 2016-05-20 23:23 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, LKML, Jann Horn, Seth Forshee, LSM,
	Andrew G. Morgan, Kees Cook, Michael Kerrisk-manpages,
	Serge E. Hallyn, Linux API, Andy Lutomirski, Linux Containers

On Fri, 2016-05-20 at 14:59 -0500, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> > 
> > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > >> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
> > >> 
> > >> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> > >> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> > >> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> > >> >
> > >> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
> > >> >> > > index 4861322..5c0e7ae 100644
> > >> >> > > --- a/fs/xattr.c
> > >> >> > > +++ b/fs/xattr.c
> > >> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> > >> >> > >  {
> > >> >> > >  	struct inode *inode = dentry->d_inode;
> > >> >> > >  	int error = -EOPNOTSUPP;
> > >> >> > > +	void *wvalue = NULL;
> > >> >> > > +	size_t wsize = 0;
> > >> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> > >> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
> > >> >> > > 
> > >> >> > > -	if (issec)
> > >> >> > > +	if (issec) {
> > >> >> > >  		inode->i_flags &= ~S_NOSEC;
> > >> >> > > +		/* if root in a non-init user_ns tries to set
> > >> >> > > +		 * security.capability, write a security.nscapability
> > >> >> > > +		 * in its place */
> > >> >> > > +		if (!strcmp(name, "security.capability") &&
> > >> >> > > +				current_user_ns() != &init_user_ns) {
> > >> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> > >> >> > > +			if (!wvalue)
> > >> >> > > +				return -EPERM;
> > >> >> > > +			value = wvalue;
> > >> >> > > +			size = wsize;
> > >> >> > > +			name = "security.nscapability";
> > >> >> > > +		}
> > >> >> > 
> > >> >> > The call to capable_wrt_inode_uidgid() is hidden behind
> > >> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> > >> >> > before the security.capability test?  This would lay the foundation for
> > >> >> > doing something similar for IMA.
> > >> >> 
> > >> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> > >> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> > >> >> 
> > >> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> > >> >> 			return -EPERM;
> > >> >> 
> > >> >> would be cleaner.
> > >> >
> > >> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> > >> > making it generic.  Then the rest of us can follow your lead.  Its more
> > >> > likely that you'll get it right.  At a high level, it might look like:
> > >> >
> > >> >                /* Permit root in a non-init user_ns to modify the security
> > >> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
> > >> >                  */
> > >> >                 if ((current_user_ns() != &init_user_ns) &&
> > >> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
> > >> >
> > >> > 			if  security..capability
> > >> > 				call capability  /* set nscapability? */
> > >> >
> > >> > 			else if security.ima 
> > >> > 				call ima 	/* set ns_ima? */
> > >> > 		}
> > >> 
> > >> Hmm.  I am confused about this part of the strategy.
> > >> 
> > >> I don't understand the capability vs nscapability distinction.  It seems
> > >> to add complexity without benefit.
> > >
> > > ...  Well, yes, we could simply make a new version of security.capability
> > > xattr, and make rootid == 0 mean it was written by the init_user_ns.  Is
> > > that what you mean?
> > 
> > Yes.
> > 
> > That would seem to simplify the logic to ensure the policy we enforce is
> > consistent with what is on disk.
> 
> I'll give that a shot.  I think the reason I did it this way was that I'm
> still kind of stuck in the not-magic way of thinking about it.  But yeah
> with the kernel magically writing inthe kuid there's probably no reason not
> to.

Totally confused.  Will this method allow multiple instances of the
xattr on disk? 

Mimi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC] user-namespaced file capabilities - now with more magic
  2016-05-20 23:23                                           ` Mimi Zohar
@ 2016-05-20 23:32                                             ` Serge E. Hallyn
  0 siblings, 0 replies; 33+ messages in thread
From: Serge E. Hallyn @ 2016-05-20 23:32 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Serge E. Hallyn, Eric W. Biederman, LKML, Jann Horn,
	Seth Forshee, LSM, Andrew G. Morgan, Kees Cook,
	Michael Kerrisk-manpages, Serge E. Hallyn, Linux API,
	Andy Lutomirski, Linux Containers

Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> On Fri, 2016-05-20 at 14:59 -0500, Serge E. Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > "Serge E. Hallyn" <serge@hallyn.com> writes:
> > > 
> > > > Quoting Eric W. Biederman (ebiederm@xmission.com):
> > > >> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:
> > > >> 
> > > >> > On Thu, 2016-05-19 at 22:40 -0500, Serge E. Hallyn wrote:
> > > >> >> Quoting Mimi Zohar (zohar@linux.vnet.ibm.com):
> > > >> >> > On Wed, 2016-05-18 at 16:57 -0500, Serge E. Hallyn wrote:
> > > >> >
> > > >> >> > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > >> >> > > index 4861322..5c0e7ae 100644
> > > >> >> > > --- a/fs/xattr.c
> > > >> >> > > +++ b/fs/xattr.c
> > > >> >> > > @@ -94,11 +94,26 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
> > > >> >> > >  {
> > > >> >> > >  	struct inode *inode = dentry->d_inode;
> > > >> >> > >  	int error = -EOPNOTSUPP;
> > > >> >> > > +	void *wvalue = NULL;
> > > >> >> > > +	size_t wsize = 0;
> > > >> >> > >  	int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
> > > >> >> > >  				   XATTR_SECURITY_PREFIX_LEN);
> > > >> >> > > 
> > > >> >> > > -	if (issec)
> > > >> >> > > +	if (issec) {
> > > >> >> > >  		inode->i_flags &= ~S_NOSEC;
> > > >> >> > > +		/* if root in a non-init user_ns tries to set
> > > >> >> > > +		 * security.capability, write a security.nscapability
> > > >> >> > > +		 * in its place */
> > > >> >> > > +		if (!strcmp(name, "security.capability") &&
> > > >> >> > > +				current_user_ns() != &init_user_ns) {
> > > >> >> > > +			cap_setxattr_make_nscap(dentry, value, size, &wvalue, &wsize);
> > > >> >> > > +			if (!wvalue)
> > > >> >> > > +				return -EPERM;
> > > >> >> > > +			value = wvalue;
> > > >> >> > > +			size = wsize;
> > > >> >> > > +			name = "security.nscapability";
> > > >> >> > > +		}
> > > >> >> > 
> > > >> >> > The call to capable_wrt_inode_uidgid() is hidden behind
> > > >> >> > cap_setxattr_make_nscap().  Does it make sense to call it here instead,
> > > >> >> > before the security.capability test?  This would lay the foundation for
> > > >> >> > doing something similar for IMA.
> > > >> >> 
> > > >> >> Might make sense to move that.  Though looking at it with fresh eyes I wonder
> > > >> >> whether adding less code here at __vfs_setxattr_noperm(), i.e.
> > > >> >> 
> > > >> >> 		if (!cap_setxattr_makenscap(dentry, &value, &size, &name))
> > > >> >> 			return -EPERM;
> > > >> >> 
> > > >> >> would be cleaner.
> > > >> >
> > > >> > Yes, it would be cleaner,  but I'm suggesting you do all the hard work
> > > >> > making it generic.  Then the rest of us can follow your lead.  Its more
> > > >> > likely that you'll get it right.  At a high level, it might look like:
> > > >> >
> > > >> >                /* Permit root in a non-init user_ns to modify the security
> > > >> >                  * namespace xattr equivalents (eg. nscapability, ns_ima, etc). 
> > > >> >                  */
> > > >> >                 if ((current_user_ns() != &init_user_ns) &&
> > > >> >                         capable_wrt_inode_uidgid(inode, CAP_SETFCAP)) {
> > > >> >
> > > >> > 			if  security..capability
> > > >> > 				call capability  /* set nscapability? */
> > > >> >
> > > >> > 			else if security.ima 
> > > >> > 				call ima 	/* set ns_ima? */
> > > >> > 		}
> > > >> 
> > > >> Hmm.  I am confused about this part of the strategy.
> > > >> 
> > > >> I don't understand the capability vs nscapability distinction.  It seems
> > > >> to add complexity without benefit.
> > > >
> > > > ...  Well, yes, we could simply make a new version of security.capability
> > > > xattr, and make rootid == 0 mean it was written by the init_user_ns.  Is
> > > > that what you mean?
> > > 
> > > Yes.
> > > 
> > > That would seem to simplify the logic to ensure the policy we enforce is
> > > consistent with what is on disk.
> > 
> > I'll give that a shot.  I think the reason I did it this way was that I'm
> > still kind of stuck in the not-magic way of thinking about it.  But yeah
> > with the kernel magically writing inthe kuid there's probably no reason not
> > to.
> 
> Totally confused.  Will this method allow multiple instances of the
> xattr on disk? 

No, but we don't actually want that anyway.  The current behavior for
security.capability is that it works in all user namespaces.  So we
want to continue the behavior that if root in the init_user_ns sets a
capability, that works in all namespaces.  Allowing other namespaces
to set the capability would only be confusing.

So in the patchset I had, security.capability can only be set by
init_user_ns but works in all namespaces.  security.nscapability
cannot be set if secrity.capability is set.  And security.nscapability
works in all child namespaces of the root uid which set the cap.

-serge

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-05-20 23:32 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-22 17:26 namespaced file capabilities serge.hallyn
2016-04-22 17:26 ` [PATCH 1/1] simplified security.nscapability xattr serge.hallyn
2016-04-26 19:46   ` Seth Forshee
2016-04-26 21:59   ` Kees Cook
2016-04-26 22:26     ` Serge E. Hallyn
2016-04-26 22:39       ` Kees Cook
2016-04-27  4:39         ` Serge E. Hallyn
2016-04-27  8:09         ` Jann Horn
2016-05-02  3:54         ` Serge E. Hallyn
2016-05-02 18:31           ` Michael Kerrisk (man-pages)
2016-05-02 21:31           ` Eric W. Biederman
     [not found]             ` <CALQRfL7mfpyudWs4Z8W5Zi8CTG-9O0OvrCnRU7pk0MXtsLBd0A@mail.gmail.com>
2016-05-03  4:50               ` Eric W. Biederman
2016-05-10 19:00                 ` Serge E. Hallyn
2016-05-03  5:19               ` Serge E. Hallyn
2016-05-03  5:54                 ` Eric W. Biederman
2016-05-03 14:25                   ` Serge E. Hallyn
2016-05-10 19:03                     ` Serge E. Hallyn
2016-05-07 23:10                   ` Jann Horn
2016-05-11 21:02                     ` Serge E. Hallyn
2016-05-16 21:15                       ` Serge E. Hallyn
2016-05-16 21:48                         ` Serge E. Hallyn
2016-05-18 21:57                           ` [PATCH RFC] user-namespaced file capabilities - now with more magic Serge E. Hallyn
2016-05-19 20:53                             ` Mimi Zohar
2016-05-20  3:40                               ` Serge E. Hallyn
2016-05-20 11:19                                 ` Mimi Zohar
2016-05-20 18:28                                   ` Eric W. Biederman
2016-05-20 19:09                                     ` Mimi Zohar
2016-05-20 19:11                                       ` Eric W. Biederman
2016-05-20 19:26                                     ` Serge E. Hallyn
2016-05-20 19:42                                       ` Eric W. Biederman
2016-05-20 19:59                                         ` Serge E. Hallyn
2016-05-20 23:23                                           ` Mimi Zohar
2016-05-20 23:32                                             ` Serge E. Hallyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).